matsengrp / phip-flow

A Nextflow pipeline to align, merge, and organize large PhIP-Seq datasets
MIT License

Replicate Oligo nt Sequence Behavior #37

Closed jgallowa07 closed 1 year ago

jgallowa07 commented 2 years ago

@sminot pointed out an error occurring when running the virscan library. Unfortunately, this error is slightly more nuanced than a simple formatting error.

Running phipflow
Traceback (most recent call last):
File "/usr/local/bin/phipflow", line 33, in <module>
sys.exit(load_entry_point('phippery==0.1', 'console_scripts', 'phipflow')())
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/site-packages/click-8.0.3-py3.7.egg/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/phippery-0.1-py3.7.egg/phippery/cli.py", line 137, in load_from_counts_tsv
File "/usr/local/lib/python3.7/site-packages/phippery-0.1-py3.7.egg/phippery/phipdata.py", line 81, in stitch_dataset
File "/usr/local/lib/python3.7/site-packages/pandas-1.3.4-py3.7-linux-x86_64.egg/pandas/core/ops/common.py", line 69, in new_method
return method(self, other)
File "/usr/local/lib/python3.7/site-packages/pandas-1.3.4-py3.7-linux-x86_64.egg/pandas/core/arraylike.py", line 32, in __eq__
return self._cmp_method(other, operator.eq)
File "/usr/local/lib/python3.7/site-packages/pandas-1.3.4-py3.7-linux-x86_64.egg/pandas/core/indexes/base.py", line 6042, in _cmp_method
raise ValueError("Lengths must match to compare")
ValueError: Lengths must match to compare

TLDR

If you want oligo sequence replicates to have the exact same alignment counts, one simple solution is to:

  1. reset the peptide_id so that each row gets a unique integer.
  2. add the -a parameter to the alignment options.
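Step 1 can be sketched in a few lines of pandas (the `peptide_id` and `oligo` column names here are hypothetical stand-ins for the real peptide table schema); step 2 happens in the aligner options, outside Python:

```python
import pandas as pd

# Hypothetical peptide table where replicate oligo sequences share an id.
peptide_table = pd.DataFrame({
    "peptide_id": [0, 1, 1],
    "oligo": ["ATGC", "GGCC", "GGCC"],
})

# Step 1: reset peptide_id so every row gets its own unique integer,
# even when the oligo sequence is a replicate of another row.
peptide_table["peptide_id"] = range(len(peptide_table))
print(peptide_table["peptide_id"].tolist())  # [0, 1, 2]
```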

More Detail

This really comes down to repeated oligonucleotide sequences in the peptide libraries, and how we should expect the pipeline to report counts for them. The error is raised because, in the collect_phip_data processing step, the phippery Python API first merges the counts files (i.e. the two tsv's coming from the sam_to_counts idxstats command) and then expects the merged counts (now an enrichment matrix) to have shape exactly ( len(peptide_table) , len(sample_table) ). In the case of your example data, the counts files have exactly 115753 entries each, while the original vir3 peptide table has 128257 entries.

This discrepancy makes sense: the formatting of the current virscan lib makes peptide id's a one-to-one mapping with unique oligo sequences -- which is a very reasonable thing to do. However, the pipeline currently uses those peptide id's to create headers in the fasta file used for building an index and later retrieving alignment count statistics for each id. It does this naively, meaning each row in the peptide table gets a fasta entry. The index built from this fasta then treats the replicate entries as identical, and reports an alignment for only one of those unique id's.
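The naive fasta generation described above can be sketched as follows (hypothetical toy data): every peptide-table row becomes its own fasta record, so replicate oligos end up as duplicate-sequence entries under different headers.

```python
# Toy peptide rows: ids 1 and 2 carry the same oligo sequence.
rows = [(0, "ATGC"), (1, "GGCC"), (2, "GGCC")]

# One fasta entry per row, headed by the peptide id -- the duplicate
# sequence appears twice, which the aligner index then collapses.
fasta = "".join(f">{pid}\n{seq}\n" for pid, seq in rows)
print(fasta)
```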

Options for handling repeat sequences

Option 1 - The counts will be equal (replicated) for each identical sequence whenever a single alignment is reported. At first this may seem like the desired behavior because (I assume - but could be wrong) sequences are de-duplicated before the oligo sequences are synthesized into a library. So, one might think that because the sequence is homogeneous across certain strains, we should count it as a binding event for all peptide tile homologs in the face of ambiguity.

small note: one other thing that comes to mind with this approach is the possibility of functional homologs - though maybe this is dealt with during library design as well.

Option 2 - Another opinion that has been expressed is that we should randomly distribute the number of measured antibody-peptide binding events (i.e. alignments) across the sequence replicates when evaluating the data. In other words, one alignment yields one count of enrichment across the entire library. The argument is that these measurements for any one peptide in the assay are not independent, and thus we don't want to artificially inflate counts for any single measurement. This is the approach we have taken so far with the Overbaugh lab. However, I haven't really had to dig into this behavior with these smaller "artisan" phage libraries because we have a low abundance of sequence repeats. For reference, the VirScan library has a mean of 1.1 and a max of 53 sequence replicates.
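Option 2 can be sketched as follows (the read count and replicate ids are hypothetical): each alignment that hits a group of replicate sequences equally well is assigned to exactly one replicate at random, so the group's total count equals the number of alignments.

```python
import random

random.seed(0)  # deterministic for the example

# Hypothetical group of 3 replicate peptide ids receiving 7 alignments.
replicate_ids = [9, 10, 11]
counts = {pid: 0 for pid in replicate_ids}

# One alignment -> one count, randomly apportioned across the replicates.
for _ in range(7):
    counts[random.choice(replicate_ids)] += 1

# The group total is preserved; no alignment is counted more than once.
assert sum(counts.values()) == 7
```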

Current defaults & potential bug

The documented defaults for Bowtie and Bowtie 2, respectively, are:

The process by which bowtie chooses an alignment to report is randomized in order to avoid “mapping bias” - the phenomenon whereby an aligner systematically fails to report a particular class of good alignments, causing spurious “holes” in the comparative assembly. Whenever bowtie reports a subset of the valid alignments that exist, it makes an effort to sample them randomly. This randomness flows from a simple seeded pseudo-random number generator and is deterministic in the sense that Bowtie will always produce the same results for the same read when run with the same initial “seed” value (see --seed option).
Bowtie 2's search for alignments for a given read is "randomized." That is, when Bowtie 2 encounters a set of equally-good choices, it uses a pseudo-random number to choose. For example, if Bowtie 2 discovers a set of 3 equally-good alignments and wants to decide which to report, it picks a pseudo-random integer 0, 1 or 2 and reports the corresponding alignment. Arbitrary choices can crop up at various points during alignment.

Both seem to suggest that alignments to one replicate should be randomly distributed across the replicate entries. I thought this would be the case since they are "equally-good" choices. After running the pipeline on some example simulations, I actually found that we have been assigning the count always to the last replicate sequence in the fasta. I'm really not sure why this is the case. However, when using the -a parameter and resetting the index, we get the behavior of replicate counts for replicate sequences.

Looking forward to hearing thoughts on this. (cc @ksung25)

sminot commented 2 years ago

Thanks for the very complete and helpful explanation of this issue, @jgallowa07!

My interpretation of the formatting of the VirScan library is that duplicate sequences are included so that identical regions of different viruses can be counted for each of the viruses. Based on this interpretation, I would recommend going with Option 1, as you suggest. I can think of some rare applications where Option 2 may be more appropriate, but it would probably require further research to determine whether that approach is valid even in those cases.

After reading over your summary, it sounds like it might actually be pretty straightforward to support duplicated sequences via Option 1. Does that seem like a feature that you'd be able to add without too much trouble?

ksung25 commented 2 years ago

I am in favor of Option 2 (assuming this is handled by Bowtie as stated in their documentation).

sminot commented 2 years ago

Considering equally-good alignments with mismatches

I hadn't considered the unintended behavior of the -a flag with alignments that tolerate mismatches. However, I think it is worth noting that the behavior observed by @jgallowa07 is that only the sequence with the highest index is likely to get the hit in this case. So if a random apportioning of hits is desired when assigning equally-good alignments to non-identical sequences (as I agree would make sense), then this may need to be addressed by using the -a flag and then parsing the output to randomly apportion the alignments for each read.

Considering VirScan

Switching topics now from the equally-good-mismatched alignments back to the VirScan library. I believe that the most biologically appropriate behavior for the VirScan library would be to duplicate the counts across all epitopes which are shared exactly by viruses. Given the complexities of using -a noted above, I am now more in favor of the approach where this behavior is accomplished directly by the phip-flow code: first deduplicating the sequences before alignment, and then re-duplicating the counts after alignment.
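The dedup/re-dup idea can be sketched in pandas (column names and counts here are hypothetical, not the phip-flow schema): align only one representative per unique oligo, then copy the representative's count back onto every replicate row.

```python
import pandas as pd

# Toy peptide table: ids 1 and 2 share the same oligo sequence.
peptides = pd.DataFrame({
    "peptide_id": [0, 1, 2, 3],
    "oligo": ["ATGC", "GGCC", "GGCC", "TTAA"],
})

# Deduplicate before alignment: one fasta entry per unique sequence.
reps = peptides.drop_duplicates("oligo")

# Pretend these counts came back from alignment, keyed by representative id.
rep_counts = {0: 5, 1: 3, 3: 1}

# Re-duplicate after alignment: map counts back through the oligo sequence,
# so every replicate row inherits its representative's count.
seq_counts = {row.oligo: rep_counts[row.peptide_id]
              for row in reps.itertuples()}
peptides["count"] = peptides["oligo"].map(seq_counts)
print(peptides["count"].tolist())  # [5, 3, 3, 1]
```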

jgallowa07 commented 2 years ago

re Considering equally-good alignments with mismatches

It's interesting. With the simulations I wrote here, the highest-index peptide id seems to be assigned the counts for all samples. For example, I have a randomly generated peptide library with 10 unique peptides, where id's 9 through 12 are all exact replicates. Each of the simulated samples has exactly 10 reads, one matching each of the unique peptides, and there are 12 such samples. After alignment as we've been doing it, we get the enrichment matrix (13 peptides x 12 samples):

>>> ds = phippery.load("sim-example-output/sim-example-replicates-4.csv.phip")
>>> ds.counts
<xarray.DataArray 'counts' (peptide_id: 13, sample_id: 12)>
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Coordinates:
  * sample_id   (sample_id) int64 0 1 2 3 4 5 6 7 8 9 10 11
  * peptide_id  (peptide_id) int64 0 1 2 3 4 5 6 7 8 9 10 11 12

This certainly makes it seem like bowtie chooses the highest-index repeat for the reported alignment - maybe we're starting with the same seed for each new alignment or something? Because looking at our empirical data (this is the HIV Phage-DMS library), the counts do seem to be randomly dispersed. Here are the counts for two identical peptides across all samples:

>>> dso.counts
<xarray.DataArray 'counts' (peptide_id: 2, sample_id: 120)>
array([[ 3,  0,  1,  0,  0,  0,  0,  0,  5,  7,  4,  1,  1,  1,  1,  1,
         0,  0,  4,  1,  1,  0,  0,  0,  3,  3,  0,  4,  0,  8,  6,  4,
         1,  1,  0,  5,  0,  1,  0,  0,  3,  2,  0,  2,  0,  0,  2,  0,
         2,  1,  1,  2,  2,  0,  1,  1,  0,  2,  0,  0,  3,  0,  2,  3,
         4,  4,  2,  0,  3,  2,  0,  0,  1,  0,  0,  0,  0,  1,  3,  1,
         1,  2,  0,  3,  0,  1,  0,  0,  1,  4,  1,  0,  1,  9,  1,  1,
         0,  3,  3,  0,  4,  0,  5,  0,  2,  1,  0,  0,  5,  0,  1,  2,
         2,  2,  3,  0,  2,  1,  1,  0],
       [ 0,  0,  1,  1,  1,  3,  0,  0,  4,  4,  7,  2,  0,  4,  1,  4,
         0,  3,  0,  1,  1,  1,  0,  0,  0,  5,  3,  2,  0, 12,  6,  6,
         0,  0,  1,  4,  0,  1,  2,  1,  2,  5,  0,  2,  1,  0,  2,  1,
         1,  4,  0,  0,  4,  0,  0,  2,  1,  1,  0,  1,  2,  0,  1,  4,
         0,  1,  2,  4,  4,  3,  2,  1,  0,  0,  0,  0,  0,  2,  4,  3,
         0,  3,  3,  4,  0,  0,  0,  0,  1,  1,  0,  0,  3,  5,  1,  1,
         3,  1,  0,  0,  2,  0,  3,  0,  3,  0,  1,  0,  3,  0,  2,  2,
         0,  2,  5,  0,  2,  2,  3,  2]])
Coordinates:
  * sample_id   (sample_id) int64 0 1 4 5 8 9 12 ... 318 320 322 324 326 328 330
  * peptide_id  (peptide_id) int64 1034 4494
>>> dso.counts.sum()
<xarray.DataArray 'counts' ()>
array(378)
>>> dso.counts.sum(axis=1)
<xarray.DataArray 'counts' (peptide_id: 2)>
array([179, 199])

I'm going to assume my simulations are off and that bowtie acts as expected for now, unless there are further thoughts.

Considering VirScan

I am now more in favor of the approach where this behavior is accomplished directly by the phip-flow code: first deduplicating the sequences before alignment, and then re-duplicating the counts after alignment.

I agree! To make sure I understand correctly, I've described this behavior more concretely in https://github.com/matsengrp/phippery/issues/137