Open pverscha opened 1 year ago
We found that this doesn't work as easily in practice. Consider the following rows:
Original | Equalized |
---|---|
AAIS |
AALS |
AALE |
AALE |
when replacing I
by L
(as we always do), this is sorted correctly in the original order, but not the equalized order. It's more complicated than this to find one sort that does both of them at the same time.
I was already amazed at the improvement, it would've been a great simplification. Perhaps it could still be interesting to try out a different sorting algorithm, though. Timsort is very good at nearly-sorted lists. On the other hand, you lose the easy of GNU sort.
I was already amazed at the improvement, it would've been a great simplification. Perhaps it could still be interesting to try out a different sorting algorithm, though. Timsort is very good at nearly-sorted lists. On the other hand, you lose the easy of GNU sort.
The dataset is quite big and won't fit in memory. Do you know of any decent implementations for Timsort that can handle large files? I feel like writing this myself from scratch is a bit out of scope for this thesis.
Right now this is my solution:
This issue describes a transformation we can apply to the input of the database construction pipeline such that the pipeline itself can be simplified and a significant performance improvement can be achieved.
See the pipeline overview here. This will help in understanding the transformation. We will be referring to the names of the different steps of this schematic in the description below.
Taxonomic analysis
Changes required: In this step, we are going to replace all occurrences of K in both peptide columns by Z and afterwards sort this file by original peptide. This transformation is what we call the "Z-trick" and will save us from having to sort the file again (multiple times) deeper in the pipeline. Something else we could add, is the taxon ID that's associated with each peptides (this can simply be an extra column in this peptides file), this information will be used in subsequent steps in the pipeline.
join_equalized_pepts_and_entries or join_original_pepts_and_entries: In this step, all peptide sequences from the previous step are extracted (once for the equalized pepts and once for the original pepts) and are matched with taxon IDs. These files are then afterwards sorted. Since, at this point, we have added the taxon IDs in the previous step and we already sorted by original peptide, so both these steps are completely redundant and can be removed.
number_sequences: In this step we are going to start from the
peptides.tsv
file generated bydigest_proteins
(and no longer from the files generated byjoin_equalized_pepts_and_entries
andjoin_original_pepts_and_entries
). Input to this step is actually simply the columnsoriginal peptides
andequalized peptides
frompeptides.tsv
that need to be merged together into a sorted list of all unique sequences in both these lists. Since, due to the Z-trick, both these columns are already sorted, we can easily convert these into one big sorted column in O(n) time, instead of O(n* log(n)) time. Afterwards, usecat -n
to number the unique sequences and produce thesequences.tsv.gz
file (such that it's identical to the currentsequences.tsv.gz
file).calculate_equalized_lcas and calculate_original_lcas: Instead of starting from
aa_sequence_taxon_*.tsv.gz
, we can start from thepeptides.tsv.gz
file produced bydigest_proteins
and obtain the same information that used to be present inaa_sequence_taxon_*.tsv.gz
by simply extracting the columnsoriginal peptide
andtaxon ID
(orequalized peptide
). The rest of this step can remain the same.Functional analysis
Right now, the functional analysis steps in the database construction pipeline are pretty confusing and repeat some of the work that has already been done as part of the taxonomic analysis. After taking an in-depth look at these steps, we can conclude that this whole functional analysis pipeline (indicated with the blue arrows in the schematic overview) can be replaced.
The "key insight" here is the fact that both the taxonomic and functional analysis steps simply need to "summarize" a collection of datapoints for equal peptide sequences. For the taxonomic analysis, we take as input all taxon IDs that belong to a unique peptide sequence and compute the LCA. For the functional analysis, we take as input a list of all functional annotations (both GO, EC and InterPro) and simply count how many times each annotation occurs.
Thus, after the
number_sequences
step is finished, we can directly perform thecalculate_original_fas
andcalculate_equalized_fas
steps by extracting theoriginal peptide
andfunctional information
columns from thepeptides.tsv.gz
file and performing the same calculations as the current pipeline.Parallellization
Once the
number_sequences
step has finished, the 4 stepscalculate_original_lcas
,calculate_equalized_lcas
,calculate_original_fas
andcalculate_equalized_fas
can all be computed in parallel (they all depend on the same input information) and do not need to wait for each other.See this for an updated version of the pipeline schema :)
The Z-trick
In order to explain how the "Z-trick" (which makes all of these changes possible) works, it's best to take a look at a small example. The
peptides.tsv
file always contains two columns with a peptide sequence: one containing the original tryptic sequence, one where I is replaced by L (to simulate I = L). Now, if the first column is sorted, the other one is "nearly" sorted. Take a look at the example peptides belowIn this example, the rows are sorted by the sequence in the first column. If we perform the substitution and replace all I's by L in the second column, you can see that the order is not completely right. Since the
L
comes afterJ
andK
after the alphabet, the order of the first line switches with the second and third (it should be `AJLR, AKLR, ALLR, AVLR).Now, since there are only two letters between
I
andL
in the alphabet (more specificallyJ
andK
), these are also the only letters that might cause this order conflict between the first and the second column. In metaproteomics, however,J
will never occur in a protein (as this is not a valid identifier for an amino acid), so we don't need to worry about this one.K
does occur, so that's the only case we should care about and this is where the "Z-trick" comes into play.Z
is also not a valid identifier for amino acids, and can thus freely be used to by our application for other things. If we replace all occurrences ofK
byZ
beforehand (for both columns), we are guaranteed that the second column is always sorted if the first one is sorted.