unipept / unipept-database

Makes database tables and indices for Unipept
MIT License

Improvements to the database construction pipeline #34

Open pverscha opened 1 year ago

pverscha commented 1 year ago

This issue describes a transformation of the input to the database construction pipeline that both simplifies the pipeline itself and yields a significant performance improvement.

See the pipeline overview here. This will help in understanding the transformation. We will be referring to the names of the different steps of this schematic in the description below.

Taxonomic analysis

1. digest_proteins: During this step, in-silico tryptic digestion of all UniProtKB proteins is performed. After this digestion, we end up with a file called "peptides.tsv" that contains (among others) the following columns:
    • equalized peptide: A tryptic peptide in which all occurrences of I are replaced by L.
    • original peptide: The original tryptic peptide (the first column represents the same peptide, but with I = L).
    • functional annotations: A list of GO terms, EC numbers and InterPro entries associated with this peptide.

Changes required: In this step, we are going to replace all occurrences of K in both peptide columns by Z and afterwards sort this file by original peptide. This transformation is what we call the "Z-trick" and saves us from having to sort the file again (multiple times) deeper in the pipeline. Something else we could add is the taxon ID associated with each peptide (simply an extra column in the peptides file); this information will be used in subsequent steps of the pipeline. A shell sketch of the revised flow follows this list.

2. join_equalized_pepts_and_entries or join_original_pepts_and_entries: In this step, all peptide sequences from the previous step are extracted (once for the equalized peptides and once for the original peptides) and matched with taxon IDs. The resulting files are then sorted. Since we already added the taxon IDs in the previous step and sorted by original peptide, both of these steps are now completely redundant and can be removed.

3. number_sequences: In this step we start from the peptides.tsv file generated by digest_proteins (and no longer from the files generated by join_equalized_pepts_and_entries and join_original_pepts_and_entries). The input to this step is simply the original peptide and equalized peptide columns of peptides.tsv, which need to be merged into one sorted list of all unique sequences occurring in either column. Since, thanks to the Z-trick, both columns are already sorted, we can merge them into one big sorted column in O(n) time instead of O(n log n) time (see the sketch after this list). Afterwards, use cat -n to number the unique sequences and produce the sequences.tsv.gz file (such that it is identical to the current sequences.tsv.gz file).

4. calculate_equalized_lcas and calculate_original_lcas: Instead of starting from aa_sequence_taxon_*.tsv.gz, we can start from the peptides.tsv.gz file produced by digest_proteins and obtain the same information that used to be present in aa_sequence_taxon_*.tsv.gz by simply extracting the original peptide (or equalized peptide) and taxon ID columns. The rest of this step can remain the same.
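To make the data flow concrete, here is a minimal bash sketch of the revised steps. The file names (peptides_raw.tsv, peptides.tsv, sequences.tsv) and the assumption that columns 1 and 2 hold the original and equalized peptide are illustrative, not the pipeline's actual layout:

```sh
# Z-trick: replace K by Z in both peptide columns, then sort once by original peptide.
# Byte-wise ordering (LC_ALL=C) keeps GNU sort and the later merge step consistent.
awk 'BEGIN { FS = OFS = "\t" } { gsub(/K/, "Z", $1); gsub(/K/, "Z", $2); print }' peptides_raw.tsv \
  | LC_ALL=C sort -t $'\t' -k1,1 > peptides.tsv

# number_sequences: both peptide columns are now sorted, so GNU sort's merge mode (-m)
# combines them into one sorted, deduplicated list in O(n); cat -n then assigns the IDs.
LC_ALL=C sort -m -u <(cut -f1 peptides.tsv) <(cut -f2 peptides.tsv) \
  | cat -n > sequences.tsv
```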

Functional analysis

Right now, the functional analysis steps in the database construction pipeline are pretty confusing and repeat some of the work that has already been done as part of the taxonomic analysis. After taking an in-depth look at these steps, we can conclude that this whole functional analysis pipeline (indicated with the blue arrows in the schematic overview) can be replaced.

The "key insight" here is the fact that both the taxonomic and functional analysis steps simply need to "summarize" a collection of datapoints for equal peptide sequences. For the taxonomic analysis, we take as input all taxon IDs that belong to a unique peptide sequence and compute the LCA. For the functional analysis, we take as input a list of all functional annotations (both GO, EC and InterPro) and simply count how many times each annotation occurs.

Thus, after the number_sequences step has finished, we can directly perform the calculate_original_fas and calculate_equalized_fas steps by extracting the original peptide (or equalized peptide) and functional annotation columns from the peptides.tsv.gz file and performing the same calculations as the current pipeline.
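Because the file is sorted by peptide, every such summarization is a single streaming pass over consecutive equal keys. A sketch of the functional counting in awk, assuming (hypothetically) that column 1 is the peptide and column 3 a semicolon-separated annotation list; the LCA steps follow the same group-and-fold pattern with a different fold:

```sh
# One pass over a peptide-sorted stream: accumulate annotation counts per peptide
# group and flush the group whenever the peptide changes. `delete cnt` clears the
# whole array (supported by gawk and mawk).
cut -f1,3 peptides.tsv | awk '
BEGIN { FS = OFS = "\t" }
$1 != prev {
    for (a in cnt) print prev, a, cnt[a]
    delete cnt
    prev = $1
}
{
    n = split($2, parts, ";")
    for (i = 1; i <= n; i++) cnt[parts[i]]++
}
END { for (a in cnt) print prev, a, cnt[a] }' > original_fas.tsv
```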

Parallelization

Once the number_sequences step has finished, the four steps calculate_original_lcas, calculate_equalized_lcas, calculate_original_fas and calculate_equalized_fas can all be computed in parallel: they all depend on the same input and do not need to wait for each other.
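In the pipeline's own shell idiom, this fan-out is just backgrounded jobs (the script names here are placeholders, not the actual pipeline targets):

```sh
# All four steps read the same sorted peptides.tsv / sequences.tsv inputs,
# so they can run concurrently once number_sequences has finished.
./calculate_original_lcas.sh &
./calculate_equalized_lcas.sh &
./calculate_original_fas.sh &
./calculate_equalized_fas.sh &
wait  # block until all four summarizations are done
```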

See this for an updated version of the pipeline schema :)

The Z-trick

In order to explain how the "Z-trick" (which makes all of these changes possible) works, it's best to look at a small example. The peptides.tsv file always contains two columns with a peptide sequence: one containing the original tryptic sequence, and one in which I is replaced by L (to simulate I = L). Now, if the first column is sorted, the other one is only "nearly" sorted. Take a look at the example peptides below.

| original_peptide | equalized_peptide |
| --- | --- |
| AILR | ALLR |
| AJLR | AJLR |
| AKLR | AKLR |
| AVLR | AVLR |

In this example, the rows are sorted by the sequence in the first column. If we perform the substitution and replace all I's by L in the second column, you can see that the order is no longer completely right. Since L comes after J and K in the alphabet, the first line switches places with the second and third (it should be `AJLR, AKLR, ALLR, AVLR`).

Now, since there are only two letters between I and L in the alphabet (namely J and K), these are also the only letters that might cause this order conflict between the first and the second column. In metaproteomics, however, J will never occur in a protein (as it is not a valid identifier for an amino acid), so we don't need to worry about that one. K does occur, so that is the only case we should care about, and this is where the "Z-trick" comes into play. Z is also not a valid identifier for an amino acid, and can thus freely be used by our application for other things. If we replace all occurrences of K by Z beforehand (in both columns), we are guaranteed that the second column is always sorted if the first one is sorted.
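A quick way to try the claim on a toy input, using the rows from the example above that can actually occur (no J) and with K already rewritten to Z; `sort -c` only checks that its input is sorted:

```sh
# Original peptides, sorted, with K already replaced by Z (AKLR became AZLR).
printf 'AILR\nAVLR\nAZLR\n' > original.txt
LC_ALL=C sort -c original.txt && echo "original column sorted"

# Derive the equalized column (I -> L) and check that it is still sorted.
tr 'I' 'L' < original.txt | LC_ALL=C sort -c && echo "equalized column still sorted"
```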

stijndcl commented 1 year ago

We found that this doesn't work as easily in practice. Consider the following rows:

| Original | Equalized |
| --- | --- |
| AAIS | AALS |
| AALE | AALE |

When replacing I by L (as we always do), these rows are sorted correctly in the original order, but not in the equalized order: the substitution turns AAI into AAL, creating a tie at the third character, so the comparison falls through to the fourth character (E < S) and the order flips. No K is involved, so the Z-trick doesn't help here; J and K are not the only hazard. Finding one sort order that works for both columns at the same time is more complicated than this.
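The failure is easy to reproduce on the command line (`sort -c` exits non-zero and reports the first out-of-order line):

```sh
# The original column is sorted...
printf 'AAIS\nAALE\n' | LC_ALL=C sort -c && echo "original column sorted"

# ...but after I -> L the third characters tie (AAL vs AAL), the comparison falls
# through to the fourth character (S vs E), and the order flips.
printf 'AAIS\nAALE\n' | tr 'I' 'L' | LC_ALL=C sort -c || echo "equalized column NOT sorted"
```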

ninewise commented 1 year ago

I was already amazed at the improvement; it would've been a great simplification. Perhaps it could still be interesting to try out a different sorting algorithm, though. Timsort is very good at nearly-sorted lists. On the other hand, you lose the ease of GNU sort.

stijndcl commented 1 year ago

> I was already amazed at the improvement; it would've been a great simplification. Perhaps it could still be interesting to try out a different sorting algorithm, though. Timsort is very good at nearly-sorted lists. On the other hand, you lose the ease of GNU sort.

The dataset is quite big and won't fit in memory. Do you know of any decent implementations for Timsort that can handle large files? I feel like writing this myself from scratch is a bit out of scope for this thesis.

Right now this is my solution: