unipept / unipept-database

Makes database tables and indices for Unipept
MIT License
0 stars 2 forks source link

Rust LCA #32

Closed stijndcl closed 9 months ago

stijndcl commented 11 months ago

Re-wrote the LCA Java tool in Rust, taking into account all comments on my previous PR.

ninewise commented 11 months ago

Out of curiosity: if you're porting this to rust as well, why don't just call umgap joinkmers taxons.tsv (cutting out the right columns from the TSV input)? I've replaced the java command with this before (to try out non-LCA indexes) and it works quite well.

pdawyndt commented 11 months ago

@stijndcl: @BramDevlaminck has already used @ninewise's Rust implementations of the LCA (and alternatives such as LCA*) in his attempts for an improved Unipept index; feel free to check with him

@ninewise: FYI: @BramDevlaminck is implementing alternative index structures in Rust (suffix tree, (compressed/sparse) sufix array, (bi-directional) FM-index) to see if we could replace the current Unipept index such that i) non-tryptic peptides can be found (including tryptic peptides that suffer from miscleavages) and ii) peptides can be found through inexact matching instead of exact matching, without compromising speed (or even improving speed) for lookups and keeping memory within limits of our machines. We still precompute aggregations (currently LCA* as we did for umgap) at all internal nodes of the index structures for performance reasons. We already finished implementation and benchmarking of suffix trees for SwissProt, with extremly good (fast) results. Now we need to scale-up x500 in size (where memory will become the bottleneck), so this is why were moving on to the other more memory-friendly index structures.

stijndcl commented 11 months ago

Out of curiosity: if you're porting this to rust as well, why don't just call umgap joinkmers taxons.tsv (cutting out the right columns from the TSV input)? I've replaced the java command with this before (to try out non-LCA indexes) and it works quite well.

Because I didn't know that existed :) I'll give it a shot later this week and see if it outperforms mine.

BramDevlaminck commented 11 months ago

@pdawyndt I'm actually using LCA and not LCA. I discussed this with @pverscha. The reason for this is that applying LCA to the leaves vs applying it directly on the descendants gave a different result.

Here is a minimal example that shows where a difference occurs (this is part of the first version of my thesis):

image
ninewise commented 11 months ago

Thanks for the FYI, @pdawyndt; I'll be asking for a copy to read through at the end of the year, if that's OK.

@BramDevlaminck I don't quite understand the image. What is the original input here? I'd assume it was 9606 10566 9606, but then why are they structured in a tree like this? Or is this the tree of operations, something like lca*(lca*(9606, 10566), 9606)?

BramDevlaminck commented 11 months ago

@ninewise

@BramDevlaminck I don't quite understand the image. What is the original input here? I'd assume it was 9606 10566 9606, but then why are they structured in a tree like this? Or is this the tree of operations, something like lca*(lca*(9606, 10566), 9606)?

The original input are the values of the leaves, so indeed 9606, 10566 and 9606

Figure (a) performs the LCA on the leaves of the subtree. So for the smaller left subtree it computes `lca(9606, 10566), and for the root of the treelca*(9606, 10566, 9606)`.

Figure (b) performs the LCA on the values on the tree of operations, so indeed `lca(lca(9606, 10566), 9606)for the root, andlca(9606, 10566)` for the left subtree.

pverscha commented 11 months ago

@ninewise @stijndcl I just also thought about the fact that the filter that's present in the current database construction pipeline should still be taken into account. I don't think that UMGAP follows the same exclusion rules as Unipept does at this point? I'm talking about the filtering that takes place in the validate function: https://github.com/unipept/unipept-database/blob/fb0c554bcb442bbb061d593e09106e053e3d3b46/scripts/helper_scripts/parser/src/taxons/TaxonList.java#L81C5-L116C1

ninewise commented 11 months ago

Bram Devlaminck @.***> wrote:

@BramDevlaminck I don't quite understand the image. What is the original input here? I'd assume it was 9606 10566 9606, but then why are they structured in a tree like this? Or is this the tree of operations, something like lca*(lca*(9606, 10566), 9606)?

The original input are the values of the leaves, so indeed 9606, 10566 and 9606

Figure (a) performs the LCA on the leaves of the subtree. So for the smaller left subtree it computes `lca(9606, 10566), and for the root of the treelca*(9606, 10566, 9606)`.

Figure (b) performs the LCA on the values on the tree of operations, so indeed `lca(lca(9606, 10566), 9606)for the root, andlca(9606, 10566)` for the left subtree.

Right, makes more sense now. LCA* does indeed not work for a simple pairwise aggregation, unless you additionally store the "depth" of the last merge. You can check this in code here or read the discussion of this algorithm in Tom's master thesis (I don't have a copy here, but Peter and Bart probably do).

Pieter Verschaffelt @.***> wrote:

@ninewise @stijndcl I just also thought about the fact that the filter that's present in the current database construction pipeline should still be taken into account. I don't think that UMGAP follows the same exclusion rules as Unipept does at this point? I'm talking about the filtering that takes place in the validate function: https://github.com/unipept/unipept-database/blob/fb0c554bcb442bbb061d593e09106e053e3d3b46/scripts/helper_scripts/parser/src/taxons/TaxonList.java#L81C5-L116C1

It should: it takes the taxons.tsv table as input and uses the "valid" column therein.

stijndcl commented 11 months ago

@ninewise I gave umgap a go and joinkmers needs 86 seconds whereas my implementation only needs 22. The difference is quite substantial so I'd propose we use mine instead.

I believe part of the difference could be that umgap operates on the taxons file, but my tool uses the lineages that we build in an earlier stage of the pipeline.

ninewise commented 11 months ago

Well, just using the taxons file probably wouldn't be that much slower (I think), but joinkmers also does a tree-based LCA (actually hybrid 95%) rather than a table-based LCA, which apparently makes quite a difference in performance. I definitely agree going with your implementation.