Closed stijndcl closed 10 months ago
Out of curiosity: if you're porting this to rust as well, why don't just
call umgap joinkmers taxons.tsv
(cut
ting out the right columns from
the TSV input)? I've replaced the java command with this before (to try
out non-LCA indexes) and it works quite well.
@stijndcl: @BramDevlaminck has already used @ninewise's Rust implementations of the LCA (and alternatives such as LCA*) in his attempts for an improved Unipept index; feel free to check with him
@ninewise: FYI: @BramDevlaminck is implementing alternative index structures in Rust (suffix tree, (compressed/sparse) sufix array, (bi-directional) FM-index) to see if we could replace the current Unipept index such that i) non-tryptic peptides can be found (including tryptic peptides that suffer from miscleavages) and ii) peptides can be found through inexact matching instead of exact matching, without compromising speed (or even improving speed) for lookups and keeping memory within limits of our machines. We still precompute aggregations (currently LCA* as we did for umgap) at all internal nodes of the index structures for performance reasons. We already finished implementation and benchmarking of suffix trees for SwissProt, with extremly good (fast) results. Now we need to scale-up x500 in size (where memory will become the bottleneck), so this is why were moving on to the other more memory-friendly index structures.
Out of curiosity: if you're porting this to rust as well, why don't just call
umgap joinkmers taxons.tsv
(cut
ting out the right columns from the TSV input)? I've replaced the java command with this before (to try out non-LCA indexes) and it works quite well.
Because I didn't know that existed :) I'll give it a shot later this week and see if it outperforms mine.
@pdawyndt I'm actually using LCA and not LCA. I discussed this with @pverscha. The reason for this is that applying LCA to the leaves vs applying it directly on the descendants gave a different result.
Here is a minimal example that shows where a difference occurs (this is part of the first version of my thesis):
Thanks for the FYI, @pdawyndt; I'll be asking for a copy to read through at the end of the year, if that's OK.
@BramDevlaminck I don't quite understand the image. What is the original input here? I'd assume it was 9606 10566 9606
, but then why are they structured in a tree like this? Or is this the tree of operations, something like lca*(lca*(9606, 10566), 9606)
?
@ninewise
@BramDevlaminck I don't quite understand the image. What is the original input here? I'd assume it was
9606 10566 9606
, but then why are they structured in a tree like this? Or is this the tree of operations, something likelca*(lca*(9606, 10566), 9606)
?
The original input are the values of the leaves, so indeed 9606, 10566 and 9606
Figure (a) performs the LCA on the leaves of the subtree. So for the smaller left subtree it computes `lca(9606, 10566), and for the root of the tree
lca*(9606, 10566, 9606)`.
Figure (b) performs the LCA on the values on the tree of operations, so indeed `lca(lca(9606, 10566), 9606)for the root, and
lca(9606, 10566)` for the left subtree.
@ninewise @stijndcl I just also thought about the fact that the filter that's present in the current database construction pipeline should still be taken into account. I don't think that UMGAP follows the same exclusion rules as Unipept does at this point? I'm talking about the filtering that takes place in the validate
function: https://github.com/unipept/unipept-database/blob/fb0c554bcb442bbb061d593e09106e053e3d3b46/scripts/helper_scripts/parser/src/taxons/TaxonList.java#L81C5-L116C1
Bram Devlaminck @.***> wrote:
@BramDevlaminck I don't quite understand the image. What is the original input here? I'd assume it was
9606 10566 9606
, but then why are they structured in a tree like this? Or is this the tree of operations, something likelca*(lca*(9606, 10566), 9606)
?The original input are the values of the leaves, so indeed 9606, 10566 and 9606
Figure (a) performs the LCA on the leaves of the subtree. So for the smaller left subtree it computes `lca(9606, 10566)
, and for the root of the tree
lca*(9606, 10566, 9606)`.Figure (b) performs the LCA on the values on the tree of operations, so indeed `lca(lca(9606, 10566), 9606)
for the root, and
lca(9606, 10566)` for the left subtree.
Right, makes more sense now. LCA* does indeed not work for a simple pairwise aggregation, unless you additionally store the "depth" of the last merge. You can check this in code here or read the discussion of this algorithm in Tom's master thesis (I don't have a copy here, but Peter and Bart probably do).
Pieter Verschaffelt @.***> wrote:
@ninewise @stijndcl I just also thought about the fact that the filter that's present in the current database construction pipeline should still be taken into account. I don't think that UMGAP follows the same exclusion rules as Unipept does at this point? I'm talking about the filtering that takes place in the
validate
function: https://github.com/unipept/unipept-database/blob/fb0c554bcb442bbb061d593e09106e053e3d3b46/scripts/helper_scripts/parser/src/taxons/TaxonList.java#L81C5-L116C1
It should: it takes the taxons.tsv table as input and uses the "valid" column therein.
@ninewise I gave umgap
a go and joinkmers
needs 86 seconds whereas my implementation only needs 22. The difference is quite substantial so I'd propose we use mine instead.
I believe part of the difference could be that umgap
operates on the taxons
file, but my tool uses the lineages
that we build in an earlier stage of the pipeline.
Well, just using the taxons file probably wouldn't be that much slower
(I think), but joinkmers
also does a tree-based LCA (actually
hybrid 95%) rather than a table-based LCA, which apparently makes
quite a difference in performance. I definitely agree going with your
implementation.
Re-wrote the LCA Java tool in Rust, taking into account all comments on my previous PR.