Closed sjaenick closed 1 year ago
Update: Even with the same number of threads, there are output differences between runs. This seems very odd.
Thanks again.
I could replace the difference between runs with --threads 10
and '--threads 20'.
In the printed log, The number of matches' were different, so it seems there is a bug in
linearSearchParallel()`.
For the inconsistency with the same threads number, I couldn't reproduce it yet. Did you use the same input file?
Try this one. 100k.fas.txt
The JobID_classifications.tsv file can be different by runs without any errors. It is because the order of taxa in the last column can be differ by runs.
The inconsistency with the same threads number seems because of it.
I repeated four times with 100k.fas.txt, but the third column (tax id column) were identical.
Could you compare your _classifications.tsv
files after running awk '{print $3}'
?
And I guess that _report.tsv
are the same across runs with the same number of threads.
The JobID_classifications.tsv file can be different by runs without any errors. It is because the order of taxa in the last column can be differ by runs.
The inconsistency with the same threads number seems because of it. I repeated four times with 100k.fas.txt, but the third column (tax id column) were identical. Could you compare your
_classifications.tsv
files after runningawk '{print $3}'
? And I guess that_report.tsv
are the same across runs with the same number of threads.
You are correct - its always the last column that differs in their order (maybe this should be sorted for better reproducibility?); the reports are identical.
Thank you for your double-check! I'm currently trying to figure out the inconsistency between runs depending on the --threads parameter.
I agree with your suggestion for the last column :)
I found that the RefSeq database is broken. I built and upload again. Could you test with the newer DB?
I found long stretch of unintended 0s in the old DB. Depending on the number of threads, the 0s are accessed differently or not accessed, inducing the inconsistency.
The new DB is built using the released Metabuli v1.0.0.
I found that the RefSeq database is broken. I built and upload again. Could you test with the newer DB?
Thanks, that seems to fix it! After downloading the fixed database, _reports.tsv are identical and _classifications.tsv only differ in the last column (see my other PR for that).
Great! Thank you for helping us.
Hi @JaebeomKim0731and @martin-steinegger,
I noticed something unexpected - read classifications are different between runs ~depending on the
--threads
parameter~:metabuli classify --seq-mode 1 testread.fas /vol/biodb/local_databases/MGX/metabuli/refseq/ outdir10 jobid --threads 10
metabuli classify --seq-mode 1 testread.fas /vol/biodb/local_databases/MGX/metabuli/refseq/ outdir20 jobid --threads 20
testread.fas.txtThis seems to happen quite frequently, e.g. for a test dataset with 100.000 sequences, I see over 4.000 differences in the output. Can you try to reproduce this?