steineggerlab / Metabuli

Metabuli: specific and sensitive metagenomic classification via joint analysis of DNA and amino acid.
GNU General Public License v3.0
118 stars 10 forks source link

Inconsistency of read assignments #12

Closed sjaenick closed 1 year ago

sjaenick commented 1 year ago

Hi @JaebeomKim0731and @martin-steinegger,

I noticed something unexpected - read classifications are different between runs ~depending on the --threads parameter~:

metabuli classify --seq-mode 1 testread.fas /vol/biodb/local_databases/MGX/metabuli/refseq/ outdir10 jobid --threads 10

metabuli classify --seq-mode 1 testread.fas /vol/biodb/local_databases/MGX/metabuli/refseq/ outdir20 jobid --threads 20 testread.fas.txt

$ cat outdir10/jobid_classifications.tsv outdir20/jobid_classifications.tsv 
1       testseq 1501268 183     0.606557        0.622951        2       species 1501268:18 
1       testseq 1219    183     0.631148        0.672131        5       species 74546:1 167546:2 1219:24

This seems to happen quite frequently, e.g. for a test dataset with 100.000 sequences, I see over 4.000 differences in the output. Can you try to reproduce this?

sjaenick commented 1 year ago

Update: Even with the same number of threads, there are output differences between runs. This seems very odd.

jaebeom-kim commented 1 year ago

Thanks again. I could replace the difference between runs with --threads 10 and '--threads 20'. In the printed log, The number of matches' were different, so it seems there is a bug inlinearSearchParallel()`.

For the inconsistency with the same threads number, I couldn't reproduce it yet. Did you use the same input file?

sjaenick commented 1 year ago

Try this one. 100k.fas.txt

jaebeom-kim commented 1 year ago

The JobID_classifications.tsv file can be different by runs without any errors. It is because the order of taxa in the last column can be differ by runs.

The inconsistency with the same threads number seems because of it. I repeated four times with 100k.fas.txt, but the third column (tax id column) were identical. Could you compare your _classifications.tsv files after running awk '{print $3}'? And I guess that _report.tsv are the same across runs with the same number of threads.

sjaenick commented 1 year ago

The JobID_classifications.tsv file can be different by runs without any errors. It is because the order of taxa in the last column can be differ by runs.

The inconsistency with the same threads number seems because of it. I repeated four times with 100k.fas.txt, but the third column (tax id column) were identical. Could you compare your _classifications.tsv files after running awk '{print $3}'? And I guess that _report.tsv are the same across runs with the same number of threads.

You are correct - its always the last column that differs in their order (maybe this should be sorted for better reproducibility?); the reports are identical.

jaebeom-kim commented 1 year ago

Thank you for your double-check! I'm currently trying to figure out the inconsistency between runs depending on the --threads parameter.

I agree with your suggestion for the last column :)

jaebeom-kim commented 1 year ago

I found that the RefSeq database is broken. I built and upload again. Could you test with the newer DB?

I found long stretch of unintended 0s in the old DB. Depending on the number of threads, the 0s are accessed differently or not accessed, inducing the inconsistency.

The new DB is built using the released Metabuli v1.0.0.

sjaenick commented 1 year ago

I found that the RefSeq database is broken. I built and upload again. Could you test with the newer DB?

Thanks, that seems to fix it! After downloading the fixed database, _reports.tsv are identical and _classifications.tsv only differ in the last column (see my other PR for that).

jaebeom-kim commented 1 year ago

Great! Thank you for helping us.