steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

`--taxon-list` option not accepting multiple taxonomy IDs correctly #259

Closed: yonesora56 closed this issue 2 months ago

yonesora56 commented 3 months ago

Expected Behavior

I am running the following command against the AlphaFold/UniProt database. As described in the README, I also pass the options --sort-by-structure-bits 0 and --prefilter-mode 1 to reduce the amount of RAM required.

foldseek easy-search \
mmCIFfile/AF-Q657Z2-F1-model_v4.cif \
./uniprot \
./Q657Z2_result.tsv \
./tmp \
--sort-by-structure-bits 0 \
-e 0.01 \
--format-mode 4 \
--format-output query,target,taxid,taxname,qcov,tcov,evalue,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,lddt,qtmscore,ttmscore,alntmscore,rmsd,prob,cigar \
--threads 10 \
--prefilter-mode 1 \
--taxon-list 9606,10090,3702,4577

I expected to retrieve hits only from the taxa in the specified taxonomy list.

 

Current Behavior

In addition to the specified taxonomy IDs (9606, 10090, 3702, 4577), hits from other species (e.g., 2382222 Roseomonas wenyumeiae) were also found.

query   target  taxid   taxname qcov    tcov    evalue  fident  alnlen  mismatch    gapopen qstart  qend    tstart  tend    lddt    qtmscore    ttmscore    alntmscore  rmsd    prob    cigar
AF-Q657Z2-F1-model_v4.cif   AF-K7VP78-F1-model_v4   4577    Zea mays    1.000   0.699   2.793E-20   0.712   153 29  1   1   138 67  219 9.107E-01   8.005E-01   5.133E-01   5.133E-01   8.714E+00   1.000   22M15D116M
AF-Q657Z2-F1-model_v4.cif   AF-C0P6E8-F1-model_v4   4577    Zea mays    0.848   0.900   3.857E-20   0.897   117 12  0   22  138 14  130 9.121E-01   7.442E-01   7.872E-01   7.872E-01   2.693E+00   1.000   117M
AF-Q657Z2-F1-model_v4.cif   AF-B6SL54-F1-model_v4   4577    Zea mays    0.841   0.892   4.389E-20   0.905   116 11  0   23  138 15  130 9.404E-01   7.278E-01   7.718E-01   7.718E-01   4.266E+00   1.000   116M
AF-Q657Z2-F1-model_v4.cif   AF-B6TBM8-F1-model_v4   4577    Zea mays    0.841   0.892   1.084E-19   0.905   116 11  0   23  138 15  130 9.373E-01   7.250E-01   7.689E-01   7.689E-01   4.571E+00   1.000   116M
AF-Q657Z2-F1-model_v4.cif   AF-B6T2F8-F1-model_v4   4577    Zea mays    0.841   0.892   1.156E-19   0.913   116 10  0   23  138 15  130 9.328E-01   7.253E-01   7.692E-01   7.692E-01   4.482E+00   1.000   116M
AF-Q657Z2-F1-model_v4.cif   AF-Q8LBM4-F1-model_v4   3702    Arabidopsis thaliana    0.964   0.971   3.044E-19   0.676   133 43  0   6   138 2   134 9.313E-01   8.140E-01   8.197E-01   8.197E-01   6.870E+00   1.000   133M
AF-Q657Z2-F1-model_v4.cif   AF-Q0WSL3-F1-model_v4   3702    Arabidopsis thaliana    1.000   1.000   1.038E-18   0.687   141 37  2   1   138 1   137 9.059E-01   7.583E-01   7.637E-01   7.637E-01   1.306E+01   1.000   17M4I110M3D7M
AF-Q657Z2-F1-model_v4.cif   AF-F4ILA9-F1-model_v4   3702    Arabidopsis thaliana    1.000   0.979   1.107E-18   0.645   144 41  2   1   138 1   140 8.912E-01   7.229E-01   6.987E-01   6.987E-01   1.505E+01   1.000   17M4I46M6D71M
AF-Q657Z2-F1-model_v4.cif   AF-A0A178VYA7-F1-model_v4   3702    Arabidopsis thaliana    1.000   1.000   2.111E-18   0.687   141 37  2   1   138 1   137 9.082E-01   7.532E-01   7.585E-01   7.585E-01   1.315E+01   1.000   17M4I110M3D7M
AF-Q657Z2-F1-model_v4.cif   AF-A0A178VZM6-F1-model_v4   3702    Arabidopsis thaliana    0.783   0.982   4.780E-16   0.731   108 28  1   23  130 2   108 9.057E-01   7.119E-01   8.959E-01   8.959E-01   4.325E+00   1.000   96M1I11M
AF-Q657Z2-F1-model_v4.cif   AF-Q8L8C0-F1-model_v4   3702    Arabidopsis thaliana    0.754   0.945   8.012E-16   0.730   104 27  1   23  126 2   104 9.294E-01   7.164E-01   9.033E-01   9.033E-01   1.776E+00   1.000   96M1I7M
AF-Q657Z2-F1-model_v4.cif   AF-A0A5S9X4E5-F1-model_v4   3702    Arabidopsis thaliana    0.746   0.936   1.037E-15   0.728   103 27  1   23  125 2   103 9.068E-01   7.066E-01   8.892E-01   8.892E-01   1.503E+00   1.000   95M1I7M
AF-Q657Z2-F1-model_v4.cif   AF-A0A7G2EE51-F1-model_v4   3702    Arabidopsis thaliana    0.739   0.455   7.839E-14   0.725   102 27  1   23  124 2   102 8.790E-01   6.788E-01   4.265E-01   4.265E-01   2.473E+00   1.000   97M1I4M
AF-Q657Z2-F1-model_v4.cif   AF-A8MR92-F1-model_v4   3702    Arabidopsis thaliana    0.783   0.945   2.349E-13   0.666   108 32  1   1   108 1   104 7.971E-01   6.133E-01   7.642E-01   7.642E-01   1.016E+01   1.000   16M4I88M
AF-Q657Z2-F1-model_v4.cif   AF-I3ITR1-F1-model_v4   10090   Mus musculus    0.884   0.953   5.799E-13   0.504   123 60  1   4   125 2   124 8.720E-01   7.175E-01   7.661E-01   7.661E-01   1.269E+01   1.000   42M1D80M
AF-Q657Z2-F1-model_v4.cif   AF-Q9BUE6-F1-model_v4   9606    Homo sapiens    0.891   0.961   9.112E-13   0.516   124 59  1   4   126 2   125 8.839E-01   7.278E-01   7.768E-01   7.768E-01   1.157E+01   1.000   42M1D81M
AF-Q657Z2-F1-model_v4.cif   AF-Q9D924-F1-model_v4   10090   Mus musculus    0.884   0.953   2.250E-12   0.512   123 59  1   4   125 2   124 8.754E-01   7.163E-01   7.649E-01   7.649E-01   1.258E+01   1.000   42M1D80M
AF-Q657Z2-F1-model_v4.cif   AF-A0A3A9J936-F1-model_v4   2382222 Roseomonas wenyumeiae   0.833   0.959   1.082E-10   0.401   117 68  2   11  125 1   117 8.118E-01   6.885E-01   7.732E-01   7.732E-01   8.021E+00   1.000   15M1D22M1D78M
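A quick way to spot such off-target rows in a result file is to compare the taxid column against the requested IDs. The following is a minimal sketch, not part of Foldseek: the function name `find_off_target_hits` and the abridged `SAMPLE` data (three rows from the output above, cut down to four columns) are illustrative assumptions. It assumes tab-separated output with a header row, as produced here with --format-mode 4. It does an exact ID comparison, so if --taxon-list also admits descendant taxa (which I believe it does, but have not verified here), flagged rows should be treated as candidates to check rather than as confirmed errors.

```python
import csv
import io

# Taxonomy IDs passed to --taxon-list in the command above
REQUESTED_TAXIDS = {"9606", "10090", "3702", "4577"}


def find_off_target_hits(tsv_handle, allowed_taxids):
    """Return (target, taxid, taxname) for rows whose taxid is not in allowed_taxids.

    Hypothetical helper: expects a tab-separated file whose header row
    contains at least the columns target, taxid and taxname.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [
        (row["target"], row["taxid"], row["taxname"])
        for row in reader
        if row["taxid"] not in allowed_taxids
    ]


# Three rows abridged from the output above (only the columns needed here)
SAMPLE = (
    "query\ttarget\ttaxid\ttaxname\n"
    "AF-Q657Z2-F1-model_v4.cif\tAF-K7VP78-F1-model_v4\t4577\tZea mays\n"
    "AF-Q657Z2-F1-model_v4.cif\tAF-Q9BUE6-F1-model_v4\t9606\tHomo sapiens\n"
    "AF-Q657Z2-F1-model_v4.cif\tAF-A0A3A9J936-F1-model_v4\t2382222\tRoseomonas wenyumeiae\n"
)

off_target = find_off_target_hits(io.StringIO(SAMPLE), REQUESTED_TAXIDS)
for target, taxid, taxname in off_target:
    print(target, taxid, taxname, sep="\t")
```

To check a real result, pass an open file instead of the sample, for example `find_off_target_hits(open("Q657Z2_result.tsv", newline=""), REQUESTED_TAXIDS)`.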

 

Steps to Reproduce (for bugs)

As shown in the gist below, I reproduced the problem starting from a newly re-created, empty tmp folder. https://gist.github.com/yonezawa-sora/fc5a208da4920d98f36a69e2b65341ea

Foldseek Output (TSV file)

https://gist.github.com/yonezawa-sora/c22aca78712cea675bff04cc817cef0d

Context

My Environment

Sorry for the rudimentary question. Thank you very much.

martin-steinegger commented 3 months ago

@yonezawa-sora how did you solve this issue?

yonesora56 commented 3 months ago

Sorry, I took the liberty of closing the issue. I ran the same command again in the same environment, and this time there were no hits from other species; only the desired species were retrieved. Therefore, I set the issue status to closed.

However, when I ran the easy-search command again with a different CIF file, hits from other species appeared again.

I have yet to determine under what conditions this happens. I am currently investigating how the results change under different conditions, such as selecting a single species (e.g., 9606). I will share this information later in a GitHub gist.

I would greatly appreciate any help from you. Thank you very much.
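Because the same command gave different results on different runs, it can help to compare the per-taxon hit counts of two runs side by side. The sketch below is only an illustration: `diff_runs` is a hypothetical helper, `first_run` holds the taxid column of the output posted in this issue, and `second_run` is an assumed stand-in for a rerun in which only the off-target hit is missing (the actual hits of the rerun were not posted).

```python
from collections import Counter


def diff_runs(run_a, run_b):
    """Return {taxid: (count_in_a, count_in_b)} for taxids whose hit counts differ.

    Hypothetical helper: each argument is the list of values from the
    taxid column of one result file.
    """
    counts_a, counts_b = Counter(run_a), Counter(run_b)
    return {
        taxid: (counts_a[taxid], counts_b[taxid])
        for taxid in sorted(counts_a.keys() | counts_b.keys())
        if counts_a[taxid] != counts_b[taxid]
    }


# taxid column of the result posted in this issue (18 hits)
first_run = ["4577"] * 5 + ["3702"] * 9 + ["10090", "9606", "10090", "2382222"]

# Assumed second run for illustration: identical except for the off-target hit
second_run = [taxid for taxid in first_run if taxid != "2382222"]

print(diff_runs(first_run, second_run))
```

An empty dictionary means the two runs agree on how many hits each taxon received; it does not prove that the individual targets are identical.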

milot-mirdita commented 3 months ago

We had a multi-threading issue that resulted in wrong hits passing or being rejected by the prefilter if a taxonomy expression was given (not only a single taxid). I fixed the issue in both MMseqs2 and Foldseek.

Thanks a lot!

yonesora56 commented 2 months ago

I apologize for the delayed response. Thank you very much for addressing the issue. I'm going to run it and see how it goes!

yonesora56 commented 2 months ago

Only hits for the specified taxonomy IDs are now retrieved! Thank you very much for your support! I will close this issue.