soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

MMseqs2 with NCBI nr database taxonomy issue #856

Open feixiang1209 opened 4 months ago

feixiang1209 commented 4 months ago

Dear MMseqs2 team,

I am trying to create a taxonomy database for the NCBI nr database. First I downloaded nr.fa, taxdump and prot.accession2taxid. Then I ran the commands below:

mmseqs createdb nr.fa nrDB

mmseqs createtaxdb nrDB tmp --ncbi-tax-dump ./taxdump/ --tax-mapping-file ./prot.accession2taxid

After a few hours, the run completed without error. However, the file nrDB_mapping is empty. Could you please advise what I did wrong?

Thanks a lot

AndrazMarinc commented 4 months ago

Probably you've solved it by now, but still. Have you tried using mmseqs nrtotaxmapping after the createtaxdb? I saw your comment in the other issue and I think nrtotaxmapping is the solution. Not 100% sure though.

feixiang1209 commented 4 months ago

Thanks for your reply. I ran mmseqs nrtotaxmapping after "mmseqs createtaxdb nrDB tmp --ncbi-tax-dump ./taxdump/ --tax-mapping-file ./prot.accession2taxid", using the command "mmseqs nrtotaxmapping accession2taxid/prot.accession2taxid nrDB output.tsv". The beginning of output.tsv is shown below. Should I replace nrDB_mapping with this file?

0	1047168
1	185202
2	412384
3	3072323
4	150340
5	1573704
6	2517205
7	286
8	1307
9	2635419
10	34041
11	2212474
12	1487711
13	1871050

AndrazMarinc commented 4 months ago

I think so. I'm looking at the code the mmseqs devs linked in the previous issue and it seems that's what their script does. You'll basically create the ${OUTDB}_mapping file by renaming your tsv.

"${MMSEQS}" nrtotaxmapping "${TMP_PATH}/pdb.accession2taxid" "${TMP_PATH}/prot.accession2taxid" "${OUTDB}" "${OUTDB}_mapping" ${THREADS_PAR}

milot-mirdita commented 4 months ago

Thanks a lot @AndrazMarinc!

That looks correct! You still have to run createtaxdb again after you replace the _mapping file, to create the _taxonomy file that contains all the taxdump information.
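
The steps discussed so far can be sketched as one small shell function. This is only an illustration under assumptions: rebuild_nr_taxonomy is a hypothetical helper name, the mmseqs calls and flags are carried over unchanged from the commands quoted in this thread, and the thread does not say whether --tax-mapping-file is still needed once the _mapping file exists.

```shell
#!/bin/sh
# Sketch only: rebuild_nr_taxonomy is a hypothetical helper, not part of MMseqs2.
# The mmseqs invocations mirror the commands quoted in this thread.
rebuild_nr_taxonomy() {
    db="$1"          # sequence database made with createdb, e.g. nrDB
    acc2taxid="$2"   # path to prot.accession2taxid
    taxdump="$3"     # directory holding the NCBI taxdump files
    tmp="$4"         # temporary directory for createtaxdb

    # 1. Map every database entry to a taxon ID.
    mmseqs nrtotaxmapping "$acc2taxid" "$db" "${db}_mapping.tsv" || return 1

    # 2. Stop early if the mapping came out empty (the symptom reported above).
    if [ ! -s "${db}_mapping.tsv" ]; then
        echo "error: ${db}_mapping.tsv is empty" >&2
        return 1
    fi

    # 3. Put the mapping where createtaxdb expects it.
    mv "${db}_mapping.tsv" "${db}_mapping" || return 1

    # 4. Run createtaxdb again to build the _taxonomy file
    #    (flags assumed to be the same as in the original command).
    mmseqs createtaxdb "$db" "$tmp" --ncbi-tax-dump "$taxdump" --tax-mapping-file "$acc2taxid"
}
```

With the file names used in this thread, the call would be: rebuild_nr_taxonomy nrDB ./prot.accession2taxid ./taxdump/ tmp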

feixiang1209 commented 4 months ago

Thanks a lot @AndrazMarinc and @milot-mirdita. It worked. I also found another solution: the file "prot.accession2taxid" downloaded from NCBI needs to be modified. Only two columns (accession.version and taxid) are needed to run createtaxdb.
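
A minimal sketch of that modification, assuming the usual NCBI layout of prot.accession2taxid: tab-separated, with a header line and the four columns accession, accession.version, taxid and gi. The function name trim_accession2taxid and the output file name are hypothetical.

```shell
#!/bin/sh
# Sketch only: trim_accession2taxid is a hypothetical helper name.
# Assumes a tab-separated input with the columns
# accession, accession.version, taxid, gi.
trim_accession2taxid() {
    in_file="$1"
    out_file="$2"
    # Skip the header line if present, then keep only
    # accession.version (column 2) and taxid (column 3).
    awk -F '\t' '
        NR == 1 && $1 == "accession" { next }
        { print $2 "\t" $3 }
    ' "$in_file" > "$out_file"
}
```

For example, trim_accession2taxid prot.accession2taxid prot.accession2taxid.2col would write a two-column file that could then be passed to createtaxdb via --tax-mapping-file, as in the original command.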