seqan / slimm

Species Level Identification of Microbes from Metagenomes

Accessions not mapped to taxaid #34

Closed by your-highness 5 years ago

your-highness commented 5 years ago

Dear @temehi

$ slimm --version
slimm version: 0.3.4
SeqAn version: 2.4.0

Following the procedure outlined in https://github.com/seqan/slimm/wiki/Preparing-a-custom-database, I would like to build a slimmdb for the following FASTA file:

>NC_024015.1
AGAATTTGCCC
>NC_001798.2
AGTCCCCGTCT
>NC_030692.1
TGTTGCGTTAA
>NC_027892.1
CAGCTCTCGCA
>NC_029642.1
TGTTGCGTTAA

I downloaded taxdump.tar.gz and the *.accession2taxid.gz files as outlined there.

Building the database fails because none of the five accessions can be mapped:

$ slimm_build -v -nm names.dmp -nd nodes.dmp -o test.sldb test.fasta *.accession2taxid.gz
[MSG] getting accessions numbers from fasta file ...
[MSG] mapping accessions to taxaid ...
[VERBOSE MSG] mapping file: [1/3]       iter: [1]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [2/3]       iter: [1]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [2/3]       iter: [2]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [2/3]       iter: [3]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [2/3]       iter: [4]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [1]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [2]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [3]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [4]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [5]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [6]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [7]       accessions left: [5/5]
[VERBOSE MSG] mapping file: [3/3]       iter: [8]       accessions left: [5/5]
[WARNING!] 5 accessions (NC_001798, NC_024015, NC_027892, ...) were not mapped to taxaid.
[WARNING!] Take a look at test.missed file for a complete list.
[WARNING!] Try including the more ACCESSION2TAXAID MAP FILE (e.g. dead_nucl.accession2taxid)
[MSG] loading nodes and names mappings from files ...
[MSG] getting taxonomic linages and resolving names ...

However, the accessions do exist in the *.accession2taxid.gz files:

$ zcat *.accession2taxid.gz | grep -e "NC_024015.1" -e "NC_001798.2" -e "NC_030692.1" -e "NC_027892.1" -e "NC_029642.1"
NC_024015       NC_024015.1        1587534 612184456
NC_001798       NC_001798.2        10310   820945149
NC_027892       NC_027892.1        1715290 931317065
NC_029642       NC_029642.1        1715293 1004345262
NC_030692       NC_030692.1        1714622 1049010306
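
The same check can be scripted so that every FASTA header is tested at once. Below is a minimal Python sketch, independent of slimm_build and not a description of how it maps accessions internally. It assumes the four tab-separated columns seen in the grep output above (accession, accession.version, taxid, gi) and that the accession is the first token of each FASTA header. The function names open_text, fasta_accessions and map_accessions are hypothetical helpers made up for this illustration.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def fasta_accessions(fasta_path):
    """Collect the first token of every FASTA header, e.g. 'NC_024015.1'."""
    accessions = set()
    with open_text(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    accessions.add(fields[0])
    return accessions


def map_accessions(fasta_path, map_paths):
    """Return (mapped, missing) for the accessions found in the FASTA file.

    mapped  : dict of accession.version -> taxid (kept as a string)
    missing : set of accessions not found in any mapping file
    """
    wanted = fasta_accessions(fasta_path)
    mapped = {}
    for map_path in map_paths:
        with open_text(map_path) as handle:
            for line in handle:
                cols = line.rstrip("\n").split("\t")
                # Assumed layout: accession, accession.version, taxid, gi
                if len(cols) >= 3 and cols[1] in wanted:
                    mapped[cols[1]] = cols[2]
    return mapped, wanted - set(mapped)
```

Calling map_accessions("test.fasta", sorted(glob.glob("*.accession2taxid*"))) (with glob imported) would list any header that really is absent from the mapping files, whether they are gzipped or not.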

I also tried unzipping the accession2taxid.gz files, to no avail.

Can you provide any help?

Best,

your-highness commented 5 years ago

Dear @temehi

I don't know what went wrong yesterday, but the command slimm_build -v -nm names.dmp -nd nodes.dmp -o test.sldb test.fasta *.accession2taxid built the database successfully. Yesterday I used the gzipped accession2taxid files and also tried the unzipped ones. I think it was a mistake on my side.

I am really sorry for any inconvenience.

Best