Open ernestorazuri opened 1 month ago
Hi Ernesto,
If I understood correctly, you downloaded COInr and formatted it as a BLAST database using the format_db.pl script with -outfmt blast.
You then ran BLAST against that database and used BLAST's -outfmt argument to request the sequence descriptions and taxonomic information.
The sequence description is not included in COInr, so it is expected that you cannot retrieve it.
You can, however, get the taxIDs of the subject sequences by including staxids in BLAST's -outfmt argument (e.g. -outfmt '6 qseqid sseqid pident length qcovhsp staxids evalue'). All positive taxIDs are valid NCBI taxIDs. The negative values are arbitrary; they refer to taxa not present in the NCBI taxonomy.
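As a rough illustration, the tabular output produced with the -outfmt string above could be read in Python as sketched below. The function name parse_blast_hits and the two example rows are made up for this sketch (the taxIDs shown are not the real ones for those sequences), and it assumes tab-separated output with the columns in exactly that order.

```python
import csv
import io

# Column order matches: -outfmt '6 qseqid sseqid pident length qcovhsp staxids evalue'
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qcovhsp", "staxids", "evalue"]


def parse_blast_hits(handle):
    """Read BLAST tabular output and flag which taxIDs are NCBI taxIDs (positive)."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        hit = dict(zip(COLUMNS, row))
        # staxids can hold several IDs separated by ';'; non-numeric values
        # such as N/A are skipped.
        taxids = [int(t) for t in hit["staxids"].split(";") if t.lstrip("-").isdigit()]
        hit["taxids"] = taxids
        hit["in_ncbi"] = [t > 0 for t in taxids]  # negative = not in NCBI taxonomy
        hits.append(hit)
    return hits


# Made-up example rows, for illustration only.
example = (
    "q1\tKX292979_1\t98.5\t300\t100\t7227\t1e-150\n"
    "q2\tBOLD_COI-5P_BHMKK024-12\t97.0\t310\t99\t-1234\t1e-140\n"
)
hits = parse_blast_hits(io.StringIO(example))
for hit in hits:
    print(hit["qseqid"], hit["sseqid"], hit["taxids"], hit["in_ncbi"])
```

With a real result file, you would pass an open file handle instead of the io.StringIO example.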
If you want the lineage of each taxID or subject sequence, there are several options.
You can use format_db.pl -outfmt vtam. This will create a BLAST database and a taxonomy file with the following columns:
• tax_id
• parent_tax_id
• rank
• name_txt
• old_tax_id (old_tax_id merged to tax_id)
• taxlevel
(see https://mkcoinr.readthedocs.io/en/latest/content/io.html#vtam-database-files). You need a bit of programming to build the lineages by following each tax_id to its parent_tax_id iteratively.
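The iterative walk from a taxID up through its parents could look roughly like the Python sketch below. It assumes the taxonomy file is tab-separated with a header line using the column names listed above, and that the root taxon is its own parent (as in the NCBI taxonomy); please check both against your own file. The function names and the toy taxonomy are invented for illustration, and the old_tax_id column is ignored here.

```python
import csv
import io


def load_taxonomy(handle):
    """Map tax_id -> (parent_tax_id, rank, name_txt) from a tab-separated file."""
    taxonomy = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        taxonomy[int(row["tax_id"])] = (
            int(row["parent_tax_id"]),
            row["rank"],
            row["name_txt"],
        )
    return taxonomy


def lineage(taxid, taxonomy):
    """Return [(tax_id, rank, name), ...] ordered from the root down to taxid."""
    path = []
    seen = set()  # guards against accidental cycles in the file
    while taxid in taxonomy and taxid not in seen:
        seen.add(taxid)
        parent, rank, name = taxonomy[taxid]
        path.append((taxid, rank, name))
        if parent == taxid:  # assumed root convention: root is its own parent
            break
        taxid = parent
    return list(reversed(path))


# Toy taxonomy with made-up IDs and names, for illustration only.
toy = (
    "tax_id\tparent_tax_id\trank\tname_txt\told_tax_id\ttaxlevel\n"
    "1\t1\tno rank\troot\t\t0\n"
    "10\t1\tphylum\tExamplePhylum\t\t2\n"
    "-5\t10\tspecies\tExample sp.\t\t8\n"
)
taxonomy = load_taxonomy(io.StringIO(toy))
names = [name for _, _, name in lineage(-5, taxonomy)]
print(" > ".join(names))
```

To use it with BLAST results, call lineage() on each taxID taken from the staxids column.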
You can use format_db.pl -outfmt full (https://mkcoinr.readthedocs.io/en/latest/content/io.html#full-tsv). There is a lot of redundancy in this file, but it is easy to get the lineages.
If you want to use BLAST for taxonomic assignment, you can try mkLTG (https://github.com/meglecz/mkLTG) with a vtam-formatted COInr database. mkLTG is a BLAST-based LCA method that applies different identity thresholds iteratively.
I hope this helps, Emese
Thanks a lot for such a thorough response. I'll give it a go!
Dear Emese, First of all, thank you for creating this tool and the documentation. They are really helpful. I created a custom database to use with the blast+ command line utility. However, the results only show the BOLD (e.g., BOLD_COI-5P_BHMKK024-12) or NCBI (e.g., KX292979_1) accession numbers. Do you know if this is the intended behavior? I'd like to obtain the sequence description or the associated taxonomy in the results instead. I modified the outfmt flag to include taxonomy information, but I only get N/As. Any insight would be greatly appreciated. Thanks again. Best regards, Ernesto