meglecz / mkCOInr

Make a non-redundant, comprehensive COI database from NCBI and BOLD and include customizing options
MIT License

Sequence name needed in blastn results #6

Open ernestorazuri opened 1 month ago

ernestorazuri commented 1 month ago

Dear Emese,

First of all, thank you for creating this tool and the documentation. They are really helpful. I created a custom database to use with the blast+ command line utility. However, the results only show the BOLD (e.g., BOLD_COI-5P_BHMKK024-12) or NCBI (e.g., KX292979_1) accession numbers. Do you know if this is the intended behavior? I'd like to obtain the sequence description or the associated taxonomy in the results instead. I modified the outfmt flag to include taxonomy information, but I only get N/As. Any insight would be greatly appreciated.

Thanks again. Best regards, Ernesto

meglecz commented 1 month ago

Hi Ernesto,

If I understand correctly, you downloaded COInr and formatted it as a BLAST database using the format_db.pl script with -outfmt blast.

You then ran BLAST against that database, using BLAST's -outfmt argument to request the sequence descriptions and taxonomic information.

The sequence description is not included in COInr, so it is expected that you cannot get it. You can, however, get the taxIDs of the subject sequences using staxids in the -outfmt argument of BLAST (e.g. -outfmt '6 qseqid sseqid pident length qcovhsp staxids evalue'). All positive taxIDs are valid NCBI taxIDs. The negative values are arbitrary; they refer to taxa not present in the NCBI taxonomy.
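As an illustration of the taxID convention described above, here is a minimal Python sketch that parses tabular BLAST output produced with the example -outfmt string and flags whether each subject taxID is an NCBI taxID (positive) or an arbitrary COInr one (negative). The function name `parse_hit`, the query name and the taxID values in the sample lines are invented for the example; only the column order comes from the -outfmt string above.

```python
# Column order taken from: -outfmt '6 qseqid sseqid pident length qcovhsp staxids evalue'
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qcovhsp", "staxids", "evalue"]


def to_int_or_none(text):
    """Return text as an int, or None for values such as 'N/A'."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_hit(line):
    """Parse one tab-separated BLAST hit line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    hit = dict(zip(COLUMNS, fields))
    # staxids can hold several taxIDs separated by ';', or 'N/A' if none is known.
    taxids = [to_int_or_none(t) for t in hit["staxids"].split(";")]
    hit["taxids"] = [t for t in taxids if t is not None]
    # Positive taxIDs are NCBI taxIDs; negative ones are arbitrary COInr taxIDs.
    hit["in_ncbi"] = [t > 0 for t in hit["taxids"]]
    return hit


# Made-up sample lines: the taxIDs 12345 and -42 are placeholders, not real assignments.
sample = [
    "read1\tKX292979_1\t98.5\t313\t100\t12345\t1e-150",
    "read1\tBOLD_COI-5P_BHMKK024-12\t97.1\t313\t100\t-42\t3e-140",
]

for line in sample:
    hit = parse_hit(line)
    print(hit["sseqid"], hit["taxids"], hit["in_ncbi"])
```

If staxids still comes back as N/A for every hit, the list of taxIDs will simply be empty for each line.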

If you want the lineage of each taxID or subject sequence, there are several options.

  1. You can use format_db.pl -outfmt vtam. This creates a BLAST database and a taxonomy file with the following columns:
     • tax_id
     • parent_tax_id
     • rank
     • name_txt
     • old_tax_id (old_tax_id merged to tax_id)
     • taxlevel
     (see https://mkcoinr.readthedocs.io/en/latest/content/io.html#vtam-database-files ). You then need a bit of programming to get the lineages, by following each tax_id to its parent_tax_id iteratively.

  2. You can use format_db.pl -outfmt full (https://mkcoinr.readthedocs.io/en/latest/content/io.html#full-tsv ). There is a lot of redundancy in this file, but it is easy to get the lineages.

  3. If you want to use BLAST for taxonomic assignment, you can try mkLTG (https://github.com/meglecz/mkLTG ) with a vtam-formatted COInr database. mkLTG is a BLAST-based LCA method that applies different identity thresholds iteratively.
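The "bit of programming" mentioned in option 1 could look roughly like the Python sketch below, which walks from a tax_id to its parent_tax_id until it reaches the root. It assumes the taxonomy file is tab-separated with a header row using the column names listed in option 1, and that the root is its own parent; check both against your actual file. The function names and the tiny in-memory table (including all its taxIDs and names) are invented for illustration.

```python
import csv
import io


def load_taxonomy(handle):
    """Read the taxonomy table into {tax_id: (parent_tax_id, rank, name_txt)}.

    Also returns {old_tax_id: tax_id} so merged taxIDs can be resolved.
    """
    taxonomy, merged = {}, {}
    for row in csv.DictReader(handle, delimiter="\t"):
        tax_id = int(row["tax_id"])
        taxonomy[tax_id] = (int(row["parent_tax_id"]), row["rank"], row["name_txt"])
        old = row.get("old_tax_id") or ""
        if old.strip():
            merged[int(old)] = tax_id
    return taxonomy, merged


def lineage(tax_id, taxonomy, merged):
    """Return [(tax_id, rank, name), ...] from the root down to tax_id."""
    tax_id = merged.get(tax_id, tax_id)  # map an old taxID to its current one
    path, seen = [], set()
    while tax_id in taxonomy and tax_id not in seen:
        seen.add(tax_id)  # guards against cycles in a malformed file
        parent, rank, name = taxonomy[tax_id]
        path.append((tax_id, rank, name))
        if parent == tax_id:  # assumed root convention: root is its own parent
            break
        tax_id = parent
    return path[::-1]


# Invented example table; with a real file use: open("taxonomy.tsv", newline="")
rows = [
    ["tax_id", "parent_tax_id", "rank", "name_txt", "old_tax_id", "taxlevel"],
    ["1", "1", "no rank", "root", "", "0"],
    ["10", "1", "phylum", "Phylum_A", "", "2"],
    ["100", "10", "genus", "Genus_A", "99", "7"],
    ["-5", "100", "species", "Species_X", "", "8"],
]
handle = io.StringIO("\n".join("\t".join(r) for r in rows) + "\n")

taxonomy, merged = load_taxonomy(handle)
print(";".join(name for _, _, name in lineage(-5, taxonomy, merged)))
```

Combined with the staxids column of the BLAST output, calling `lineage` on each subject taxID gives the full path for every hit, including the negative taxIDs.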

I hope this helps, Emese

ernestorazuri commented 1 month ago

Thanks a lot for such a thorough response. I'll give it a go!