meglecz / mkCOInr

Make a non-redundant, comprehensive COI database from NCBI and BOLD and include customizing options
MIT License
13 stars, 4 forks

Create format_VSEARCH.py #7

Closed by chiras 1 week ago

chiras commented 1 week ago

Thank you for this tool!

I needed a VSEARCH-compatible output format for global alignments and SINTAX, so I wrote a Python script. If you think it might be useful to others as well, feel free to include it in your scripts.

I have added a requirement that only records resolved to at least family level are included. Otherwise the output should be equivalent to the original database.
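
For readers unfamiliar with the format, here is a minimal sketch of a family-level filter combined with a SINTAX-style header. It is not the submitted script. The function name `sintax_header`, the dictionary-based lineage input, and the choice to replace spaces with underscores are assumptions made for illustration; only the `>id;tax=k:...,f:...;` header layout follows the SINTAX convention used by VSEARCH and USEARCH.

```python
# Hypothetical helper: rank name -> SINTAX one-letter prefix.
SINTAX_PREFIX = {
    "kingdom": "k",
    "phylum": "p",
    "class": "c",
    "order": "o",
    "family": "f",
    "genus": "g",
    "species": "s",
}


def sintax_header(seq_id, lineage):
    """Build a SINTAX-style FASTA header, or return None if no family is known.

    `lineage` maps rank names to taxon names (assumed input shape).
    """
    if "family" not in lineage:
        # Record is not resolved to family level: skip it.
        return None
    fields = [
        f"{prefix}:{lineage[rank].replace(' ', '_')}"
        for rank, prefix in SINTAX_PREFIX.items()
        if rank in lineage
    ]
    return f">{seq_id};tax={','.join(fields)};"


if __name__ == "__main__":
    print(sintax_header("seq1", {"kingdom": "Metazoa", "family": "Lottiidae"}))
    print(sintax_header("seq2", {"kingdom": "Metazoa", "class": "Gastropoda"}))
```

Records for which the function returns `None` would simply be left out of the output file.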

All the best, Alex

meglecz commented 1 week ago

Hi Alex,

Thank you for using COInr and suggesting a script to format the database for SINTAX. It is a very good initiative. However, I prefer to add a SINTAX option to the existing format_db.pl script.

There are a few reasons for this. COInr contains plant, fungal and animal taxa, and there is a lot of homonymy between these groups. This can cause bugs in some taxonomic assignment programs. I have not tested SINTAX, but I know that homonymy is a problem for the RDP classifier. My workaround is to attach taxIDs to the taxon names. The format is a bit awkward, but at least there is no ambiguity, so I would like to stick to this solution.
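
The taxID workaround can be illustrated with a minimal sketch. The helper name `tag_name`, the `name_taxID` layout, and the taxID values are hypothetical and chosen only for illustration; the exact format written by format_db.pl may differ. Morus is a real homonym (a plant genus and a bird genus).

```python
def tag_name(name, taxid):
    """Append a taxID to a taxon name so that homonyms become distinct strings."""
    return f"{name.replace(' ', '_')}_{taxid}"


if __name__ == "__main__":
    # Hypothetical taxIDs: the same genus name used for a bird and for a plant.
    bird = tag_name("Morus", 111)
    plant = tag_name("Morus", 222)
    print(bird, plant)
```

Because the two labels differ, a classifier that keys its taxonomy on names alone no longer sees one taxon with two conflicting parents.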

The NCBI lineages do not always include all major taxonomic levels. For example, the order is missing in the following lineage: ‘Metazoa; Eumetazoa; Bilateria; Protostomia; Spiralia; Lophotrochozoa; Mollusca; Gastropoda; Patellogastropoda; Lottioidea; Lottiidae’. Again, to avoid bugs in taxonomic assignment programs that expect fixed taxonomic levels, I prefer adding a name like Gastropoda_order as a partially meaningful placeholder.
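
The placeholder idea can be sketched as follows. This is an assumption-laden illustration, not the logic of format_db.pl: the rank list `RANKS`, the function name `fill_missing_ranks`, the dictionary input, and the decision to stop at the lowest known rank are all hypothetical.

```python
# Assumed fixed set of major ranks, from highest to lowest.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def fill_missing_ranks(lineage):
    """Return names for every major rank down to the lowest known one.

    A missing rank gets the placeholder '<nearest higher known name>_<rank>'.
    """
    known = [i for i, rank in enumerate(RANKS) if lineage.get(rank)]
    if not known:
        return []
    filled = []
    last_known = None
    for rank in RANKS[: known[-1] + 1]:
        name = lineage.get(rank)
        if name:
            last_known = name
            filled.append(name)
        elif last_known is not None:
            filled.append(f"{last_known}_{rank}")
        else:
            # No higher rank is known at all (assumed fallback label).
            filled.append(f"unknown_{rank}")
    return filled


if __name__ == "__main__":
    lottiidae = {
        "kingdom": "Metazoa",
        "phylum": "Mollusca",
        "class": "Gastropoda",
        "family": "Lottiidae",
    }
    print(fill_missing_ranks(lottiidae))
```

For the Lottiidae lineage above, the missing order is filled with `Gastropoda_order`, so every record exposes the same number of levels down to its lowest resolved rank.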

Finally, I prefer to keep mkCOInr modular. There are different scripts that allow users to select taxa or sequences with a minimum level of resolution, select a particular region, add custom sequences, dereplicate sequences, etc. These are independent of the script that formats the output (format_db.pl). This modularity requires running more commands, but it provides the flexibility to cover many different needs.

Thanks again for pointing out that formatting the database to SINTAX is useful. mkCOInr is now updated and can do that.

Regards, Emese

chiras commented 1 week ago

Fine by me! All the best. I subset the database to animals only afterwards and mostly use it for global alignments with VSEARCH, but it is better to have this integrated directly anyway!

Cheers