meglecz / mkCOInr

Make a non-redundant, comprehensive COI database from NCBI and BOLD and include customizing options
MIT License

using mkCOInr for custom loci #4

Open James-Kitson opened 1 year ago

James-Kitson commented 1 year ago

Hello Emese,

Firstly, thank you for creating this repository; it is a really nice, simple-to-use set of functions for COI databases. I'd like to ask your opinion on using it to make databases for other loci. Specifically, I'm trying to make a custom database for 12S to use with the MiFish primers (Miya et al. 2015).

At first glance, it looks like I can use NSDPY to get a set of 12S sequences by removing the "-cds" flag but after that, I think I cannot get past format_ncbi.pl in mkCOInr as it is looking for COI sequences specifically. Is this correct and if so is there a way to generally format any set of sequences for the remainder of the tools?

Many thanks,

James

meglecz commented 1 year ago

Hi James,

Thanks for your interest in mkCOInr. I had the intention of adapting mkCOInr to non-coding markers, but I have not yet found the time for it.

Yes, you can use nsdpy without the cds option, but this has two limitations that will influence the quality of the final database. First, the 12S marker can be part of relatively long sequences that also contain irrelevant fragments. This is not a serious handicap, but you will drag unnecessary data forward. Second, the sequences are not necessarily correctly oriented, which hinders the dereplication step and causes problems in some assignment tools.
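To illustrate why orientation matters for dereplication, here is a minimal Python sketch. It is not taken from mkCOInr and is not how dereplicate.pl or select_region.pl work; the function names are hypothetical. It only shows that the same fragment read from opposite strands looks like two different strings until both are brought to a common orientation.

```python
# Illustration only: hypothetical helpers, not part of mkCOInr or nsdpy.
# Assumes plain A/C/G/T sequences (no ambiguity codes).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Pick one orientation deterministically: the lexicographically
    smaller of the uppercased sequence and its reverse complement."""
    forward = seq.upper()
    return min(forward, reverse_complement(forward))


# The same fragment deposited in opposite orientations.
a = "AACG"
b = reverse_complement(a)

# As raw strings they differ, so a naive exact-match dereplication
# would keep both; after canonicalization they collapse to one.
raw_unique = {a, b}
oriented_unique = {canonical(a), canonical(b)}
```

Note that picking the lexicographically smaller strand only makes duplicates comparable; it does not tell you which strand is the forward one. Orienting against a reference, as select_region.pl is described as doing below, is what addresses that.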

The format_ncbi.pl script was designed to analyse CDS fasta files. It can be adapted to the gene features format, but this needs some testing and a bit of trial and error for each marker. For example, the regular expressions used to recognize non-standardized gene names have to be established carefully. I have also noticed that some sequences in NCBI are not annotated correctly, and that the gene features format is not always available or can contain erroneous information. Unfortunately, I cannot adapt this script to non-coding genes within the next 3 weeks.

I suggest the following workaround:

  1. Use nsdpy with the -i and -t options. This will produce a sequences.tsv file with Name, SeqID, TaxID, Lineage, sequence length and sequence columns. You can easily reformat this file into a sequence tsv file with taxIDs.
  2. Make a taxonomy file using download_taxonomy.pl
  3. Run dereplicate.pl
  4. You can run select_region.pl to get only the amplicon you are interested in. In fact, this script both corrects the sequence orientation and keeps only the relevant parts of the sequences. You might need to test different values for the identity parameter. I benchmarked it for COI, where 0.7 is a good compromise between false negatives and false positives. I am not sure whether this value is fine for 12S. It also depends on whether you are focusing on a relatively small taxonomic group or want wide taxonomic coverage.
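The reformatting in step 1 could be sketched in Python as follows. This is only a sketch under stated assumptions: the input is tab-separated with the six columns in the order listed above, and the output layout (seqID, taxID, sequence, with a header line) is my assumption about what the downstream scripts expect, so please check it against the mkCOInr documentation. The function name and the file names in the comment are hypothetical.

```python
# Sketch only: column positions follow the order given in step 1
# (Name, SeqID, TaxID, Lineage, sequence length, sequence).
# ASSUMPTION: the target layout is seqID<TAB>taxID<TAB>sequence with a
# header line; verify this against the mkCOInr documentation.
SEQID_COL, TAXID_COL, SEQ_COL = 1, 2, 5


def nsdpy_to_sequence_tsv(in_handle, out_handle):
    """Copy SeqID, TaxID and sequence from an nsdpy-style tsv to a
    three-column tsv. Returns the number of records written."""
    out_handle.write("seqID\ttaxID\tsequence\n")
    kept = 0
    for line in in_handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= SEQ_COL:
            continue  # malformed or truncated line
        seq_id = fields[SEQID_COL].strip()
        tax_id = fields[TAXID_COL].strip()
        sequence = fields[SEQ_COL].strip().upper()
        # Skip records without a numeric taxID; this also drops a
        # header line if the input happens to have one.
        if not seq_id or not tax_id.isdigit() or not sequence:
            continue
        out_handle.write(f"{seq_id}\t{tax_id}\t{sequence}\n")
        kept += 1
    return kept


# Example call (file names are placeholders):
# with open("sequences.tsv") as fin, open("12S_sequences.tsv", "w") as fout:
#     print(nsdpy_to_sequence_tsv(fin, fout), "records written")
```

Records without a numeric taxID are dropped silently here; you may prefer to log them, since they can point to annotation problems in the source records.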

I hope this helps! I will be happy to help you if you are stuck, and to hear about the outcome of your work.

Emese

James-Kitson commented 1 year ago

Thank you Emese,

I will definitely give this a go later in the week. I'm mainly interested in fish, so hopefully the task should be fairly straightforward.

Many thanks for your help,

James