meglecz / mkCOInr

Make a non-redundant, comprehensive COI database from NCBI and BOLD and include customizing options
MIT License

Can't locate mkdb module #1

Closed louis94270 closed 2 years ago

louis94270 commented 2 years ago

Hi Emese,

I'm trying to use the COInr database to analyze river eDNA samples for macro-invertebrate diversity using the DADA2 pipeline. I would like to sub-sample the database for only my target taxa. I have tried to follow your tutorial, but I run into a problem when I enter this command:

perl select_taxa.pl -taxon_list ../custom/macro_taxa_list.txt -tsv ../custom/COInr_custom.tsv \
  -taxonomy ../custom/taxonomy_updated.tsv -min_taxlevel species -outdir ../custom/selected/ \
  -out COInr_custom_selected.tsv

I'm not sure my argument settings are right yet, but my main problem at the moment is that I get this error:

Can't locate mkdb.pm in @INC (you may need to install the mkdb module)

I'm not familiar with perl so I don't really know what to make of this error but it seems like I'm missing a module. After some research online, I couldn't find a way to install this module. Is this a custom module ? I'm wondering how to solve this issue. Hope you can help.

Thank you very much for creating the database, I have been trying unsuccessfully to create a database for COI from both NCBI and BOLD for a while now. I'm really hopeful that I can use your database and format it for the DADA2 assignTaxonomy.

Thanks again, Louis

meglecz commented 2 years ago

Hi Louis,

You have to run the scripts from the mkCOInr/scripts directory or if you want to run it from elsewhere, you can make a symbolic link of the mkdb.pm file to somewhere easier to find:

ln -s mkCOInr/scripts/mkdb.pm /etc/perl/mkdb.pm

If I understand correctly, your first step is to select sequences for a list of taxa (macro_taxa_list.txt). In that case, your inputs will be the COInr.tsv and taxonomy.tsv files as you downloaded them from Zenodo.

Let’s say you start from a file structure like this:

mkCOInr
├── COInr
│   ├── COInr.tsv
│   └── taxonomy.tsv
├── macro_taxa
│   └── macro_taxa_list.txt
└── scripts
    ├── add_taxids.pl
    ├── dereplicate.pl
...

The command will look something like this:

cd mkCOInr/scripts
perl select_taxa.pl -taxon_list ../macro_taxa/macro_taxa_list.txt -tsv ../COInr/COInr.tsv -taxonomy ../COInr/taxonomy.tsv -min_taxlevel species -outdir ../macro_taxa -out COInr_macro_taxa.tsv

Then you can run format_db.pl on the output files, which are in the macro_taxa directory. I guess the rdp option will be the best for DADA2 assignTaxonomy, but I am not absolutely sure. Keep me updated.

I hope this helps, Emese

louis94270 commented 2 years ago

Hi Emese,

Thank you for the quick response! I was running from the scripts directory, but strangely it still did not work. I used the symbolic link and it is running now. Thank you very much! If anyone has the same problem, be careful to use an absolute path for SOURCE in:

ln -s SOURCE TARGET.
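
To show why the absolute path matters: ln -s stores SOURCE verbatim, and a relative SOURCE is later resolved relative to the directory that contains the link, not the directory where ln was run. Here is a small sketch in a temporary directory; the file is just a placeholder, not the real mkdb.pm.

```shell
# Throwaway layout; all paths are hypothetical stand-ins.
demo=$(mktemp -d)
mkdir -p "$demo/scripts" "$demo/lib"
echo '1;' > "$demo/scripts/mkdb.pm"   # placeholder file

cd "$demo"

# Relative SOURCE: the link text "scripts/mkdb.pm" is resolved from lib/,
# i.e. lib/scripts/mkdb.pm, which does not exist, so the link dangles.
ln -s scripts/mkdb.pm lib/relative.pm

# Absolute SOURCE: resolves to the same file wherever the link lives.
ln -s "$demo/scripts/mkdb.pm" lib/absolute.pm

# [ -e ] follows symlinks, so it is false for a dangling link.
if [ -e lib/relative.pm ]; then relative_ok=yes; else relative_ok=no; fi
if [ -e lib/absolute.pm ]; then absolute_ok=yes; else absolute_ok=no; fi

echo "relative link resolves: $relative_ok"
echo "absolute link resolves: $absolute_ok"
```

With this layout the relative link is broken and the absolute one works.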

I will try the rdp option and let you know how it does.

Thanks again, Louis

louis94270 commented 2 years ago

Hi Emese,

Just letting you know that the rdp option seems to work perfectly! I have some memory issues to work out before fully testing it, but using a small subset of the database in DADA2 looked good.

Is there a simple way to modify format_db.pl so that it does not include the taxID in the taxonomy? I can always do it down the line, but I was just wondering.

Thanks again for the great tool !

Louis

meglecz commented 2 years ago

Hi Louis,

Thanks for the feedback ! Yes, training a classifier needs a lot of memory, but that is the price to pay to have your private database with only the reference sequences you want 😉

I know that the taxon names with the taxIDs are ugly, but this is a way to avoid issues with homonymy. So, I would not eliminate them, because you might run into some problems when training your database.

On the other hand, you can eliminate them in the output of taxonomic assignments. The format is taxon_name_taxid, where taxid is a positive or negative integer. So you just have to delete everything starting from the last underscore. In perl, the regular expression that corresponds to this is something like _[-0-9]+$, and this syntax can be used in many text editors.
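
As a rough sketch, the same pattern can be applied from the command line with sed in extended-regex mode. The strip_taxid function name and the two example labels are made up for illustration; they are not real COInr entries.

```shell
# Remove a trailing _taxid (positive or negative integer) from a label.
strip_taxid() {
  printf '%s\n' "$1" | sed -E 's/_[-0-9]+$//'
}

strip_taxid "Genus_species_12345"   # prints Genus_species
strip_taxid "Some_taxon_-7"         # prints Some_taxon
```

Because the pattern is anchored at the end of the line, underscores inside the taxon name itself are left alone. For a whole file of assignments, the same sed expression would need to be adapted to the column layout of that file.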

Cheers, Emese

louis94270 commented 2 years ago

Great ! I'll take the taxIDs off the output of taxonomic assignments then.

Thanks for the clarifications !

Louis