gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License

tax_format: lose species names for dad format #65

Closed nroux66 closed 2 months ago

nroux66 commented 2 months ago

First, thank you very much for all the detailed explanations provided for running your program. It's really great, especially for people who are just beginning programming. I have successfully retrieved marine mammal sequences for a 16S marker. However, I noticed that when I run the tax_format function to save my database for use in dada2 (using the format dad), I lose the species name in the process and keep only the genus. Is this normal? If not, how can I solve it? I double-checked that all the taxonomic information, including the species name, is in the input file passed to tax_format. Here is an example line from that file:

OP205218 302098 Eukaryota Chordata Mammalia Artiodactyla Balaenidae Eubalaena Eubalaena_japonica DNA SEQUENCE

And here is what I get after running the command below on WSL:

Eukaryota;Chordata;Mammalia;Artiodactyla;Balaenidae;Eubalaena

Command used:

docker run --rm -it \
  -v $(pwd):/data \
  --workdir="/data" \
  quay.io/swordfish/crabs:0.1.7 \
  crabs tax_format \
  --input cetacea_16SMarver3_singlespFilt.tsv \
  --output cetacea_16SMarver3_singlespFiltDADA2.fasta \
  --format dad

Thank you very much for your help.

gjeunen commented 2 months ago

Hello @nroux66,

Thank you for using CRABS.

When we developed CRABS, the reference database for dada2 was split into two files, which correspond to the --format dad and --format dads options. The --format dad option is the one you are using right now, and its output does not contain the species information. The --format dads option produces a second database that contains the species information for each reference sequence. I believe you need both for dada2.
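For readers following along, here is a minimal Python sketch of how one input row could map onto the two kinds of header described above. It is not the CRABS source code. The column layout comes from the example row earlier in this thread, the helper names are hypothetical, and the `>ID Genus species` shape of the species-level header is an assumption based on what DADA2's assignSpecies expects, not something confirmed for CRABS here.

```python
# Illustrative sketch only: this is not the CRABS source code.
# Column layout (accession, taxid, seven ranks, sequence) follows the example
# row quoted in this thread; "ACGT" stands in for the real sequence.
EXAMPLE_ROW = "\t".join([
    "OP205218", "302098",
    "Eukaryota", "Chordata", "Mammalia", "Artiodactyla",
    "Balaenidae", "Eubalaena", "Eubalaena_japonica",
    "ACGT",
])


def parse_row(row):
    """Split one tab-delimited row into accession, rank list, and sequence."""
    fields = row.rstrip("\n").split("\t")
    return fields[0], fields[2:9], fields[9]


def genus_level_header(ranks):
    """Header with domain through genus, matching the output reported above."""
    return ">" + ";".join(ranks[:6])


def species_level_header(accession, ranks):
    """Assumed '>ID Genus species' header for the species-level file."""
    return ">" + accession + " " + ranks[6].replace("_", " ")


accession, ranks, sequence = parse_row(EXAMPLE_ROW)
print(genus_level_header(ranks))
print(species_level_header(accession, ranks))
```

The genus-level header stops at the sixth rank, which is why the species name is absent from the first file even though it is present in the input.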

Please let me know if the formatting requirements have changed in newer versions of dada2, and we can adjust the CRABS output to match.

I hope this helps.

Best wishes, Gert-Jan

nroux66 commented 2 months ago

Dear Gert-Jan,

Thank you very much for your quick reply. I have double-checked the format for the reference database in DADA2. For DADA2's assignSpecies function, the --format dads output (ID followed by the species) matches the DADA2 requirements.

However, for DADA2's assignTaxonomy function, the default format appears to be kingdom;phylum;class;order;family;genus;species, whereas your --format dad produces domain;phylum;class;order;family;genus, without the species level.
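As an interim workaround, a seven-level header can be built directly from the tab-delimited input. The sketch below is hypothetical and assumes the column layout of the example row earlier in the thread. The function name and the `trailing_semicolon` switch are made up for illustration: DADA2's documentation shows headers ending in a semicolon, but whether that is required is not established in this thread. The species field is copied as it appears in the input, underscore included.

```python
# Hypothetical workaround sketch, not part of CRABS or DADA2.
# Assumes ten tab-delimited columns: accession, taxid, seven ranks, sequence.
RANK_NAMES = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def seven_rank_record(row, trailing_semicolon=False):
    """Return a two-line FASTA record whose header lists all seven ranks."""
    fields = row.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError("expected 10 columns, found %d" % len(fields))
    ranks = fields[2:9]
    if not all(ranks):
        raise ValueError("empty rank in row for accession " + fields[0])
    header = ">" + ";".join(ranks)
    if trailing_semicolon:
        header += ";"
    return header + "\n" + fields[9] + "\n"


row = ("OP205218\t302098\tEukaryota\tChordata\tMammalia\tArtiodactyla\t"
       "Balaenidae\tEubalaena\tEubalaena_japonica\tACGT")  # ACGT is a placeholder
print(seven_rank_record(row), end="")
```

Rows with a missing rank raise an error instead of silently producing a shorter header, since a shifted rank would end up assigned to the wrong taxonomic level.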

So I assume the DADA2 requirements have changed since you defined the formatting for your program.

Thank you again! Best, Natacha

gjeunen commented 2 months ago

Hello Natacha,

Thank you very much for this information. I will update the CRABS code to accommodate the new formatting. I'll keep you posted on when I have a working version. It should be by the end of this week.

Best wishes, Gert-Jan

gjeunen commented 2 months ago

Hello Natacha,

In the latest update, you can use the option crabs tax_format --format 'dada2'. This should produce the correct formatting for import into DADA2. The option is available in CRABS version 0.1.9 (check with crabs --version). Please make sure to clone the CRABS code from GitHub, as this is currently the only place where the option is available. We will update the other environments, such as Docker, shortly.
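One quick way to sanity-check a reformatted file is to count the taxonomic levels in each FASTA header. The sketch below is an assumption-laden example, not a CRABS feature: it presumes the taxonomy sits in the header line, separated by semicolons, with seven levels expected as discussed earlier in the thread. The function name is hypothetical.

```python
import io


def headers_with_wrong_depth(lines, expected_levels=7):
    """Return FASTA headers whose semicolon-separated level count differs."""
    flagged = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        # Drop empty pieces so an optional trailing semicolon is not counted.
        levels = [part for part in line[1:].strip().split(";") if part]
        if len(levels) != expected_levels:
            flagged.append(line.strip())
    return flagged


# Small in-memory example; in practice, pass an open file handle instead.
example = io.StringIO(
    ">Eukaryota;Chordata;Mammalia;Artiodactyla;Balaenidae;Eubalaena\n"
    "ACGT\n"
    ">Eukaryota;Chordata;Mammalia;Artiodactyla;Balaenidae;Eubalaena;Eubalaena_japonica\n"
    "ACGT\n"
)
for header in headers_with_wrong_depth(example):
    print("unexpected number of levels:", header)
```

With the example data, only the six-level header is reported, which is the pattern described at the top of this issue.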

I will close this issue now, but feel free to reopen if something with this new option is not correct.

Best wishes, Gert-Jan

nroux66 commented 2 months ago

Hello Gert-Jan,

thank you very much for your quick replies and for having made that change.

Best

Natacha


--

Dr. Natacha ROUX, Postdoctoral Researcher, CRIOBE UAR3278, 58 avenue Paul Alduy, 66000 Perpignan
