gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License

What delimiter to use with db import for standard NCBI and EMBL databases #31

Closed slambrechts closed 1 year ago

slambrechts commented 1 year ago

Hi,

I read:

CRABS will automatically format the downloaded sequences to a simple two-line fasta format with NCBI accession numbers as header information and delete the original fasta file. When accession numbers are unavailable, CRABS will generate unique sequence IDs using the following format: 'CRABS_[num]:species_name'

I assume --delim '_' is not ideal then, because when accession numbers are unavailable CRABS generates unique sequence IDs of the form 'CRABS_[num]:species_name', which themselves contain an underscore?

In the first scenario I assume there is no problem, since there are only NCBI accession numbers as headers?

Kind regards, Sam

gjeunen commented 1 year ago

Hello @slambrechts,

Thank you for your query.

If you are using the function db_download to download data from BOLD and NCBI, there is no need to use db_import, as CRABS does this automatically. If, on the other hand, you want to import your own barcodes, you can tell CRABS where to find the accession number or species info in your sequence headers. For example, let's say that your fasta file is structured as below:

>Homo sapiens; A BUNCH OF METADATA
ACGT

You can tell CRABS that the header holds species info with -s species, and that this info ends at the delimiter with -d ;. The species or accession info needs to be placed before the delimiter. Metadata needs to be removed for CRABS to work, as CRABS uses the full header to determine taxonomy at a later stage. If your data is structured in a different way, where species or accession info is not placed before the delimiter, this is unfortunately not currently feasible in CRABS. Please let me know if that is the case and I'll add functionality that lets you specify where CRABS can find this info in your header.
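To illustrate the "info before the delimiter" rule, here is a minimal Python sketch. It is not CRABS's actual code: the helper name field_before_delimiter and the example ID CRABS_1:Homo_sapiens are hypothetical and only follow the header patterns mentioned in this thread.

```python
def field_before_delimiter(header: str, delim: str) -> str:
    """Return the header text that precedes the first delimiter.

    Hypothetical helper for illustration only; not part of CRABS.
    """
    # Drop the leading '>' of a fasta header line, if present.
    text = header[1:] if header.startswith(">") else header
    # Keep only what comes before the first occurrence of the delimiter.
    return text.split(delim, 1)[0].strip()


# Header from the example above: species info sits before ';'.
print(field_before_delimiter(">Homo sapiens; A BUNCH OF METADATA", ";"))

# Hypothetical ID following the 'CRABS_[num]:species_name' pattern:
# splitting on '_' keeps only the 'CRABS' prefix.
print(field_before_delimiter(">CRABS_1:Homo_sapiens", "_"))
```

The first call prints Homo sapiens. The second prints only CRABS, which shows why a delimiter that also occurs inside the ID or species name would cut the field short.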

I hope this answers your question, but please let me know if something is not clear or if I have misinterpreted your query.

Best regards, Gert-Jan

slambrechts commented 1 year ago

Dear Gert-Jan,

OK, thank you for the info. For now we want to create a reference database using the standard NCBI and/or EMBL databases only. So if I understand correctly, we can go directly from db_download to db_merge, to merge the downloaded NCBI and EMBL databases? FYI, we are using primers that target mitochondrial 16S genes, one set targeting Collembola, the other targeting Oligochaeta.

Kind regards, Sam

gjeunen commented 1 year ago

Dear @slambrechts,

Yes, please move straight from db_download to db_merge.

Thanks, Gert-Jan