gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License
21 stars 8 forks source link

Extract amplicon sequences from downloaded reference database #20

Open FabioGAmaral opened 1 year ago

FabioGAmaral commented 1 year ago

Hi, I am testing crabs to get a reference library of 12S. I noticed that after step 4 I got some small size sequences (for example 20 bps) and in other cases some sequences that bind only by one of the primers (sometimes fwd and sometimes rev) which should not happen when I use the following specifications "--percid 0.95 --coverage 0.95 --filter_method strict". What could this problem be due to? Is there any way to retain only the regions between the 2 primers?

gjeunen commented 1 year ago

Hello @FabioGAmaral,

Thank you for your query.

It is a bit difficult to provide an answer based on the limited provided information. I'm guessing you are trying to run the pairwise global alignment command (function: pga)? Based on the parameters you mention, you should only receive sequences that are at least 95% similar and cover at least 95% of any of the database sequences. These database sequences should be retrieved through the insilico_pcr function and, therefore, should contain the full amplicon sequence without primer-binding regions. Can you verify that the output of the insilico_pcr function contains only full-length amplicons? Or does this file contain sequences of length 20 bps?

I'm not sure what you mean by "bind only by one of the primers". My apologies, but could you please expand? Could you also please provide the code you used and some examples of the sequences you are referring to?

Best regards, Gert-Jan

FabioGAmaral commented 1 year ago

Hello, sorry for not explain it well. The output of the insilico_pcr contains only full-length amplicons (at least all of them have the , however during the pga some amplicons have only 20 bps. The command I'm using is: crabs pga --input "output of crabs db_download" --output Teleostei_MiFishU_teste.fasta --database "insilico_PCR output" --fwd "primer" --rev "primer" --speed medium --percid 0.95 --coverage 0.95 --filter_method strict.

The size of the in-silico PCR is as expected (~170 bps), but the pga output shows very small fragments even with coverage 0.95 and percid also 0.95

gjeunen commented 1 year ago

Hello @FabioGAmaral,

I have not come across this issue before and will investigate. For now, you could filter the short reads using the seq_cleanup function.

Best regards, Gert-Jan

FabioGAmaral commented 1 year ago

Hello, I have another issue. When I'm doing crabs visualization --method primer_efficiency. I find this error written 89916 sequences from Vert_Ib.fasta into a dictionary Traceback (most recent call last): File "/home/whoami/reference_database_creator/crabs", line 1428, in main() File "/home/whoami/reference_database_creator/crabs", line 1425, in main args.func(args) File "/home/whoami/reference_database_creator/crabs", line 981, in visualization orig = str(raw_dict[id]) KeyError: 'AF402623'

What could it be?

gjeunen commented 1 year ago

Hello @FabioGAmaral,

Based on the error it seems that the seqID AF402623 is in the input file, but cannot be found in the original raw sequencing file. Is this correct?

Best regards, Gert-Jan

FabioGAmaral commented 1 year ago

Based on the error it seems that the seqID AF402623 is in the input file, but cannot be found in the original raw sequencing file. Is this correct? Exactly.

I do have another doubt/question. During db_download using BOLD I noticed that the sequences that have gaps are eliminated. is there a possibility to remove the gaps from the seqs instead of remove the seqs?

Best regards, Fábio

gjeunen commented 1 year ago

Hello @FabioGAmaral,

This would require me to change a bit of the code base and add this functionality in. I'll try to get this incorporated by the end of next week.

Best, Gert-Jan

FabioGAmaral commented 1 year ago

Hello @FabioGAmaral,

This would require me to change a bit of the code base and add this functionality in. I'll try to get this incorporated by the end of next week.

Best, Gert-Jan

Hello, any news about it?

gjeunen commented 1 year ago

Hello @FabioGAmaral,

Apologies for the delay, it has been a hectic time!

This functionality has now been made available in version 0.1.5. As of now, you can only download this version through GitHub. We will try to update the Docker and conda versions as soon as possible, but there might be some delay on this. To check your version of CRABS, please use the following code: crabs --version.

To incorporate BOLD sequences with gaps into the output file, you can set the -g, --boldgap parameter to INCORPORATE. This will remove the gaps of the BOLD sequence, i.e., remove the '-' symbols, and add the sequence to the output file. By default this parameter is set to DISCARD.

I tested the new parameter using the following 2 lines of code:

crabs db_download -s bold -db Elasmobranchii -o bold_elasmobranchii_20230607.fasta -d bold_elasmobranchii_20230607_discard.fasta 
db_download -s bold -db Elasmobranchii -o bold_elasmobranchii_20230607.fasta -d bold_elasmobranchii_20230607_discard.fasta -g INCORPORATE

By setting the -g, --boldgap parameter to INCORPORATE, CRABS managed to add all 23,964 initially downloaded Elasmobranchii sequences to the output file for downstream analysis.

PLEASE NOTE: I have not looked further into how incorporating these sequences in the CRABS database might impact/influence downstream analyses and results.

If you encounter any issues or bugs when using the new parameter, please feel free to comment below.

Best, Gert-Jan

FabioGAmaral commented 1 year ago

Good afternoon, It worked very well thanks.

Sorry for the inconvenience, but is there a possibility to have a mapping file with the information between the two files that the program creates in the initial step (download) because one of the files does not indicate the gene and for bold the CRABS program does not allow to filter by the gene.

Best, Fábio Amaral

gjeunen commented 1 year ago

Hello @FabioGAmaral,

I'm not sure I understand what you're referring to. Could you please clarify, maybe with an example?

Thanks, Gert-Jan

FabioGAmaral commented 1 year ago

When i download sequences through BOLD, crabs creates 2 files with different headers. In one of them, more complete, it says which gene that sequence refers to, and in the other it doesn't. Is it possible to keep the header in both sequences and the during the following points.

gjeunen commented 1 year ago

Hello @FabioGAmaral,

I'll have to look into the formatting and gene option available to design a parameter for selecting which genes to keep. Please give me a couple days to investigate.

Thanks, Gert-Jan

FabioGAmaral commented 1 year ago

Got any news?

gjeunen commented 10 months ago

Hello @FabioGAmaral,

My sincere apologies for the long pause in the communication!

I have now added in the option of genetic marker selection for downloading sequences from BOLD via the -m, --marker selection. This will allow you to specify which barcoding gene to download from the BOLD website. Please find more information in the updated README document. Please install CRABS --version 0.1.7 via GitHub for this functionality. We'll try to update the Docker and conda versions in the near future, but this could take a while. I hope this is what you were after?

Unfortunately, changing the sequence headers to display this information will be difficult to implement, as the header structure is used downstream for taxonomy assignment. I'm currently working on a full overhaul of the code with additional functionality to CRABS and will keep header structure in mind for the next big version update.

Best regards, Gert-Jan