gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License
24 stars 8 forks source link

In silico PCR (Fatal error) #56

Open simiaoyan opened 6 months ago

simiaoyan commented 6 months ago

Hello Gert-Jan,

I have the same problem as #12. I've used the latest version 0.2.0 of CRABS, but I still get the error. Here are my code and results.

crabs insilico_pcr --input BOLD.fasta --output BOLD_020.fasta --fwd GGDACWGGWTGAACWGTWTAYCCHCC --rev CAAACAAATARDGGTATTCGDTY --error 3

reading BOLD.fasta into memory 100%|████████████████████████████████████████████████████████████████████████████████████████████▋| 4620813830/4634397346 [00:08<00:00, 530064189.39it/s] found 6791758 sequences in BOLD.fasta running in silico PCR on fasta file containing 6791758 sequences counting the number of sequences found by in silico PCR 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 601482108/601482108 [00:02<00:00, 213284734.54it/s] found primers in 3844159 sequences reading sequences without primer-binding regions into memory 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 2018376375/2018376375 [00:02<00:00, 719837247.72it/s] reverse complementing 2947599 untrimmed sequences

Fatal error: Illegal character '-' in sequence on line 4898 of FASTA file running in silico PCR on 2947599 reverse complemented untrimmed sequences counting the number of sequences found by in silico PCR 0it [00:00, ?it/s] found primers in 0 sequences**

I have tried to replace the '-' in the input fasta file with a space, that is ' ', which shows that the operation is successful. Is this practice OK? If not, how can I solve the problem? Thank you very much!

reading BOLD_space.fasta into memory 100%|████████████████████████████████████████████████████████████████████████████████████████████▋| 4425833002/4439416518 [00:19<00:00, 225796614.77it/s] found 6791758 sequences in BOLD_space.fasta running in silico PCR on fasta file containing 6791758 sequences counting the number of sequences found by in silico PCR 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 601781229/601781229 [00:02<00:00, 209925508.46it/s] found primers in 3848367 sequences reading sequences without primer-binding regions into memory 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1911791503/1911791503 [00:02<00:00, 672880211.06it/s] reverse complementing 2943391 untrimmed sequences WARNING: 3 invalid characters stripped from FASTA file: I(3) REMINDER: vsearch does not support amino acid sequences running in silico PCR on 2943391 reverse complemented untrimmed sequences counting the number of sequences found by in silico PCR 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 72190/72190 [00:00<00:00, 133917207.32it/s] found primers in 460 sequences

Best, simiaoyan

gjeunen commented 6 months ago

Hello @simiaoyan,

Can you please provide an example of the sequence that is causing the issue?

Thanks, Gert-Jan

simiaoyan commented 6 months ago

Hello Gert-Jan,

Here are two sequences in the BOLD database that I used for in silico PCR. An error about "Fatal error: Illegal character '-' in sequence on line 4898 of FASTA file" occurs because there are some '-' in the sequences.

ACUFI1819-15 -----------------------------------------------------------TAATTCGAATAGAATTAAGTCATCCTGGAATATGAATTAATAATGATCAAATTTATAATTCTTTAGTAACTAGTCATGCATTTTTAATAATTTTTTTTATAGTAATACCATTTATAATCGGAGGATTTGGAAATTATTTAATTCCATTAATACTAGGATCACCAGATATAGCTTTCCCTCGAATAAATAATATTAGATTTTGACTTTTACCTCCATCATTATTTATATTATTATTAAGAAATTTATTTACACCTAATGTAGGTACAGGATGAACTGTTTACCCTCCTTTATCTTCATATTTATTTCATTCATCCCCATCAATTGATATTGCAATTTTTTCCTTACATATATCAGGAATTTCTTCTATTATTGGATCATTAAATTTTATTGTTACTATTTTAATAATAAAAAATTTTTCATTAAATTATGATCAAATTAACTTATTTTCATGATCAGTATGTATTACTGTAATTTTATTAATTTTATCTTTACCAGTATTAGCTGGAGCAATTACTATATTACTTTTTGATCGAAATTTTAATACCTCATTT------------------------------------------------ ACUFI1979-16 -------------------------------GGAATAATCGGGACTGCTTTAAGTATATTAATTCGACTAGAATTAAGATCCCCTAGATCTATATTAGGAAATGATCAAGTTTATAATGCTATAGTTACTAACCACGCTTTTATTATAATTTTTTTTATAGTAATACCATTTATAATTGGGGGGTTTGGAAATTGACTAATCCCTATTATATTGGGGGCCCCTGATATAGCTTTCCCCCGTATAAATAATATAAGTTTTTGGTTGCTGCCCCCCTCAATTTTTTTACTTTTAAGAGGAAGAATAATCGCTGCGGGCCCTGGCACTGGATGAACAATATATCCTCCTCTTTCTCATAATATTTCTCACAGAAGAATATCAGTTGATTTAACTATTTTATCTCTTCATTTAGCAGGAGCTTCATCTATTATAGGGGCAGTTAATTTTATTGCAACAATTTTAAATATAATTCATAATAATATAAAA------------------------ATAACTCAAATAAACTTATTTATTTGATCAATTTTTATTACAGTAATTTTACTATTACTATCCTTACCAGTTTTAGCAGGAGCGATTACTATACTACTAGCTGATCGGAATTTAAATACATCTTTTTTTGACCCTGCTGGGGGGGGTGATCCAATTTTATTCCAACATTTATTT

I try to replace the '-' in the input fasta file with a space, that is ' ', which shows that the operation is successful. But there is a warning about "WARNING: 5492411 invalid characters stripped from FASTA file: (5492408) I(3)". I don't know if this operation will affect the results of in silico PCR. Thank you very much!

Best, simiaoyan

gjeunen commented 6 months ago

Dear Yan Ziling,

If you're using the latest version of CRABS, which is 0.1.8, the - in BOLD sequences will be processed according to the --boldgap parameter. The default setting --boldgap DISCARD will discard these sequences, while the --boldgap INCORPORATE will change the - to `. Upon initial testing, this seems to be working fine. Though further testing would be ideal, including seeing the impact to change-toN`.

Can you please check the version you're using with the following code: crabs --version? You mention version 0.2.0 in your first post, but this version does not exist yet, I believe. When using version 0.1.8, you can specify how the sequences are being treated when downloading the sequences from BOLD as explained above.

Since you already downloaded the sequences, I suggest you either switch the - to ` (not a space, change to nothing), to follow the same implementation in CRABS or change the-toN`, which should work as well.

I hope this helps. Please let me know if you encounter any other errors.

Best, Gert-Jan

simiaoyan commented 6 months ago

Hello Gert-Jan,

The version I currently use is 0.1.4. I have tried to switch the - to `` in the input fasta file, which shows that the operation is successful. But there is a warning about "WARNING: 3 invalid characters stripped from FASTA file: I(3)". Does this warning affect the results of in silico PCR? Thank you!

reading BOLD_Pub_COI_2403.fasta into memory 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌| 4425833002/4439416518 [00:09<00:00, 469831020.23it/s] found 6791758 sequences in BOLD_Pub_COI_2403.fasta running in silico PCR on fasta file containing 6791758 sequences counting the number of sequences found by in silico PCR 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 601781476/601781476 [00:03<00:00, 176826445.15it/s] found primers in 3848367 sequences reading sequences without primer-binding regions into memory 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1911791503/1911791503 [00:03<00:00, 606833407.39it/s] reverse complementing 2943391 untrimmed sequences WARNING: 3 invalid characters stripped from FASTA file: I(3) REMINDER: vsearch does not support amino acid sequences running in silico PCR on 2943391 reverse complemented untrimmed sequences counting the number of sequences found by in silico PCR 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 72190/72190 [00:00<00:00, 113360840.79it/s] found primers in 460 sequences

Best, simiaoyan

gjeunen commented 6 months ago

Dear Yan Ziling,

The issue is originating from VSEARCH, which is used by CRABS during reverse complementing the sequences. It seems some of your sequences contain a I base, which is what the error is reporting. Based on the report, those bases are removed during reverse complementation. Though, you might want to check the source code of VSEARCH to determine what is happening exactly.

However, since it is only 3 residues that are being removed, I doubt there will be an impact on the in silico analysis.

Best regards, Gert-Jan