gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License
25 stars 8 forks source link

In silico PCR #12

Closed sroblet closed 1 year ago

sroblet commented 1 year ago

Hi,

I just tried the function in silico PCR on my merged database. At the begining, it was working well with the first in silico PCR and the primers were found in many sequences but then i had some issues with the second in silico PCR :

crabs insilico_pcr --input merged_table.fasta --output insilico_MiFish.fasta --fwd GTCGGTAAAACTCGTGCCAGC --rev CATAGTGGGGTATCTAATCCCAGTTTG --error 4.5

reading merged_table.fasta into memory 100%|████████████████████████████████████████████| 3128366143/3128366143 [00:01<00:00, 1942824889.56it/s] found 1788382 sequences in merged_table.fasta running in silico PCR on fasta file containing 1788382 sequences counting the number of sequences found by in silico PCR 100%|█████████████████████████████████████████████████| 21454511/21454511 [00:00<00:00, 287835990.25it/s] found primers in 117681 sequences reading sequences without primer-binding regions into memory 100%|████████████████████████████████████████████| 1516920127/1516920127 [00:01<00:00, 1237075523.73it/s] reverse complementing 1670701 untrimmed sequences

Fatal error: Illegal character '-' in sequence on line 16 of FASTA file running in silico PCR on 1670701 reverse complemented untrimmed sequences counting the number of sequences found by in silico PCR 0it [00:00, ?it/s] found primers in 0 sequences

I'm not sure that having error in the second in silico PCR is a big deal, but i prefer to ask.

Thanks again !

sroblet commented 1 year ago

Hi, I think i've found where is the problem. On the line 16 of my fasta file, there are many "-" for some reason, but i will remove then manually. Anyway, If i have 0 hit for the second in silico PCR, i guess that it means that no sequence is in the opposite direction in my database but that is not an error.

gjeunen commented 1 year ago

Hello @sroblet,

Yes, finding 0 sequences when reverse complementing the database is not an issue. It means that no sequences were deposited in the opposite direction (or at least inferred from searching for the forward and reverse sequences. From experience, there is a small percentage of reads that can be recovered this way in most instances.

The error message is pointing towards the reverse complementing step. One of the sequences contains the "-" character, which CRABS cannot reverse complement. It is at this point that the process is terminated. If you mention that this is occurring at line 16, it means only 7 sequences (2-line fasta format) are reverse complemented and searched for in the second in silico PCR step.

Can you please post the sequences below that contain a "-" in the sequence line (if there are multiple sequences)? I will need to know where the "-" is occurring in the line to write an extra piece of code to remove them from the file and stop this error from occurring.

Thanks!

Gert-Jan