gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License
24 stars 8 forks source link

in_silico command error #21

Closed Guillermouceda closed 1 year ago

Guillermouceda commented 1 year ago

Hello,

I have installed CRABS through conda

so I have download sequences for the ITS-2 region from ncbi with the following command:

crabs db_download --source ncbi --database nucleotide --query 'txid35493[ORGN] AND (ITS2[All Fields] OR ITS-2[All Fields] OR "Internal Transcribed Spacer 2"[All Fields]) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]' --output ITS-2-Streptophyta_ncbi.fasta --email guillermoug1203@gmail.com --batchsize 5000 --keep_original yes

the download went OKAY. Now I want to extract the amplicon sequences for the primer I used with the following command:

crabs insilico_pcr --input ITS-2-Streptophyta_ncbi.fasta --output in_silico_streptophyta-ITS-2 --fwd CAWCGATGAAGAACGYAGC --rev RGTTTCTTTTCCTCCGCTTA --error 4.5

I am getting an error message however then there is an output. See the screenshot of my screen:

reading ITS-2-Streptophyta_ncbi.fasta into memory 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 346407515/346407515 [00:01<00:00, 175969411.00it/s] found 372580 sequences in ITS-2-Streptophyta_ncbi.fasta running in silico PCR on fasta file containing 372580 sequences ERROR: Traceback (most recent call last): File "/home/guillermo_u/anaconda3/envs/CRABS/lib/python3.6/site-packages/cutadapt/pipeline.py", line 574, in run for chunk_index, chunk in enumerate(dnaio.read_chunks(f, self.buffer_size)): File "/home/guillermo_u/anaconda3/envs/CRABS/lib/python3.6/site-packages/dnaio/chunks.py", line 80, in read_chunks raise OverflowError('FASTA/FASTQ record does not fit into buffer') OverflowError: FASTA/FASTQ record does not fit into buffer

ERROR: Traceback (most recent call last): File "/home/guillermo_u/anaconda3/envs/CRABS/lib/python3.6/site-packages/cutadapt/pipeline.py", line 641, in run raise e OverflowError: FASTA/FASTQ record does not fit into buffer

Traceback (most recent call last): File "/home/guillermo_u/anaconda3/envs/CRABS/bin/cutadapt", line 10, in sys.exit(main_cli()) File "/home/guillermo_u/anaconda3/envs/CRABS/lib/python3.6/site-packages/cutadapt/main.py", line 877, in main_cli main(sys.argv[1:]) File "/home/guillermo_u/anaconda3/envs/CRABS/lib/python3.6/site-packages/cutadapt/main.py", line 942, in main stats = r.run() File "/home/guillermo_u/anaconda3/envs/CRABS/lib/python3.6/site-packages/cutadapt/pipeline.py", line 841, in run raise e OverflowError: FASTA/FASTQ record does not fit into buffer counting the number of sequences found by in silico PCR 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1185142/1185142 [00:00<00:00, 191754265.76it/s] found primers in 3032 sequences reading sequences without primer-binding regions into memory 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22945399/22945399 [00:00<00:00, 280356618.15it/s] reverse complementing 37349 untrimmed sequences running in silico PCR on 37349 reverse complemented untrimmed sequences counting the number of sequences found by in silico PCR 0it [00:00, ?it/s] found primers in 0 sequences

What is causing the problem? I am sure it is not a problem of the installation cuz I have run the same command with another dataset during the same day

Thank you very much in advance

gjeunen commented 1 year ago

Hello @Guillermouceda,

The error is pointing to memory limitations. Could you increase the memory available to the program and try to run again?

Best regards, Gert-Jan

Guillermouceda commented 1 year ago

Hello Gert_Jan,

Thank you for your quick reply

I am running CRABS in WSL with UBUNTU 22 distribution. My laptop has 16 GB of RAM and a processor Intel Core I7. Do you think this is enough ?

I have tried to do what you recommended to me. I have prioritize RAM usage by WSL, however still I got the same error.On the other hand I run in_silico PCR with following command:

crabs insilico_pcr --input embl_plants.fasta --output in_silico_embl_plants-ITS-2 --fwd CAWCGATGAAGAACGYAGC --rev RGTTTCTTTTCCTCCGCTTA --error 4.5

This worked fine:

reading embl_plants.fasta into memory 100%|█████████████████████████████████████████████████████████████████████████| 4101591495/4101591495 [00:32<00:00, 126805238.08it/s]found 4116703 sequences in embl_plants.fasta running in silico PCR on fasta file containing 4116703 sequences counting the number of sequences found by in silico PCR 100%|██████████████████████████████████████████████████████████████████████████████| 11405881/11405881 [00:00<00:00, 86529020.67it/s]found primers in 26848 sequences reading sequences without primer-binding regions into memory 100%|█████████████████████████████████████████████████████████████████████████| 4064268911/4064268911 [00:09<00:00, 425305252.86it/s]reverse complementing 4089855 untrimmed sequences running in silico PCR on 4089855 reverse complemented untrimmed sequences counting the number of sequences found by in silico PCR 100%|██████████████████████████████████████████████████████████████████████████████████| 621620/621620 [00:00<00:00, 86981259.47it/s]found primers in 190 sequences

This dataset has more sequences than the one is failing:

reading ITS-2-Streptophyta_ncbi.fasta into memory 100%|███████████████████████████████████████████████████████████████████████████| 346407515/346407515 [00:00<00:00, 362636437.64it/s]found 372580 sequences in ITS-2-Streptophyta_ncbi.fasta running in silico PCR on fasta file containing 372580 sequences

Why is this happering ?

Thanks in advance

gjeunen commented 1 year ago

Hello @Guillermouceda,

Apologies, I might have pointed you in the wrong direction. The error is originating from the 3rd-party software cutadapt, not CRABS itself. Based on their code, the error is pointing towards the size of the sequence that is being read into memory (more info: https://github.com/marcelm/dnaio/blob/main/src/dnaio/chunks.py). Does the first file, where the error is occurring, contain sequences with full length genomes? If so, these might cause the memory error on your system and you might want to remove those.

Hopefully, this will solve the issue. If the error still persists, please let me know, and I'll look into it further.

Best regards, Gert-Jan

Guillermouceda commented 1 year ago

Hello Gert-Jan,

I just downloaded sequences from NCBI and EMBL . The code I used for downloading the seqs from NCBI was the following:

crabs db_download --source ncbi --database nucleotide --query 'txid35493[ORGN] AND (ITS2[All Fields] OR ITS-2[All Fields] OR "Internal Transcribed Spacer 2"[All Fields]) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]' --output ITS-2-Streptophyta_ncbi.fasta --email guillermoug1203@gmail.com --batchsize 5000 --keep_original yes

I want to downloaded all the seqs for the ITS-2 region of the stated taxon (i.e. Streptophyta). Thus, if this query is correct there should not be whole genomes among the sequences . Is the query code correct? Should I include in the query a filter for length?

Then since I want my database to be as complete as possible I downloaded the sequences from embl:

crabs db_download --source embl --database 'pln*' --output embl_plants.fasta --keep_original yes

Are these two queries correct?

In theory I should not have whole genomes

Thank you very much !!

gjeunen commented 1 year ago

Hello @Guillermouceda,

I entered your query on the NCBI website to check what is included during the db_download of your query, as I'm not familiar with this taxonomic group. You can do this by copy-pasting what you entered in the --query parameter on the nucleotide search bar on NCBI (screenshot attached). This screenshot shows that CRABS downloaded 437752 sequences if no additional sequences were deposited between now and when you downloaded the data. Also, when sorting by length, the screenshot reveals that multiple genome sequences were downloaded. Can you please confirm this in your file ITS-2-Streptophyta_ncbi.fasta?

If these sequences are in your file, I would guess that they are causing your memory error. Based on the online search, there are only 12 sequences you would need to remove. I pasted the accession numbers below, so that you can remove those. The queries you used are correct, but I would suggest to use a length filter for NCBI searches in the future and set it to the maximum size of your target, e.g., full mitogenome length for mitochondrial markers (~20kb for vertebrates), etc.

Please let me know if those genome sequences are in your input file and if the error persists after removing those sequences.

Accession numbers to remove:

Best regards, Gert-Jan

Screenshot 2023-05-03 at 10 51 55 PM
Guillermouceda commented 1 year ago

Hi Gert-Jan,

I also went myself over the NCBI file and I also discovered them. Hopefully I do not have any problems now.

What about something like that ?

image

I have realized that I have some more sequences like that !

Thank you for your support

gjeunen commented 1 year ago

Hello @Guillermouceda,

Those sequences should be fine, they are only ~5kb long. It's those sequences that are >1MB that can cause issues with limited memory availability. Hopefully, the code will work now. Please let me know if the issue isn't solved.

Best regards, Gert-Jan

Guillermouceda commented 1 year ago

Hi Gert-Jan, The insilico command worked fine after fixing the ncbi.fasta. Now I am having problems with the pga command. I guess it is a problem of memory.

This is the code I used:

crabs pga --input embl_plants.fasta --output pga_embl_ITS-2 --database in_silico_embl_plants-ITS-2 --fwd CAWCGATGAAGAACGYAGC --rev RGTTTCTTTTCCTCCGCTTA --speed medium --percid 0.95 --coverage 0.95 --filter_method strict

reading in_silico_embl_plants-ITS-2 into memory 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 11730205/12027501 [00:00<00:00, 61627857.96it/s] found 27038 number of sequences in in_silico_embl_plants-ITS-2 reading embl_plants.fasta into memory 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 4056368830/4101591495 [01:20<00:00, 50489729.19it/s] found 4116703 number of sequences in embl_plants.fasta analysing 4089665 number of sequences for pairwise global alignment 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4060720950/4060720950 [03:21<00:00, 20175836.46it/s] subsetting embl_plants.fasta by discarding sequences longer than 10,000 bp found 4062038 (99.32%) number of sequences in embl_plants.fasta shorter than 10,000 bp running pairwise global alignment on 4062038 number of sequences. This may take a while! Killed

The process got killed. I guess I should better used a computing grid to run the code on these files.

Thank you very much

gjeunen commented 1 year ago

Hello @Guillermouceda,

The number of sequences you are trying to pairwise compare will require a lot of memory and take a very long time. Maybe try splitting up the files?

Best regards, Gert-Jan

Guillermouceda commented 1 year ago

Hello,

Thank you for your answer. Is CRABS able to do that? Also, I have another question. In my original ncbi.fasta file there were some species I am interested with that after running insilico and pga were lost. Is there a way to get then back ?

Regards

gjeunen commented 1 year ago

Hello @Guillermouceda,

Unfortunately, there is currently no implementation within CRABS to do this automatically. You could split the file up manually or by writing a loop function in bash or python.

If species dropped out during the insilico_pcr analysis, it indicates that primer-binding regions were not observed in the downloaded sequences for this species. Furthermore, if the pga function does not find the amplicon region for these species as well, it indicates that the amplicon (without primer-binding region) is not incorporated into the initially downloaded sequence. Adding those initially downloaded sequences back into your curated reference database won't be of use, as your primer set you're using won't generate the sequence that is available for your species of interest. Hence, these species of interest do not have a reference barcode in the online database yet and can only be included by novel barcoding efforts.

Please let me know if you have any further questions.

Best regards, Gert-Jan