gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License

EMBL issues with in silico PCR #58

Open emm308 opened 2 months ago

emm308 commented 2 months ago

Hello! I am very new to CRABS, but I am running through my bioinformatics protocol in a Linux terminal, and I have successfully downloaded the NCBI and BOLD databases and run them through in silico PCR. I downloaded the EMBL files using:

crabs db_download --source embl --database 'INV*' --output embl_INV.fasta --keep_original no

I have read here that the data download as .dat.gz files and are automatically converted to .fasta, but mine seem to be staying as .dat.gz files. The final download of all of INV from EMBL totals 1.1T. When I try to run these individual files through in silico PCR, it fails.

Is there a way to download EMBL straight to .fasta? I am also wondering whether I am doing something wrong with the in silico PCR. Let me know if I should provide any further code that I am using :) Thank you!

gjeunen commented 2 months ago

Hello @emm308,

Could you please let me know what version of CRABS you are using (crabs --version)? Do you see any error messages after downloading the EMBL .dat.gz files?

I've tried the following line of code on crabs version 0.1.8:

crabs db_download --source embl --database 'INV_10*' --output embl_INV.fasta --keep_original no

I limited the download to a couple of files to speed things up, but please find below the output of CRABS to the Terminal window.

downloading sequences from EMBL
.listing                                [ <=>                                                             ] 541.36K  3.11MB/s    in 0.2s    
STD_INV_10.dat.gz                   100%[================================================================>]  57.08M  29.7MB/s    in 1.9s    
STD_INV_100.dat.gz                  100%[================================================================>] 223.19M  28.2MB/s    in 7.5s    
STD_INV_101.dat.gz                  100%[================================================================>]  40.99M  25.7MB/s    in 1.6s    
STD_INV_102.dat.gz                  100%[================================================================>]  88.85M  27.7MB/s    in 3.2s    
STD_INV_103.dat.gz                  100%[================================================================>] 209.25M  31.0MB/s    in 6.8s    
STD_INV_104.dat.gz                  100%[================================================================>] 632.76M  31.1MB/s    in 22s     
STD_INV_105.dat.gz                  100%[================================================================>] 891.79M  28.4MB/s    in 32s     
STD_INV_106.dat.gz                  100%[================================================================>]   2.05G  30.9MB/s    in 72s     
STD_INV_107.dat.gz                  100%[================================================================>]   9.61G  32.9MB/s    in 5m 38s  
STD_INV_108.dat.gz                  100%[================================================================>]  21.83G  9.72MB/s    in 21m 54s 
STD_INV_109.dat.gz                  100%[================================================================>]   4.57G  9.37MB/s    in 9m 59s  
unzipping file: STD_INV_10.dat
unzipping file: STD_INV_100.dat
unzipping file: STD_INV_101.dat
unzipping file: STD_INV_102.dat
unzipping file: STD_INV_103.dat
unzipping file: STD_INV_104.dat
unzipping file: STD_INV_105.dat
unzipping file: STD_INV_106.dat
unzipping file: STD_INV_107.dat
unzipping file: STD_INV_108.dat
unzipping file: STD_INV_109.dat
formatting STD_INV_10.dat to fasta format
100%|████████████████████████████████████████████████████████████████████████████████████| 393744159/393744159 [00:04<00:00, 82373033.65it/s]
saving STD_INV_10.fasta
formatting STD_INV_100.dat to fasta format
100%|████████████████████████████████████████████████████████████████████████████████████| 828404361/828404361 [00:11<00:00, 70580635.35it/s]
saving STD_INV_100.fasta
formatting STD_INV_101.dat to fasta format
100%|████████████████████████████████████████████████████████████████████████████████████| 216243712/216243712 [00:03<00:00, 70838102.42it/s]
saving STD_INV_101.fasta
formatting STD_INV_102.dat to fasta format
100%|████████████████████████████████████████████████████████████████████████████████████| 372206285/372206285 [00:05<00:00, 69973149.59it/s]
saving STD_INV_102.fasta
formatting STD_INV_103.dat to fasta format
100%|████████████████████████████████████████████████████████████████████████████████████| 780542809/780542809 [00:10<00:00, 71089746.61it/s]
saving STD_INV_103.fasta
formatting STD_INV_104.dat to fasta format
100%|██████████████████████████████████████████████████████████████████████████████████| 2206189377/2206189377 [00:31<00:00, 69714759.82it/s]
saving STD_INV_104.fasta
formatting STD_INV_105.dat to fasta format
100%|██████████████████████████████████████████████████████████████████████████████████| 3125974043/3125974043 [00:44<00:00, 69625169.22it/s]
saving STD_INV_105.fasta
formatting STD_INV_106.dat to fasta format
100%|██████████████████████████████████████████████████████████████████████████████████| 7143484933/7143484933 [01:44<00:00, 68250693.03it/s]
saving STD_INV_106.fasta
formatting STD_INV_107.dat to fasta format
100%|███████████████████████████████████████████████████████████████████████████████▉| 33455529418/33455529420 [08:12<00:00, 67939697.23it/s]
saving STD_INV_107.fasta
formatting STD_INV_108.dat to fasta format
100%|███████████████████████████████████████████████████████████████████████████████▉| 75641145745/75641145749 [18:32<00:00, 67979149.66it/s]
saving STD_INV_108.fasta
formatting STD_INV_109.dat to fasta format
100%|████████████████████████████████████████████████████████████████████████████████| 16253328957/16253328957 [03:58<00:00, 68206991.36it/s]
saving STD_INV_109.fasta
Combining all EMBL downloaded fasta files...

Also, the files were combined as expected into a single fasta document named embl_INV.fasta. I've placed the first lines of that document below, obtained using the head command:

>KT139629
ATANTTGGAACTTCCTTAAGTCTATTAATCCGAGCTGAATTAGGAAACCCAGGATCTCTA
ATCGGTGATGATCAAATTTATAACACTATTGTTACAGCTCACGCTTTTATTATAATTTTT
TTTATAGTTATACCTATTATAATCGGAGGATTTGGAAATTGATTAGTTCCTTTAATATTA
GGAGCCCCTGATATAGCTTTCCCACGAATAAATAATATAAGATTCTGACTTTTACCCCCA
TCTTTAACTCTTTTAATCTCCAGAAGATTAGCAGAAAATGGAGCAGGAACAGGATGAACA
GTTTACCCCCCCTTATCTTCTAATATTGCCCATAGAGGAAGATCTGTAGACTTAGCCATC
TTTTCTCTCCACTTAGCCGGAATTTCTTCTATTCTTGGTGCTATTAATTTTATTACTACT
ATTATTAATATACGCCCTAATAATATAAGATTTGATCGAATACCATTATTTGTTTGAGCC
GTTGGAATTACAGCTTTATTACTTCTTTTATCTCTTCCTGTATTAGCCGGAGCTATTACT

If your files are in .dat.gz format, the in silico PCR will indeed fail, as the files are required to be in .fasta format, and specifically in the CRABS supported .fasta format.

It would be great if we could resolve this issue for future users. However, I understand that since you already have 1T of data, it might be best to transform the .dat.gz files to the correct format and continue with the in silico PCR analysis. Could you please let me know whether you have a single file or multiple files?
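As a rough illustration of what such a transformation could look like, here is a minimal Python sketch. It is not the code CRABS runs internally. It assumes the standard EMBL flat-file layout (an `AC` line holding the accession, an `SQ` line starting the sequence block, and `//` closing each record), and it assumes the target layout is the one visible in the head output above: a header with only the accession, followed by uppercase sequence in 60-character lines. The function names `embl_to_fasta` and `convert_file` are hypothetical, and the real CRABS format may have further requirements.

```python
import gzip
import io


def embl_to_fasta(handle, out, width=60):
    """Stream EMBL flat-file records from `handle`, write FASTA to `out`.

    Returns the number of records written. Only one record is held in
    memory at a time, which matters for multi-gigabyte .dat files.
    """
    accession = None
    in_sequence = False
    chunks = []
    count = 0
    for line in handle:
        if line.startswith("//"):
            # End of record: write it out if it is complete, then reset.
            if accession and chunks:
                seq = "".join(chunks).upper()
                out.write(f">{accession}\n")
                for i in range(0, len(seq), width):
                    out.write(seq[i:i + width] + "\n")
                count += 1
            accession, in_sequence, chunks = None, False, []
        elif in_sequence:
            # Sequence lines hold blocks of bases plus a running position
            # count; keep only the letters.
            chunks.append("".join(c for c in line if c.isalpha()))
        elif line.startswith("AC") and accession is None:
            # "AC   KT139629;" -> keep the first (primary) accession.
            accession = line[2:].split(";")[0].strip()
        elif line.startswith("SQ"):
            in_sequence = True
    return count


def convert_file(dat_gz_path, fasta_path):
    """Convert one gzipped EMBL file to FASTA without unzipping it on disk."""
    with gzip.open(dat_gz_path, "rt") as handle, open(fasta_path, "w") as out:
        return embl_to_fasta(handle, out)


if __name__ == "__main__":
    # Tiny in-memory record (made up for illustration) to show the output shape.
    sample = (
        "ID   KT139629; SV 1; linear; genomic DNA; STD; INV; 12 BP.\n"
        "AC   KT139629;\n"
        "SQ   Sequence 12 BP;\n"
        "     atantt ggaact                                                    12\n"
        "//\n"
    )
    demo = io.StringIO()
    embl_to_fasta(io.StringIO(sample), demo)
    print(demo.getvalue(), end="")
```

For a real file, a call such as `convert_file("STD_INV_10.dat.gz", "STD_INV_10.fasta")` would write one FASTA file per download, and the resulting files could then be concatenated. Reading through `gzip.open` avoids storing the uncompressed .dat files, which are several times larger than the compressed downloads.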

Best regards, Gert-Jan