Hello @emm308,
Could you please let me know what version of CRABS you are using (crabs --version)? Do you see any error messages after downloading the EMBL .dat.gz files?
I've tried the following line of code on crabs version 0.1.8:
crabs db_download --source embl --database 'INV_10*' --output embl_INV.fasta --keep_original no
I limited the download to a couple of files to speed things up, but please find below the output of CRABS to the Terminal window.
downloading sequences from EMBL
.listing [ <=> ] 541.36K 3.11MB/s in 0.2s
STD_INV_10.dat.gz 100%[================================================================>] 57.08M 29.7MB/s in 1.9s
STD_INV_100.dat.gz 100%[================================================================>] 223.19M 28.2MB/s in 7.5s
STD_INV_101.dat.gz 100%[================================================================>] 40.99M 25.7MB/s in 1.6s
STD_INV_102.dat.gz 100%[================================================================>] 88.85M 27.7MB/s in 3.2s
STD_INV_103.dat.gz 100%[================================================================>] 209.25M 31.0MB/s in 6.8s
STD_INV_104.dat.gz 100%[================================================================>] 632.76M 31.1MB/s in 22s
STD_INV_105.dat.gz 100%[================================================================>] 891.79M 28.4MB/s in 32s
STD_INV_106.dat.gz 100%[================================================================>] 2.05G 30.9MB/s in 72s
STD_INV_107.dat.gz 100%[================================================================>] 9.61G 32.9MB/s in 5m 38s
STD_INV_108.dat.gz 100%[================================================================>] 21.83G 9.72MB/s in 21m 54s
STD_INV_109.dat.gz 100%[================================================================>] 4.57G 9.37MB/s in 9m 59s
unzipping file: STD_INV_10.dat
unzipping file: STD_INV_100.dat
unzipping file: STD_INV_101.dat
unzipping file: STD_INV_102.dat
unzipping file: STD_INV_103.dat
unzipping file: STD_INV_104.dat
unzipping file: STD_INV_105.dat
unzipping file: STD_INV_106.dat
unzipping file: STD_INV_107.dat
unzipping file: STD_INV_108.dat
unzipping file: STD_INV_109.dat
formatting STD_INV_10.dat to fasta format
100%|████████████████████████████████████████████████████████████████████████████████████| 393744159/393744159 [00:04<00:00, 82373033.65it/s]
saving STD_INV_10.fasta
formatting STD_INV_100.dat to fasta format
100%|████████████████████████████████████████████████████████████████████████████████████| 828404361/828404361 [00:11<00:00, 70580635.35it/s]
saving STD_INV_100.fasta
formatting STD_INV_101.dat to fasta format
100%|████████████████████████████████████████████████████████████████████████████████████| 216243712/216243712 [00:03<00:00, 70838102.42it/s]
saving STD_INV_101.fasta
formatting STD_INV_102.dat to fasta format
100%|████████████████████████████████████████████████████████████████████████████████████| 372206285/372206285 [00:05<00:00, 69973149.59it/s]
saving STD_INV_102.fasta
formatting STD_INV_103.dat to fasta format
100%|████████████████████████████████████████████████████████████████████████████████████| 780542809/780542809 [00:10<00:00, 71089746.61it/s]
saving STD_INV_103.fasta
formatting STD_INV_104.dat to fasta format
100%|██████████████████████████████████████████████████████████████████████████████████| 2206189377/2206189377 [00:31<00:00, 69714759.82it/s]
saving STD_INV_104.fasta
formatting STD_INV_105.dat to fasta format
100%|██████████████████████████████████████████████████████████████████████████████████| 3125974043/3125974043 [00:44<00:00, 69625169.22it/s]
saving STD_INV_105.fasta
formatting STD_INV_106.dat to fasta format
100%|██████████████████████████████████████████████████████████████████████████████████| 7143484933/7143484933 [01:44<00:00, 68250693.03it/s]
saving STD_INV_106.fasta
formatting STD_INV_107.dat to fasta format
100%|███████████████████████████████████████████████████████████████████████████████▉| 33455529418/33455529420 [08:12<00:00, 67939697.23it/s]
saving STD_INV_107.fasta
formatting STD_INV_108.dat to fasta format
100%|███████████████████████████████████████████████████████████████████████████████▉| 75641145745/75641145749 [18:32<00:00, 67979149.66it/s]
saving STD_INV_108.fasta
formatting STD_INV_109.dat to fasta format
100%|████████████████████████████████████████████████████████████████████████████████| 16253328957/16253328957 [03:58<00:00, 68206991.36it/s]
saving STD_INV_109.fasta
Combining all EMBL downloaded fasta files...
Also, the files were combined correctly into a single fasta document named embl_INV.fasta. I've placed the first lines of that document below using the head command:
>KT139629
ATANTTGGAACTTCCTTAAGTCTATTAATCCGAGCTGAATTAGGAAACCCAGGATCTCTA
ATCGGTGATGATCAAATTTATAACACTATTGTTACAGCTCACGCTTTTATTATAATTTTT
TTTATAGTTATACCTATTATAATCGGAGGATTTGGAAATTGATTAGTTCCTTTAATATTA
GGAGCCCCTGATATAGCTTTCCCACGAATAAATAATATAAGATTCTGACTTTTACCCCCA
TCTTTAACTCTTTTAATCTCCAGAAGATTAGCAGAAAATGGAGCAGGAACAGGATGAACA
GTTTACCCCCCCTTATCTTCTAATATTGCCCATAGAGGAAGATCTGTAGACTTAGCCATC
TTTTCTCTCCACTTAGCCGGAATTTCTTCTATTCTTGGTGCTATTAATTTTATTACTACT
ATTATTAATATACGCCCTAATAATATAAGATTTGATCGAATACCATTATTTGTTTGAGCC
GTTGGAATTACAGCTTTATTACTTCTTTTATCTCTTCCTGTATTAGCCGGAGCTATTACT
If your files are in .dat.gz format, the in silico PCR will indeed fail, as the files are required to be in .fasta format, and specifically in the CRABS-supported .fasta format.
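To confirm which format a given file is actually in before running the in silico PCR, a quick check of its first bytes can help. The sketch below is not part of CRABS; detect_format is a hypothetical helper name, and it assumes that gzip files start with the bytes 1f 8b, FASTA files with ">", and uncompressed EMBL flat files with an "ID   " line.

```shell
# Hypothetical helper (not a CRABS command): guess a file's format from its first bytes.
# The body runs in a subshell so the variables stay local in plain POSIX sh.
detect_format() (
  file="$1"
  # gzip streams begin with the two magic bytes 1f 8b
  magic=$(head -c 2 "$file" | od -An -tx1 | tr -d ' \n')
  if [ "$magic" = "1f8b" ]; then
    echo "gzip"
    exit 0
  fi
  start=$(head -c 5 "$file")
  case "$start" in
    '>'*)    echo "fasta" ;;      # FASTA header line
    'ID   ') echo "embl-flat" ;;  # first line of an EMBL flat-file entry
    *)       echo "unknown" ;;
  esac
)

# Example (file name taken from the log above):
#   detect_format STD_INV_10.dat.gz
```

A result of "gzip" or "embl-flat" would mean the conversion step did not run, which would be consistent with the in silico PCR failing.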
It would be great if we could resolve this issue for future users. However, I understand that since you already have 1T of data, it might be best to transform the .dat.gz files to the correct format and continue with the in silico PCR analysis. Could you please let me know whether you have a single file or multiple ones?
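For anyone who needs to transform already-downloaded .dat.gz files by hand, a minimal sketch with gzip and awk is given here. It is not the CRABS implementation; embl_dat_to_fasta is a hypothetical name, and it assumes the standard EMBL flat-file layout (an "AC" line carrying the accession, an "SQ" line opening the sequence block, "//" closing each entry). The accession-only header is inferred from the head output shown above, so whether the result matches the CRABS-supported format exactly should be checked on a small file first.

```shell
# Hypothetical helper (not a CRABS command): stream one EMBL .dat.gz file to FASTA on stdout.
embl_dat_to_fasta() {
  gzip -dc "$1" | awk '
    # Keep the primary accession from the first AC line of each entry
    /^AC   / && !have_acc { acc = $2; sub(/;$/, "", acc); have_acc = 1; next }
    # The SQ line opens the sequence block: emit the FASTA header
    /^SQ   / { in_seq = 1; print ">" acc; next }
    # The // line closes the entry: reset state for the next one
    /^\/\//  { in_seq = 0; have_acc = 0; next }
    # Sequence lines: uppercase, then drop spaces and position counters
    in_seq   { line = toupper($0); gsub(/[^A-Z]/, "", line); print line }
  '
}

# Example (file name taken from the log above):
#   embl_dat_to_fasta STD_INV_10.dat.gz > STD_INV_10.fasta
```

Because the file is streamed line by line, memory use stays small even for the larger files in the listing.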
Best regards, Gert-Jan
In case it is helpful to anyone, I'll share the bit of bash code I used (based on this thread) to download & format the EMBL invertebrate database in smaller chunks:
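# "1." presumably restricts the first chunk to INV_1 itself; a bare "1" would also match INV_10 to INV_19
suf=("1." "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19")
for j in "${suf[@]}"; do
  # one download-and-format run per prefix, each written to its own fasta file
  crabs db_download --source embl --database "INV_${j}*" --output "embl_inv_${j}.fasta" --keep_original no
done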
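Once the loop above has finished, the per-chunk files can be concatenated, since FASTA records are independent of each other. This is a small sketch under that assumption; combine_fasta is a hypothetical helper name, not a CRABS command, and the output name must not match the input glob, otherwise the output would be read as one of its own inputs.

```shell
# Hypothetical helper (not a CRABS command): concatenate FASTA files and report the record count.
# Usage: combine_fasta OUTPUT INPUT...
combine_fasta() (
  out="$1"
  shift
  cat "$@" > "$out"
  # Each record starts with a ">" header line, so counting headers counts sequences
  grep -c '^>' "$out"
)

# Example (chunk names follow the loop above):
#   combine_fasta embl_INV.fasta embl_inv_*.fasta
```

Comparing the printed count with the sum of the counts of the individual chunks is a cheap way to verify that nothing was lost.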
Hello @emm308,
Could you please update crabs to the latest version (crabs --version 1.0.0)? CRABS now downloads the .fasta files from EMBL, rather than the .dat files. Additionally, the download (crabs --download-embl) and import (crabs --import) are now separated, which allows for increased flexibility and should resolve your issue. Please reopen this thread if the issue persists.
Best wishes, Gert-Jan
Hello! I am very new to CRABS, but I am running through my bioinformatics protocol in a Linux terminal, and I have successfully downloaded the NCBI and BOLD databases and run them through in silico PCR. I downloaded the EMBL files using:
crabs db_download --source embl --database 'INV*' --output embl_INV.fasta --keep_original no
I have read here that this downloads .dat.gz files and automatically converts them to .fasta, but mine seem to be staying as .dat.gz files. The final download of all of INV from EMBL totals 1.1T, and when I try to run the individual files through in silico PCR, it fails.
I am wondering if there is a way to download EMBL straight to .fasta? I am also wondering if I am doing something wrong with the in silico PCR. Let me know if I should provide any further code that I am using :) Thank you!