saraswaminathan opened this issue 1 month ago
Hello @saraswaminathan,
When CRABS downloads the EMBL files, there should be a number associated with them. For example, the following command downloads 10 files from the EMBL database:
crabs --download-embl --output embl_mam.fasta --taxon 'mam*'
These files follow the format STD_MAM_#.fasta.gz, where # represents a number between 1 and 10.
Once all 10 files are downloaded, CRABS combines them and writes the result to embl_mam.fasta. This marks the successful end of the code. The reason CRABS downloads 10 files is that the EMBL database is set up so that each file contains 10,000 sequences. Once the first file (STD_MAM_1.fasta.gz) contains 10,000 sequences, a new file (STD_MAM_2.fasta.gz) is created on the database for newer barcodes. We know the number of files for MAM is 10, as CRABS reports how many files are being downloaded. For PLN, this should be around 50 files.
Unfortunately, EMBL doesn't provide an easy way to check which sequences were previously downloaded and skip them. (There is a way using the lists provided on the EMBL database, but it would require quite extensive coding.) However, we can make use of the numbers in the files that were already downloaded and focus on downloading the remaining ones.
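As a quick check, a one-liner like the following (a rough sketch that assumes the MAM example above and that the files sit in your current working directory) lists which of the 10 expected chunks are not on disk yet, whether zipped or already unzipped:
for i in {1..10}; do [ -e STD_MAM_${i}.fasta.gz ] || [ -e STD_MAM_${i}.fasta ] || echo "STD_MAM_${i} is missing"; done
Any number reported as missing can then be targeted with a download loop like the one below.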
Let's say that we wanted to download the 10 mammal files on the EMBL database, but that our connection failed during the download of the fourth file (i.e., STD_MAM_4.fasta.gz). I recommend deleting this fourth file, as the download most likely wasn't finished successfully. Once the fourth file is deleted, we can write a for loop to tell CRABS to download the last 7 files. Example code:
for i in {4..10}; do crabs --download-embl --output STD_MAM_${i}.fasta --taxon mam_${i}; done
In this for loop, please note that I'm specifying new output files that contain the number of the download. This stops CRABS from overwriting the output files when new files are downloaded in the for loop. Please also note that I'm using an underscore (_) after mam and before the number, which is essential to match the file names.
For the files that were previously downloaded by CRABS before the download was interrupted, you can simply unzip them yourself, which is what CRABS normally does automatically. Example code:
for gz in *.fasta.gz; do gunzip ${gz}; done
If you follow these steps, you should end up with 10 files, from STD_MAM_1.fasta to STD_MAM_10.fasta.
Once everything is downloaded and unzipped, you can import the files into CRABS separately and merge them using the --import and --merge functions, respectively. Example code:
crabs --download-taxonomy
for i in {1..10}; do crabs --import --import-format embl --names names.dmp --nodes nodes.dmp --acc2tax nucl_gb.accession2taxid --input STD_MAM_${i}.fasta --output STD_MAM_${i}.txt --ranks 'superkingdom;phylum;class;order;family;genus;species'; done
crabs --merge --output 'embl_mam.txt' --uniq --input 'STD_MAM_1.txt;STD_MAM_2.txt;STD_MAM_3.txt;STD_MAM_4.txt;STD_MAM_5.txt;STD_MAM_6.txt;STD_MAM_7.txt;STD_MAM_8.txt;STD_MAM_9.txt;STD_MAM_10.txt'
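If typing out the ten file names feels error-prone, a small shell sketch like the one below (assuming the STD_MAM_#.txt names from the import step above) builds the semicolon-separated --input string for --merge automatically:
# build "STD_MAM_1.txt;...;STD_MAM_10.txt;" and strip the trailing semicolon when passing it on
merge_input=$(printf 'STD_MAM_%d.txt;' {1..10})
crabs --merge --output 'embl_mam.txt' --uniq --input "${merge_input%;}"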
Please change the code accordingly for your use-case. I hope this helps.
Best wishes, Gert-Jan
In the next CRABS update, I will incorporate regex support for the --taxon parameter of the --download-embl function, as well as for the --input parameter of the --merge function, to simplify the code.
Best wishes, Gert-Jan
Hi Gert-Jan, Thank you so much! Your solution worked perfectly. However, because of the sheer magnitude of the EMBL plant database, pairwise global alignment with my gene of interest is taking forever (only 2% completed in 24 hours). I tried dereplicating and filtering prior to pairwise global alignment, but my database still contains 3 million+ sequences. Can you explain what exactly the size-select parameter does in the pairwise global alignment function, so I can decide how much I can reduce this by?
Thank you so much for all your thoughtful feedback - I really appreciate your help! Sara
Hello @saraswaminathan,
Great to hear this worked!
For speeding up the --pairwise-global-alignment step, we have 4 tips and tricks:
1. Run --dereplicate and --filter prior to --in-silico-pcr. This should significantly reduce the number of sequences going into --pairwise-global-alignment, as sequences that these steps would remove anyway are now discarded beforehand. Please see section 5.4.1 in the README document for more information.
2. --pairwise-global-alignment supports multithreading, so running this code on a cluster or supercomputer will significantly increase the speed of execution.
3. Use the --size-select parameter, which removes sequences longer than N from the analysis. So, if --size-select 1000 is provided to CRABS, all sequences longer than 1,000 bp will be discarded for the --pairwise-global-alignment function. The reasoning behind this is the following: let's assume that when both primer-binding regions are in the sequence, the --in-silico-pcr function has extracted the amplicon for us. This means that what remains in the original file are (i) sequences that do not cover the barcode, (ii) sequences that cover the barcode but miss one primer-binding region, and (iii) sequences that cover the barcode but miss both primer-binding regions. Since we want to extract (ii) and (iii) from the file using --pairwise-global-alignment, and we assume barcodes are created using Sanger sequencing, we can restrict the number of sequences we investigate by setting a length threshold of 2 x the maximum Sanger sequencing length (when a barcode was built using forward and reverse sequencing, with one of your target primers used to generate the barcode). This is of course an approximation and some of our assumptions will not always hold true, but it would speed up the code without the risk of losing too many barcodes. A rough worked sketch is given below this list.
4. Use the results of the --in-silico-pcr function as a filter. This is currently not yet supported in CRABS, but I will make this a filtering parameter for --pairwise-global-alignment in the next CRABS update. For now, you can extract the species names from the --in-silico-pcr output file and use this list in the --subset function.
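To make tips 1 and 3 concrete, a rough outline of the reordered pipeline could look like the sketch below. The file names are placeholders; the dereplication method, filter thresholds, and primer/amplicon parameters are deliberately omitted (copy them from your existing commands or crabs --help); and --size-select 2000 simply reflects 2 x an assumed maximum Sanger read length of roughly 1,000 bp:
# placeholder file names for your merged plant database; add your usual method and threshold parameters
crabs --dereplicate --input embl_pln.txt --output embl_pln_derep.txt
crabs --filter --input embl_pln_derep.txt --output embl_pln_filtered.txt
# in silico PCR on the reduced file (your existing primer parameters go here)
crabs --in-silico-pcr --input embl_pln_filtered.txt --output embl_pln_amplicons.txt
# alignment step with a 2 x ~1,000 bp length cap; keep your other parameters exactly as before
crabs --pairwise-global-alignment --input embl_pln_filtered.txt --output embl_pln_pga.txt --size-select 2000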
I hope these tips help.
Best wishes, Gert-Jan
Hi there, I am trying to download the plant portion of the EMBL database. I left the download running overnight because it's so large that it takes a long time. It failed because I ran out of space on my computer. I've now made more space on my hard drive and moved my crabs folder so that it syncs to a cloud drive.
Is there a way to expedite the download, so that the --download-embl command skips downloads if they're already present in the --output folder? Or is there a way to identify which EMBL accessions have not yet been downloaded, and then specifically download just those? I had to re-start the process, but it's overwriting files I had already downloaded yesterday. Also, this way, if it fails again due to running out of space, I could begin again without losing progress.
Thank you! Sara