greenelab / 2022-microberna

A pipeline to generate a compendium of bacterial and archaeal RNA-seq data
BSD 3-Clause "New" or "Revised" License

documenting whitelisting IP address by ENA for fastq download from ftp #14

Open taylorreiter opened 2 years ago

taylorreiter commented 2 years ago

The ENA provides FTP download links for sequencing data in gzipped fastq format. This is often super handy to work with, so this pipeline downloads RNA-seq reads that way.

However, the ENA blocks IP addresses that have too many incomplete downloads. This happened in the early stages of running this pipeline on the Summit cluster because I didn't specify a long enough download time, so download rules were cancelled before the downloads were complete. This could also happen on other clusters if downloads are run as preemptible jobs.
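
If a download job is cancelled and later rerun, one option is to have the rerun continue from the partial file instead of starting a fresh transfer. Below is a minimal sketch using curl. The function name `download_fastq` and the retry values are illustrative assumptions, this is not how the pipeline itself downloads files, and I don't know whether the ENA counts a resumed transfer differently from a new one.

```shell
# Hypothetical helper: download one file, resuming any partial copy on disk.
# Usage: download_fastq <url> <output path>
download_fastq() {
    url="$1"
    out="$2"
    # --retry 5          retry transient failures up to five times
    # --retry-delay 30   wait 30 seconds between retries
    # --continue-at -    resume from the size of the existing output file
    curl --retry 5 --retry-delay 30 --continue-at - --output "$out" "$url"
}
```

Calling `download_fastq` with the ENA FTP link and a target file name runs a single transfer. Giving the job a generous time limit is still the main fix.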

To whitelist an IP address, you must contact the ENA help desk, explaining why your IP address was blocked and telling them which IP address to unblock: https://www.ebi.ac.uk/ena/browser/support

To determine the IP address of the computer you are using, run: curl ifconfig.me

If you're working on a cluster like Summit, the IP address of the login node is not the same as the IP address of the compute nodes. To get the IP address of a compute node, run the same command from that node. You can either put it in an sbatch script and look at the slurm*out log file, or ssh directly into the node from the login node:

ssh smem0101

For some partitions, ssh'ing directly into the node only works if you already have a job running on that partition.
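
For the sbatch route, something like the script below should work. It is a sketch only: the job name, time limit, and the `report_public_ip` helper are assumptions for illustration, and you would add a `--partition` line for whichever partition you want to check.

```shell
#!/bin/bash
#SBATCH --job-name=node-ip
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
# Add "#SBATCH --partition=<name>" above for the partition you want to check.

# Print "<node name> <public IP>" for whichever node runs this script.
report_public_ip() {
    # Same lookup as running "curl ifconfig.me" by hand, with a time limit.
    ip="$(curl --silent --max-time 10 ifconfig.me)" || return 1
    printf '%s %s\n' "$(uname -n)" "$ip"
}

report_public_ip || echo "IP lookup failed on $(uname -n)" >&2
```

Submit it with sbatch and the node name and IP address end up in the job's slurm*out log file. Since no `--output` is set, Slurm writes the log to its default location in the submission directory.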