ncbi / egapx

Eukaryotic Genome Annotation Pipeline-External caller scripts and documentation

offline mode? #11

Open OH-AU opened 1 month ago

OH-AU commented 1 month ago

Although I can download most files in advance, it appears that some download URLs are hardcoded in the source. Specifically:

./ui/egapx.py:    taxids_file = urlopen("https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/target_proteins/taxid.list")
./ui/egapx.py:    return f'https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/target_proteins/{best_taxid}.faa.gz'

Would it be possible to set it up so that the downloads can be run independently in advance, with a check of a local directory for the files before attempting a download? On our HPC system the compute nodes themselves are firewalled. Thanks.
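For illustration, the requested behavior could look roughly like the sketch below: a small wrapper that checks a local cache directory before falling back to `urlopen`. The helper name, signature, and cache layout are all assumptions for this example, not existing egapx code.

```python
# Sketch of a check-local-then-download wrapper (hypothetical helper,
# not part of egapx; the cache layout is an assumption).
import os
from urllib.parse import urlparse
from urllib.request import urlopen

def fetch_cached(url: str, cache_dir: str) -> str:
    """Return a local copy of `url`, downloading only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    local = os.path.join(cache_dir, os.path.basename(urlparse(url).path))
    if not os.path.exists(local):  # firewalled nodes never reach this branch
        with urlopen(url) as response, open(local, "wb") as out:
            out.write(response.read())
    return local
```

With a scheme like this, files staged in advance on a login node would be picked up by the compute nodes without any network access being attempted.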

Andy-B-123 commented 1 month ago

Hi, just chiming in that I would love this support as well! Our HPC environment is similarly locked down for internet access for compute nodes.

pstrope commented 1 month ago

Hi,

Thank you for bringing up this issue. We will consider this in our planning for future updates. We will reach out once we have developed the offline mode.

Pooja

victzh commented 1 month ago

If the compute nodes are firewalled from the Internet, you also can't use HTTP(S) URLs as a data source, and you can't use SRA accessions or SRA queries for reads. If that's OK, we can move all the data downloads to the main node before starting the cluster execution. If the main node is also isolated, that creates more problems for us to work around.

OH-AU commented 1 month ago

The above would work for our institute. Many institutes have a node specifically for moving/downloading data, but that node shouldn't be used for any compute. Running a multi-step process would work for me as well, e.g. 1) data prep/download, 2) analysis, etc.
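The data-prep step described above could be sketched as a standalone prefetch script run on the data-mover node. This is only an illustration: the function names and cache directory are hypothetical, and only the two URL patterns quoted earlier in the thread are taken from egapx.

```python
# Hypothetical prefetch step for a login/data-mover node (not part of egapx).
# Mirrors the EGAP target-protein files referenced in egapx.py into a cache
# directory that the firewalled compute nodes can later read.
from pathlib import Path
from urllib.request import urlretrieve

EGAP_BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/target_proteins"

def target_names(taxids):
    """File names to mirror, following the URL patterns quoted above."""
    return ["taxid.list"] + [f"{t}.faa.gz" for t in taxids]

def prefetch(taxids, cache_dir="egap_cache"):
    """Download any missing target files into cache_dir; return local paths."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in target_names(taxids):
        dest = cache / name
        if not dest.exists():  # already-mirrored files are skipped
            urlretrieve(f"{EGAP_BASE}/{name}", dest)
        paths.append(str(dest))
    return paths
```

Step 2 (analysis) would then point the pipeline at `cache_dir` instead of the FTP site.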

Andy-B-123 commented 1 month ago

Yes, I'd be totally happy running a two-step process!

Generally I would have the read files locally, so I'd be happy running something that checks and updates the databases as an initial step, then running the compute-heavy workflow afterwards.

Thank you for considering this; it's a blocker at the moment for adopting this in my workflow too!