data install location vs cache location

openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl

Apache License 2.0


data install location vs cache location #189

Open 0xaf1f opened 7 years ago

0xaf1f commented 7 years ago

I'm trying to set up pyensembl on a shared system, so I set PYENSEMBL_CACHE_DIR to a central location before running pyensembl install for various datasets. The problem is that when I run the program with lower privileges, it also tries to write some (temporary?) files there. Expecting a cache to be writable makes sense, but I think the immutable data should be treated differently and allowed to live in a read-only location.

Thanks for your consideration
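
For anyone reproducing the report above, here is a minimal sketch of how the condition could be checked from Python. It assumes only that PYENSEMBL_CACHE_DIR names the directory in question. The helper name `cache_dir_status` is hypothetical and is not part of pyensembl; it just reports whether the current user could write to that directory.

```python
import os


def cache_dir_status(env_var="PYENSEMBL_CACHE_DIR"):
    """Report whether the directory named by env_var exists and is writable.

    Hypothetical diagnostic helper; not part of pyensembl.
    """
    path = os.environ.get(env_var)
    if path is None:
        return {"path": None, "exists": False, "writable": False}
    exists = os.path.isdir(path)
    return {
        "path": path,
        "exists": exists,
        # os.access tests permissions for the real uid/gid of this process
        "writable": exists and os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    print(cache_dir_status())
```

Running this as the unprivileged user would show whether any write into the shared directory is expected to fail.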

0xaf1f commented 7 years ago

I've just referenced this issue from https://hpc.nih.gov/apps/agfusion.html#notes

iskandr commented 6 years ago

Hey @0xaf1f -- PyEnsembl used to download GTFs and FASTA files, create some intermediate CSV files and then write out "indexed" forms of the genomic metadata (as a .db file) and sequences (as a .pickle file). I've gotten rid of the intermediate step (CSV files) but the indexed databases still get created -- it wouldn't be possible to do efficient lookups without them. Is there an alternative that you would prefer?
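
To illustrate why an indexed form helps, here is a rough sketch using only the standard library. It is not PyEnsembl's actual schema or code: the table layout, the column names, the sample rows, and the functions `build_index` and `genes_at` are all made up for this example. It loads a few GTF-style lines into SQLite and answers a positional lookup with a query instead of re-parsing the text file.

```python
import sqlite3

# Made-up rows in the nine-column, tab-separated GTF layout
GTF_LINES = [
    '1\thavana\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "ABC1";',
    '1\thavana\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "ABC1";',
    '2\thavana\tgene\t50\t400\t.\t-\t.\tgene_id "G2"; gene_name "XYZ2";',
]


def parse_attributes(field):
    """Turn 'key "value"; key "value";' into a dict."""
    attrs = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if part:
            key, value = part.split(" ", 1)
            attrs[key] = value.strip('"')
    return attrs


def build_index(lines, db_path=":memory:"):
    """Parse GTF-style lines once and store them in an indexed SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE features (seqname TEXT, feature TEXT, start_pos INTEGER, "
        "end_pos INTEGER, strand TEXT, gene_id TEXT, gene_name TEXT)"
    )
    rows = []
    for line in lines:
        seqname, _src, feature, start, end, _score, strand, _frame, attr = line.split("\t")
        attrs = parse_attributes(attr)
        rows.append((seqname, feature, int(start), int(end), strand,
                     attrs.get("gene_id"), attrs.get("gene_name")))
    conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_locus ON features (seqname, start_pos, end_pos)")
    conn.commit()
    return conn


def genes_at(conn, seqname, position):
    """Return names of genes overlapping a 1-based position on seqname."""
    cur = conn.execute(
        "SELECT gene_name FROM features WHERE feature = 'gene' "
        "AND seqname = ? AND start_pos <= ? AND end_pos >= ?",
        (seqname, position, position),
    )
    return [row[0] for row in cur]
```

Passing a file path instead of `":memory:"` would leave a .db file on disk, which is the kind of generated artifact that needs a writable location at least once.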

0xaf1f commented 6 years ago

It's been a while, so I don't remember all the issues here. I'm not opposed to the existence of a cache. I was just asking for a separation of persistent and transient files. I'd have to go back and refresh my memory to see what was going on here.
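
One way the separation of persistent and transient files could look, as a sketch only: this is not how pyensembl works, and the function `locate` and the idea of a second, read-only data directory are assumptions made for illustration. Installed files are looked up in the read-only directory first, and anything not found there is directed to a writable per-user cache directory.

```python
from pathlib import Path


def locate(filename, data_dir, cache_dir):
    """Return where filename should be read from or created.

    Hypothetical lookup order, not pyensembl behavior:
    1. the read-only data_dir, if the file was installed there;
    2. otherwise the writable cache_dir, which is created on demand.
    """
    data_path = Path(data_dir) / filename
    if data_path.exists():
        return data_path
    cache_path = Path(cache_dir) / filename
    # Only the cache directory is ever created or written to
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    return cache_path
```

With a layout like this, an administrator could populate the data directory once, and unprivileged runs would only ever touch their own cache directory.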