jeromekelleher closed this issue 5 years ago.
As a user, I like this as a feature. I leave the caching/downloading infrastructure decisions to you.
Does the user provide the URL, or are the URLs stored internally? If the URLs are stored internally, it would be good to have a command to see what maps are available.
The URLs are stored internally (but a user could easily define their own map by subclassing GeneticMap). The available maps would be viewable in the documentation (hosted on Read the Docs), since each one is a class. Having a function to return all the maps is a good idea, though; we should do that.
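For example, a user-defined map might look something like the sketch below. The `stdpopsim` import path, the attribute names, and the `all_genetic_maps` helper are illustrative assumptions, not the actual code:

```python
import stdpopsim

# Hypothetical sketch: a user defines their own map by subclassing
# GeneticMap and pointing it at a URL under their control. The
# attribute names here are placeholders, not the real API.
class MyFlyMap(stdpopsim.GeneticMap):
    species = "drosophila_melanogaster"
    url = "https://example.com/maps/my_fly_map.tar.gz"


def all_genetic_maps():
    # One way a "list available maps" helper could work: report
    # every subclass of GeneticMap that has been defined.
    return stdpopsim.GeneticMap.__subclasses__()
```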
One question is what to do with more bespoke maps that aren't available from a public resource like HapMap. For instance, the fly maps that I included in the drosophila_melanogaster module are currently sitting on one of my servers at UO. While this works and is fine in the short term, we should think about longer-term solutions for serving up these files. Any thoughts?
I think what we need to do ultimately is collect all the maps we want to support in one place and serve them from a single location. Ideally we'd store them as an AWS Public Dataset, so we'd get high availability and wouldn't have to worry about bandwidth. This would be particularly useful if people were doing large-scale simulations on the cloud (for machine learning, say).
For now, though, storing them wherever is fine.
I was trying to figure out whether a user can access an AWS Public Dataset without paying for an instance to access the data from. Does it end up being free for everyone, or does someone have to pay?
I think these are free for anyone to download from anywhere. I went into one of the datasets at random, and you can see this example.
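As a concrete check (a sketch; the bucket and key below are made up), objects in a public dataset are served over plain HTTPS, so fetching one needs no AWS account or running instance:

```python
import urllib.request

# Hypothetical public-dataset object: the bucket and key are
# placeholders. Public buckets are exposed as ordinary HTTPS URLs,
# so a plain download works from anywhere.
url = "https://some-public-dataset.s3.amazonaws.com/maps/genetic_map.tar.gz"
urllib.request.urlretrieve(url, "genetic_map.tar.gz")
```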
I think everyone is happy that this is the right basic approach, so I'm going to close the issue.
I've made some initial infrastructure to transparently download genetic maps on demand and store them in a cache for future use. The code is here. The idea is that you define a subclass of GeneticMap which gives a URL where the map can be downloaded. When you ask for a specific genetic map for a chromosome, it first checks the cache to see whether the map has already been downloaded. If so, the map is loaded directly from the cache; if not, it is downloaded from its URL. I've implemented this for the HapMapII genetic map in humans, and it seems to work pretty well. In use, it looks something like the sketch below.
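Here, the module and names (`homo_sapiens`, `HapMapII`, `get_chromosome_map`) are illustrative placeholders rather than the final API:

```python
from stdpopsim import homo_sapiens

# First use: the map file is downloaded from its URL and unpacked
# into the local cache directory.
genetic_map = homo_sapiens.HapMapII()

# Subsequent uses load from the cache, with no network access; here
# we pull out the recombination map for chromosome 22.
recomb_map = genetic_map.get_chromosome_map("chr22")
```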
There's some more work to be done in figuring out how this interacts with the chromosome definitions (which is currently clunky), but I think the basic infrastructure for downloading and managing the maps is good. It would be good to hear whether this is the right direction before I go any further: any opinions, @popgensims/all?
The reason we need this sort of caching infrastructure is because the maps are too big to bundle with the code. The gzipped HapMap genetic map is 35M, which is already too much to bundle with a Python package. Multiply this by several different maps across multiple species and it's definitely way too much.
If we follow this approach, it might be worth thinking about putting all the maps that we use in one location --- this would surely be an easy thing to convince Amazon to store as a public dataset.