spacetx / starfish

starfish: unified pipelines for image-based transcriptomics
https://spacetx-starfish.readthedocs.io/en/latest/
MIT License

json files not cached #917

Open ttung opened 5 years ago

ttung commented 5 years ago

The JSON files in starfish do not get cached because we don't have a place to record the file checksums.

I think this is mostly a developer happiness issue, but @ambrosejcarr let me know if you feel differently.

ambrosejcarr commented 5 years ago

I was going to make an issue about this -- Thanks for beating me to it! It'd be nice if they could get cached because then users could use starfish offline with data they've previously downloaded. I think this would be a great feature.

Edit: I understand there is more complexity here (it's important to check that the data hasn't changed using the remote checksums) but with cached json the locally stored objects would be adequate to use starfish, which is better than explicitly downloading a second copy of the data from s3. Some combination of flags (e.g. --use-offline or similar) would be a helpful thing.

Edit 2: despite this, I agree with your tags.

ttung commented 5 years ago

If we want to do that, I suspect we would need to kludge it by having the filenames of the json files be replaced with a dict { "name": xxx, "sha256": yyy }. Note that the top-level json (experiment.json) would still be uncached, and we might need something like starfish.experiment.Experiment.from_json("some_path", sha256="yyy"), which is pretty icky.
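A minimal sketch of how a reader might handle the proposed `{ "name": xxx, "sha256": yyy }` references, assuming a simple on-disk cache keyed by checksum. The function `load_json_ref`, the `cache_dir` layout, and the `fetch` callable are hypothetical illustrations, not existing starfish or slicedimage APIs:

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Union

# Hypothetical: a reference is either a bare filename (the current form)
# or a dict carrying a checksum, as proposed in the comment above.
Ref = Union[str, dict]


def load_json_ref(ref: Ref, cache_dir: Path, fetch: Callable[[str], bytes]) -> dict:
    """Load a referenced JSON document, using a sha256-keyed cache when possible."""
    if isinstance(ref, str):
        # No checksum is recorded, so there is nothing to key a cache entry on.
        return json.loads(fetch(ref))

    name, sha256 = ref["name"], ref["sha256"]
    cached = cache_dir / sha256
    if cached.exists():
        return json.loads(cached.read_bytes())

    data = fetch(name)
    if hashlib.sha256(data).hexdigest() != sha256:
        raise ValueError(f"checksum mismatch for {name}")
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(data)
    return json.loads(data)
```

The bare-string branch shows why the top-level experiment.json stays uncached in this scheme: nothing above it records its checksum.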

ambrosejcarr commented 5 years ago

Got it. If we tarred up the experiment as we'd been discussing, might that solve this problem at the same time?

ttung commented 5 years ago

Not really.

Another possibility: starfish caches json files entirely based on URLs, and only when the environment variable STARFISH_OFFLINE is defined, then it consults this special cache.
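A rough sketch of that URL-keyed approach, assuming the cache is refreshed on every online read and consulted only when STARFISH_OFFLINE is set. The function `read_json_bytes`, the cache directory, and the `fetch` callable are hypothetical, not part of starfish:

```python
import hashlib
import os
from pathlib import Path
from typing import Callable


def read_json_bytes(url: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> bytes:
    """Fetch a JSON file, keeping a URL-keyed copy for later offline use."""
    # The cache key is derived from the URL, not from the file contents,
    # because no content checksum is available for JSON files.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cached = cache_dir / key

    if "STARFISH_OFFLINE" in os.environ:
        # Offline mode: the special cache is the only source.
        if not cached.exists():
            raise FileNotFoundError(f"{url} is not in the offline cache")
        return cached.read_bytes()

    # Online mode: always fetch, then refresh the cached copy.
    data = fetch(url)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(data)
    return data
```

Because the key is the URL, an offline read cannot tell whether the remote file has changed since it was cached; that is the trade-off of this variant.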

ttung commented 5 years ago

It's pretty unclear how we can make this work with the current code & abstractions.

I think it's relatively straightforward to make slicedimage cache json files internally, though.

ambrosejcarr commented 5 years ago

> I think it's relatively straightforward to make slicedimage cache json files internally, though.

What are the implications of this? Why is this true and why, given this, do our abstractions prevent us from caching the json files?

ttung commented 5 years ago

Tweaking the slicedimage internals to generate and store checksums is straightforward. It's not as straightforward to propagate those checksums to someone calling the top-level APIs in slicedimage.

ttung commented 5 years ago

> What are the implications of this?

Not much. It means things like top-level constructs will not be cached, but mid-level constructs (TileSets in the current hierarchy) will be.

joshmoore commented 5 years ago

@ttung would you implement https://github.com/spacetx/starfish/issues/917#issuecomment-450262435 by having the lookup go through the checksums themselves? That is, the user calls resolve_url with a path; the library gets the remote checksum for that path and queries the cache.
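The flow described here could be sketched roughly as follows. Everything in this block is an assumption for illustration: `resolve_with_cache` is a hypothetical helper (not the real resolve_url), `remote_checksum` stands in for whatever mechanism would return a checksum for a path, and the cache is a plain dict:

```python
from typing import Callable, Dict


def resolve_with_cache(
    path: str,
    remote_checksum: Callable[[str], str],
    fetch: Callable[[str], bytes],
    cache: Dict[str, bytes],
) -> bytes:
    """Return the contents of `path`, looking the cache up by remote checksum."""
    # Ask the remote for the checksum first; this is cheap compared to a download.
    checksum = remote_checksum(path)
    if checksum in cache:
        return cache[checksum]

    # Cache miss: download once and store under the checksum.
    data = fetch(path)
    cache[checksum] = data
    return data
```

This still needs a round trip to obtain the checksum, so on its own it would not cover the offline case raised earlier in the thread.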

joshmoore commented 5 years ago

cf. the suggestion for a new JSON caching method at https://github.com/spacetx/starfish/issues/912#issuecomment-449276571