gijzelaerr closed this issue 8 years ago.
Yeah, we're having some issues with files not being uploaded to Mongo here, which I'm currently investigating. Looking through the code, one optimisation is obvious: use the MD5 checksum of the data for indexing and lookup! Currently I think we only index / deduplicate the data based on file path, which is prone to storing multiple copies of the same data.
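A minimal sketch of what content-based keying could look like (the function name and path here are illustrative, not part of the codebase):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Keying on content rather than path means two copies of the same
# file resolve to a single entry in the blob store.
key = file_md5("/data/images/example.fits")
```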
Sounds very plausible, but we haven't had that problem yet. I agree, though, that hashing is probably better than keying on the filename. Still, for us this is mostly about data retention rather than name collisions.
In the TraP database's image table, we store a "URL" for the image. In practice, I think that's always been a filename. While that has obvious downsides, it has two advantages:
The orthodoxy has always been that storing things in MongoDB is a "quick hack", so I'd be reluctant to make the TraP schema depend on it explicitly. However, it might be a worthwhile exercise to define some URL-like scheme along the lines of mongodb://host/hash.
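For illustration, such a scheme could be built and taken apart with the standard library (the function names are illustrative):

```python
from urllib.parse import urlparse, urlunparse

def make_blob_url(host, md5_hex):
    # mongodb://host/hash, per the suggestion above.
    return urlunparse(("mongodb", host, "/" + md5_hex, "", "", ""))

def parse_blob_url(url):
    parts = urlparse(url)
    if parts.scheme != "mongodb":
        raise ValueError("not a blob-store URL: %s" % url)
    return parts.hostname, parts.path.lstrip("/")

url = make_blob_url("mongo.example.org", "d41d8cd98f00b204e9800998ecf8427e")
host, blob_hash = parse_blob_url(url)
```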
So my view here is that we keep the filename URL and use that for display in Banana, but we also add the MD5 sum, which should provide a much saner way of uniquely identifying file-data blobs (and is not specific to MongoDB).
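Schema-wise that could be as small as one extra column; a hypothetical migration (table and column names are assumptions, not the actual TraP schema):

```python
import psycopg2

conn = psycopg2.connect("dbname=trap")
with conn, conn.cursor() as cur:
    # Keep `url` for display; add `md5sum` as the content identifier.
    cur.execute("ALTER TABLE image ADD COLUMN md5sum CHAR(32)")
    cur.execute("CREATE INDEX image_md5sum_idx ON image (md5sum)")
```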
To clarify: I once came up with the field "URL". URL stands for Uniform Resource Locator, which doesn't necessarily need to be an HTTP link. But the name is probably confusing, and we don't use it properly either (it is missing the scheme).
Also, we are using paths, not filenames, to store the blobs.
MD5 sounds like a good idea, but it doesn't address the original issue here. As far as I know we haven't had any path-duplication problems; I agree it looks error-prone, but in practice it hasn't happened yet.
We need a way to automatically prune old elements, and for that we may need to store additional metadata. Alternatively, we need some logic that extracts filenames/checksums from the SQL database given certain criteria and then removes those entries from the blob store.
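A rough sketch of that second option, assuming an `image` table with `url` and `taustart_ts` columns and blobs stored in GridFS under their path (all of these names are assumptions):

```python
import gridfs
import psycopg2
from pymongo import MongoClient

# Find images older than the retention window in the SQL database.
conn = psycopg2.connect("dbname=trap")
cur = conn.cursor()
cur.execute("SELECT url FROM image WHERE taustart_ts < now() - interval '90 days'")
stale_paths = [row[0] for row in cur.fetchall()]

# Remove the matching blobs from the MongoDB store.
fs = gridfs.GridFS(MongoClient()["trap"])
for path in stale_paths:
    for blob in fs.find({"filename": path}):
        fs.delete(blob._id)
```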
In the latest version we now store the images in the PostgreSQL database, in the fits_data column. This eliminates the need to track hashes of the images, and it is now also quite easy to remove old data. I also implemented a drop-dataset command.
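With the blob in a regular column, cleanup becomes plain SQL; a minimal sketch (assuming fits_data is a bytea column on image and datasets are keyed by a dataset id, which is an assumption about the schema):

```python
import psycopg2

conn = psycopg2.connect("dbname=trap")
with conn, conn.cursor() as cur:
    # Store the raw FITS file next to the image row.
    with open("image.fits", "rb") as f:
        cur.execute("UPDATE image SET fits_data = %s WHERE id = %s",
                    (psycopg2.Binary(f.read()), 42))
    # Dropping a dataset removes the blobs along with the rows.
    cur.execute("DELETE FROM image WHERE dataset = %s", (7,))
```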
Currently we store images in a MongoDB store without any retention policy or any way for the end user to remove images from storage. This will eventually cause disk-space problems. We need to be smarter about this.
Possible strategies are: