gijzelaerr closed this issue 8 years ago.
Yeah, we're having some issues with files not being uploaded to Mongo here, which I'm currently investigating. Looking through the code, one optimisation is obvious: use the MD5 checksum of the data for indexing and lookup! Currently I think we only index / deduplicate the data based on file path, which is prone to storing multiple copies of the same data.
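A minimal sketch of what content-based keying could look like (the function name and path here are illustrative, not part of the codebase):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Keying on content rather than path means two copies of the same
# file resolve to a single entry in the blob store.
key = file_md5("/data/images/example.fits")
```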
Sounds very plausible, but we haven't had that problem yet. I agree, though, that hashing is probably better than keying on the filename. Still, for us this is mostly about data retention rather than name collisions.
In the TraP database's image table, we store a "URL" for the image. In practice, I think that's always been a filename. While that has obvious downsides, it has two advantages:
The orthodoxy has always been that storing things in MongoDB is a "quick hack", so I'd be reluctant to make the TraP schema depend on it explicitly. However, it might be a worthwhile exercise to define some URL-like scheme along the lines of mongodb://host/hash.
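For illustration, such a scheme could be built and taken apart with the standard library (the function names are illustrative):

```python
from urllib.parse import urlparse, urlunparse

def make_blob_url(host, md5_hex):
    # mongodb://host/hash, per the suggestion above.
    return urlunparse(("mongodb", host, "/" + md5_hex, "", "", ""))

def parse_blob_url(url):
    parts = urlparse(url)
    if parts.scheme != "mongodb":
        raise ValueError("not a blob-store URL: %s" % url)
    return parts.hostname, parts.path.lstrip("/")

url = make_blob_url("mongo.example.org", "d41d8cd98f00b204e9800998ecf8427e")
host, blob_hash = parse_blob_url(url)
```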
So my view here is that we keep the filename URL and use that for display in Banana, but we also add the MD5 sum, which should provide a much saner way of uniquely identifying file-data blobs (and is not specific to MongoDB).
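Schema-wise that could be as small as one extra column; a hypothetical migration (table and column names are assumptions, not the actual TraP schema):

```python
import psycopg2

conn = psycopg2.connect("dbname=trap")
with conn, conn.cursor() as cur:
    # Keep `url` for display; add `md5sum` as the content identifier.
    cur.execute("ALTER TABLE image ADD COLUMN md5sum CHAR(32)")
    cur.execute("CREATE INDEX image_md5sum_idx ON image (md5sum)")
```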
To clarify: I once came up with the field "URL". URL stands for Uniform Resource Locator, which doesn't necessarily need to be an HTTP link. But the name is probably confusing, and we don't use it properly either (it is missing the scheme).
Also, we are using paths, not filenames, to store the blobs.
MD5 sounds like a good idea, but it doesn't address the original issue here. As far as I know we haven't had any path-duplication problems; I agree it looks error-prone, but in practice it hasn't happened yet.
We need a way to automatically prune old elements, and for that we may need to store additional metadata. Alternatively, we need some logic that extracts filenames/checksums from the SQL database given certain criteria and then removes those entries from the blob store.
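A rough sketch of that second option, assuming an `image` table with `url` and `taustart_ts` columns and blobs stored in GridFS under their path (all of these names are assumptions):

```python
import gridfs
import psycopg2
from pymongo import MongoClient

# Find images older than the retention window in the SQL database.
conn = psycopg2.connect("dbname=trap")
cur = conn.cursor()
cur.execute("SELECT url FROM image WHERE taustart_ts < now() - interval '90 days'")
stale_paths = [row[0] for row in cur.fetchall()]

# Remove the matching blobs from the MongoDB store.
fs = gridfs.GridFS(MongoClient()["trap"])
for path in stale_paths:
    for blob in fs.find({"filename": path}):
        fs.delete(blob._id)
```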
In the latest version we now store the images in the PostgreSQL database, in the fits_data column. This eliminates the need to track hashes of the images, and it is now also quite easy to remove old data. I also implemented a drop-dataset command.
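With the blob in a regular column, cleanup becomes plain SQL; a minimal sketch (assuming fits_data is a bytea column on image and datasets are keyed by a dataset id, which is an assumption about the schema):

```python
import psycopg2

conn = psycopg2.connect("dbname=trap")
with conn, conn.cursor() as cur:
    # Store the raw FITS file next to the image row.
    with open("image.fits", "rb") as f:
        cur.execute("UPDATE image SET fits_data = %s WHERE id = %s",
                    (psycopg2.Binary(f.read()), 42))
    # Dropping a dataset removes the blobs along with the rows.
    cur.execute("DELETE FROM image WHERE dataset = %s", (7,))
```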
Currently we store images in a MongoDB store without any retention policy or any way for the end user to remove images from storage. This will eventually cause disk-space problems. We need to be smarter about this.
Possible strategies are: