vsivsi / meteor-file-collection

Extends Meteor Collections to handle file data using MongoDB gridFS.
http://atmospherejs.com/vsivsi/file-collection

Data deduplication / CAS storage #163

Open killua-eu opened 7 years ago

killua-eu commented 7 years ago

Hi, a quick question: does meteor-file-collection have data deduplication based on hash comparisons built in, or in other words, is it a content-addressable storage? Did you consider choosing a different hash (e.g. sha256) or adding another one to protect from (quite theoretical for a file based storage, I admit) md5 collisions? Thanks in advance, P.

vsivsi commented 7 years ago

Thanks for your question. No, it doesn't implement de-dup/CAS. This package is meant to be a (relatively) simple package to expose the MongoDB gridFS implementation to Meteor, and gridFS does not implement these features. The use of MD5 is specified by the gridFS specification and implemented in the MongoDB drivers and DB server software.

You are free to add another hash (or any other information) as file metadata to meet the needs of your application. It should be possible/straightforward to implement a CAS/dedup solution on top of gridFS, and this has probably already been done (or could be added easily enough by writing a MongoDB backend for one of the many such systems that are under active development, e.g. Dat, Noms, Restic, etc.). But fileCollection will not support any of these directly because it is outside the scope of this project to do so.
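For illustration, here is a minimal sketch of what "adding another hash as file metadata" could look like. It only builds a plain object shaped like a gridFS file document; the helper name `describeFile` and the `metadata.sha256` field name are assumptions made for this example, not part of fileCollection or gridFS.

```javascript
const crypto = require('crypto');

// Hypothetical helper: builds a gridFS-style file document for a Buffer and
// records an extra SHA-256 digest alongside the MD5 that gridFS specifies.
function describeFile(filename, data) {
  const md5 = crypto.createHash('md5').update(data).digest('hex');
  const sha256 = crypto.createHash('sha256').update(data).digest('hex');
  return {
    filename,
    length: data.length,
    md5,                  // field defined by the gridFS spec
    metadata: { sha256 }, // application-defined field (assumed name)
  };
}

const doc = describeFile('hello.txt', Buffer.from('hello'));
console.log(doc.md5, doc.metadata.sha256);
```

An application could then look files up by `metadata.sha256` instead of (or in addition to) `md5`.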

killua-eu commented 7 years ago

Thanks for the answers! As for the CAS/dedup - I didn't read enough gridFS related docs, but isn't the dedup there kind of by default (though only relying on md5)? ... I can use md5 to request a file from gridFS via fileCollection, so I was assuming that fileCollection would, upon inserting the same data under a different filename, just ignore the data and just add a new filename - basically deduplication upon insertion.

vsivsi commented 7 years ago

Nope. There's no deduping or reference counting or anything in gridFS. Each file has its own chunks regardless of the MD5 sum. And the chunks themselves are not deduped or individually hashed in any way. It's very simple. So simple in fact that it is not inherently safe for concurrent writes (e.g. there is no locking of any kind). MongoDB has been "talking" about redesigning it for years, but I've seen no recent progress on that either.
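As a rough picture of that layout, the toy model below mimics the two gridFS collections with in-memory arrays. It is not the real driver: `storeFile` is a hypothetical function, and the chunk size is shrunk to 4 bytes so the output stays readable (the gridFS default is 255 KiB).

```javascript
// Toy in-memory model of the gridFS layout (not the real MongoDB driver).
const CHUNK_SIZE = 4; // tiny for illustration only

const files = [];  // stands in for the fs.files collection
const chunks = []; // stands in for the fs.chunks collection
let nextId = 1;

// Hypothetical insert: every call writes its own chunks, whatever the content.
function storeFile(filename, data) {
  const fileId = nextId++;
  files.push({ _id: fileId, filename, length: data.length });
  for (let n = 0; n * CHUNK_SIZE < data.length; n++) {
    chunks.push({
      files_id: fileId,
      n,
      data: data.subarray(n * CHUNK_SIZE, (n + 1) * CHUNK_SIZE),
    });
  }
  return fileId;
}

const a = storeFile('a.txt', Buffer.from('same bytes'));
const b = storeFile('b.txt', Buffer.from('same bytes'));
console.log(files.length, chunks.length);
```

Storing identical bytes under two names yields two file documents and two full sets of chunks; nothing is shared between them.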

vsivsi commented 7 years ago

To clarify: not safe for concurrent writes/reads to any given file. FileCollection does implement locking on top of MongoDB to make such operations safe, although more recently MongoDB has actually been de-featured in this respect (rather than fixing it) because of the mythical replacement technology that has yet to materialize.

killua-eu commented 7 years ago

Aah, that's a bit of a letdown on Mongo's side. Please correct me if I'm wrong: could a "poor man's" dedup-on-write be simply implemented as a few-liner if one would query FileCollection first for the md5 to be written and decide whether to write it or only update references?

vsivsi commented 7 years ago

Sure, if all you want is file-level dedup, that could work (probably a bit more than a "few liner" though). You'd need to implement reference counting and ensure that the inc/dec logic is safe for concurrent operations, and come up with a scheme to implement "per copy" metadata, etc. In general, gridFS is very simple, but it is also pretty flexible in terms of making it possible for lots of higher level functionality to be built on top at the application level, precisely because it specifies so little.
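To make the moving parts concrete, here is a minimal single-process sketch of file-level dedup with reference counting and per-copy metadata. Everything in it is hypothetical (the `DedupStore` class, its method names, the use of SHA-256 as the content key), and plain `Map`s stand in for the collections. It deliberately ignores the hard part mentioned above: against a real MongoDB the increment/decrement would have to be done atomically (for example with a `$inc` update) and coordinated with chunk deletion.

```javascript
const crypto = require('crypto');

// Hypothetical in-memory stand-in for a dedup layer on top of gridFS.
class DedupStore {
  constructor() {
    this.blobs = new Map();  // content hash -> { data, refCount }
    this.copies = new Map(); // copy name -> { hash, metadata }
  }

  // Store data under a name; identical content is kept only once.
  put(name, data, metadata = {}) {
    if (this.copies.has(name)) this.remove(name); // overwrite releases the old reference
    const hash = crypto.createHash('sha256').update(data).digest('hex');
    const blob = this.blobs.get(hash);
    if (blob) {
      blob.refCount += 1; // content already stored: just add a reference
    } else {
      this.blobs.set(hash, { data, refCount: 1 });
    }
    this.copies.set(name, { hash, metadata }); // "per copy" metadata lives here
    return hash;
  }

  get(name) {
    const copy = this.copies.get(name);
    return copy ? this.blobs.get(copy.hash).data : undefined;
  }

  // Drop one reference; the content is deleted only when no copies remain.
  remove(name) {
    const copy = this.copies.get(name);
    if (!copy) return false;
    this.copies.delete(name);
    const blob = this.blobs.get(copy.hash);
    blob.refCount -= 1;
    if (blob.refCount === 0) this.blobs.delete(copy.hash);
    return true;
  }
}

const store = new DedupStore();
store.put('a.txt', Buffer.from('same bytes'), { owner: 'alice' });
store.put('b.txt', Buffer.from('same bytes'), { owner: 'bob' });
store.remove('a.txt');
console.log(store.blobs.size, store.get('b.txt').toString());
```

Removing `a.txt` leaves the shared content in place because `b.txt` still references it; only removing the last copy deletes the data.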

killua-eu commented 7 years ago

Oh, I believed/hoped that I could rely on Mongo for concurrency safety. Thanks for the info, I'll have to read a bit more on gridFS to figure out the limitations.

vsivsi commented 7 years ago

You should check out my gridFS locking package (and the sister gridFS streaming package). Lots of good info there, and file-collection is built on top of it.

https://github.com/vsivsi/gridfs-locks

killua-eu commented 7 years ago

Lovely! Thanks lots for all the infos :)