mongoid / mongoid-grid_fs

A pure Mongoid/Moped implementation of the MongoDB GridFS specification.
Other
83 stars 51 forks source link

Can't save the same filename file #36

Closed lajunta closed 10 years ago

lajunta commented 10 years ago

I got this error:

Problem: Validation of Mongoid::GridFS::Fs::File failed. Summary: The following errors were found: Filename is already taken Resolution: Try persisting the document with valid data or remove the validations.

It seems that filename must be unique when saving file to grid

How to solve it? Please help.

Can you provide a way to bypass the filename validation?

mattconnolly commented 10 years ago

This validation should not be in the gem. There should be no requirement for the filenames to be unique. They are optional metadata.

http://docs.mongodb.org/manual/reference/gridfs/

I have an updated fork where I have disabled this validation, but am waiting for an update of mongoid gem for some upstream changes before sending the pull request (for my other changes).

lajunta commented 10 years ago

Thanks matt , I will save it to metadata temporarily.

But I still hope I can use filename .

mattconnolly commented 10 years ago

Please leave this open. This is a bug that is not yet fixed.

jayden82821 commented 10 years ago

Long path tool can helpful on this situation. Thanks

ahoward commented 10 years ago

@mattconnolly it's a massive design flaw that mongos gridfs specification allows duplicate filenames. no other fs in the world allows this. neither does url space. in practice adding duplicate filenames renders data useless as it cannot be exported or imported from fs or url space.

regardless, the library fixes this for you


io = open 'README.md'

grid = Mongoid::GridFs

file = grid.put io

file.filename #=> "5314be99af481ce99a000001/README.md"

this behavior is documented in the readme and handles the unique naming issue for you

the hash interface ([] and []=), by definition, requires unique keys

when used alongside carrerwave for file uploads the configuration of that library had better ensure that the path/key the file is stored is unique - just like all other adapters such as fs, s3, etc

so, there is no bug, the library is working as intended

@lajunta i think your code must be deliberately subverting the interface to hit that error - can you show it?

in summary there are only two possible use cases for users of this library (or any fs):

mattconnolly commented 10 years ago

@ahoward, I disagree.

The mistake in my opinion was calling it "file system" instead of "file store". There are many other features of a file system that do not exist in GridFS, such as folders, links, users, permissions, date created, modified, and so on. Many of these things can be saved in metadata and are the responsibility of the application using GridFS, but not of GridFS itself.

Let us imagine a dropbox like application where user X uploads a file called "avatar.jpg". User Y also uploads a file of the same name. With your gem this application could not be built. In this application, you may want to ensure that each user cannot upload multiple files of the same name. So this is the application's responsibility to manage, not that of the GridFS driver. When the user downloads the file, the headers might contain things like the content-type, filename and so on. This particular filename has no path information, and is optional as well. (HTTP file downloads do not require a content-disposition header specifying the filename). This is a great use case of the metadata in GridFS.

Files stored in GridFS do have a unique identifier and that is the ObjectId of the file record. If you need unique URLs, then this identifier should be used - especially since the file name is optional metadata.

Another problem caused by unique filename constraint is the inability to safely update a file. GridFS does not support in place updates, so to update a file, a new copy must be uploaded. With a unique constraint on the filename, the original file must be deleted before uploading the new one. This provides poor failure handling capability: what happens if the file upload fails? The original has been deleted. Once again, this is an application level problem that the application should be empowered to deal with in whatever manner is correct for the application. If the application has to rename the file, it has to verify that the temporary name has not been used, and so the application becomes more complicated to work around this unnecessary constraint.

Justifying that another gem uses this feature unfairly restricts the capabilities of other users, and I do not see this as validation that it is the correct thing to do. Filename uniqueness should be an application responsibility.

Furthermore, I implore you to honour the specification of GridFS - whether you agree or not - so that applications using other languages (C# or Java drivers for example) can access the same data in a compatible manner. Please refer again to the link above which clearly states that the filename is optional metadata. It just does not make sense to put a unique index on an optional field.

ahoward commented 10 years ago

@mattconnolly reasonable. personally, i've found that working with apis that support duplicate filenames is like eating glass: i've really appreciated being able to rely on unique filenames, vs. having to reconcile dups and track everything by id. i'd consider a pull request that

consider carefully the fact that

is a massive race condition in mongo that requires findAndModify to solve in the best case, worse in a non-simple replication setup, and this resulted (when i tried it) in a pretty non-POLS api. compare this to tacking on a uuid to filenames when it bothers someone, especially when they are just 'optional metadata', and i think you'll end up where i did: thinking that letting users deal with the unique constraint by string bashing (done automatically by the lib) requires a good 500 LOC of complex code less that doing the opposite: making all developers that want unique filenames learn about the CAP theorem ;-/

also, consider your logic:

ps. file downloads absolutely will not work correctly without a filename in many flavors if IE. it botches the mime-type and totally relies on the .ext - same goes once the file is on disk in windoze: a fubar file ext will spawn the wrong application when the file is opened, etc.

pss. non-unique filenames actually makes replacement much, much, much harder. if you replace the chunks, and then cleverly 'rename' the meta-data record, you will create a fubar file if replication is being used (unless all writes and reads come from the master) where some people get the file intact, and some people get it intermixed/corrupt. i know this because i've done it. consider what replication and '1000s of file chunks' associated with a metadata record means (hint: not all 1000 chunks replicate at once). the only sane way, with large data and replication to safely use grid_fs in production applications is to consider the file store to be 'append only' - that is to say never update and file in place and always write the metadata last (praying the the chunks will replicate sooner from the op-log)

in any case, lest the above sound snarky: i really will consider a pull request that makes uniqueness optional - provided it is safe and does not contain race conditions that results in rendering corrupt files

ahoward commented 10 years ago

@mattconnolly a question for you too: if you were hitting duplicate filenames you must have specifying them yourself, because the library ensures unique names are created by default. if this was the case (you were specifying non-unique filenames) then that info is obviously meta-data, so why not simply decorate the file object with that metadata, as the grid_fs specification allows for?

ahoward commented 10 years ago

tests pass. i want to mull on this longer tho....

https://github.com/ahoward/mongoid-grid_fs/commit/542119a2c3e1e6545291530c37f307b171d69fc1

mattconnolly commented 10 years ago

The scenario where I saw duplicate filenames was in an existing MongoDB database where files have been saved by another system (using C# mongo driver). In that system the relationship between the files and other parts of the system is solely by the "_id" of the file, stored as a foreign key in the other collection (let's call it "documents"). (Using a foreign key like this doesn't have any of the problems of replacing files or getting mixed file updates since there is an atomic switch from one file ID to another on the record in the documents collection.) Filenames were used, but only as name of file when downloaded, not as a unique identifier to locate the file in GridFS.

This is design works fine, conforms to the GridFS spec and works with the C# driver.

I need to be able to disable the unique index in order to use this database simultaneously from Ruby.

This is aggrevated by Mongoid wanting to control the indexes on collections it knows about.

ahoward commented 10 years ago

@mattconnolly i think you may mis-understand the current implementation when you say

Let us imagine a dropbox like application where user X uploads a file called "avatar.jpg". User Y also uploads a file of the same name. With your gem this application could not be built.

because this application is easy to build. you simply do

open('avatar.jpg'){|io| grid.put(io)}

the library will generate a unique name $object_id/avatar.jpg

https://github.com/ahoward/mongoid-grid_fs/blob/master/lib/mongoid-grid_fs.rb#L183

FYI

ahoward commented 10 years ago

@mattconnolly well, that's easy to fix


irb(main):006:0> Mongoid::GridFs::File.index_options
=> {{:filename=>1}=>{:unique=>true}}
irb(main):007:0> Mongoid::GridFs::File.index_options.delete({:filename=>1})
=> {:unique=>true}
irb(main):008:0> Mongoid::GridFs::File.index_options
=> {}
ahoward commented 10 years ago

or even


Mongoid::GridFs::File.index_options.delete_if{|hash| hash.keys.include?(:filename)}

easy cheasy

rmm5t commented 10 years ago

While I completely understand @mattconnolly's arguments. Especially because mongoid-grid_fs slightly steers from the original GridFS spec and reference implementations, I can see how this implementation might seem strange. However, I agree with the approach that @ahoward took.

I guess it depends on how you want (or expect) a "File System" to work. Personally, I think of GridFS as an easier to deploy NFS or maybe even as a poor-man's Amazon S3. Both require unique filenames. Directories can be simulated in S3 by including forward slashes in the object's name itself.

I will admit that the mongoid-grid_fs's implementation caught me off guard when I first saw it, because I was familiar with the GridFS spec, but I later came to the conclusion that it perhaps contained a better set of defaults. It's nice to rely on unique filenames.

Also, considering you can always drop the unique index on fs.files, how much of an issue is this still?

ahoward commented 10 years ago

i've thought about this more and think it may be possible to support all cases, see my recent branch @ https://github.com/ahoward/mongoid-grid_fs/commit/542119a2c3e1e6545291530c37f307b171d69fc1

in particular note:

these changes bring the library very, very close to the spec without, i think, altering any behavior for existing clients. regardless, i bumped the version to 2.0.0. some lingering concerns

ahoward commented 10 years ago

/cc @rmm5t and @mattconnolly

mattconnolly commented 10 years ago

The metadata field is solved. Not sure if this has been released yet though.

I had an issue in mongoid for this: https://github.com/mongoid/mongoid/issues/3362

My pull request was rejected: https://github.com/mongoid/mongoid/pull/3364

But the author has renamed the field in this commit: https://github.com/mongoid/mongoid/commit/22d7b0b6f45e8ce7d93753111833fa2e550f6e52

ahoward commented 10 years ago

@mattconnolly yep - saw that. can you (and others) test my new branch?

ahoward commented 10 years ago

bump @mattconnolly does this branch work for your use-case?

mattconnolly commented 10 years ago

This is looking good to me. Running this with mongoid-4.0.0.beta is working nicely for me.

Thanks, very much @ahoward!!!

mattconnolly commented 10 years ago

(@ahoward - to your earlier point about duplicate file names, yes our application names files explicitly. We have data arriving over another network protocol, so the readable that we save is an in memory buffer, not a file with a filename. This happens using the C# mongo driver, and when files are replaced we upload the new file with the same name, switch the foreign key and finally remove the old file. The file ids are embedded in other documents, so the files are identified by their ids, not their filenames. We will now be doing the same sorts of things from some ruby processes too. Thanks again.)

ahoward commented 10 years ago

thanks for the feedback @mattconnolly - i need to run one more whitebox test and i'll make a new gem release (early next week). cheers.

ahoward commented 10 years ago

gem v2.0.0 (now master) contains this fix.