remove duplicate file name deletion. filename is not a unique identifier.

ryancole commented 12 years ago

After reading the MongoDB documentation, it does not look like a file's name is supposed to be a unique identifier in GridFS. Looking over the Mongolian code, the MongolianGridFile#save method performs a findOne query, using the filename and id properties. If it locates a file with the same name, it deletes it. How to save files with the same name, then? I believe this is a mistake in the code. I also do not recognized the used id property - is that supposed to be _id? The existing GridFS file record does not have an id record, so that'll never match anyways, as far as I can tell.

I have removed this check-and-delete logic and the code appears to function as I would expect, now. I am able to insert multiple files with the same name, and they're giving their own record in GridFS.

This does break the unit tests, but after looking over the unit tests it looks like they were written from the perspective of a file's name being unique, which I think invalidates the unit tests as a whole for GridFS, and so I did not bother updating them.

Can we review the reasons why you consider filename to be unique, @marcello3d?

ryancole commented 12 years ago

Here is the application that I wrote to test this. I'm simply saving the code for this application to the database. Multiple executions should created additional records. Currently, in Mongolian master branch, this does not create multiple records. It deletes the previous record.

var fs = require('fs'),
    util = require('util'),
    mongolian = new (require('/home/ryan/javascript-libs/node-mongolian')),
    database = mongolian.db('test'),
    gridfs = database.gridfs();

// create a new gridfs file
var file = gridfs.create('foo'),
    write_stream = file.writeStream();

// pipe local data into it
fs.createReadStream('app.js').pipe(write_stream);

// save the file
file.save(function () {

    // list all files in this collection, now
    gridfs.find().forEach(function (document) {

        // print this file out to console
        console.log('file: ' + util.inspect(document) + '\n');

    }, function (err) {

        if (err) { console.log('error: ' + err); }

    }); 

});

marcello3d commented 12 years ago

You're correct in your understanding. The id bit is definitely a bug.

My reasoning was that if I save a file to /foo/bar.txt, and I read a file from /foo/bar.txt, I would expect to get the same file back.

I went back and re-read the GridFS spec: it's not clear, but I believe you're right (the only hint is that filename is optional). My impression of their specification was that it left the details to the driver implementor.

Digging into the Java driver source also indicates that filename doesn't have to be unique and you can retrieve multiple files from the same filename.

I wonder if it makes sense to make this an option. I don't think I'm alone in thinking that the duplicate file behavior is unexpected. Perhaps we should ask 10gen?

That said, we'll need new unit tests if we're going to switch.

ryancole commented 12 years ago

I read the documentation as basically saying that GridFS is simply a specification for splitting file data up within a regular document collection. So, in essence, it's the same as all other collections except it has some more requirements if it's to be used for storing large documents - and that is that documents are to be chunked. With that in mind, by default, _id is the only unique field in a collection.

Even more specific to the case of GridFS, I like to think of it more as a database for files, rather than a file-system for an OS. In this case, _id would be the key, and the document payload would be the value. All other attributes would just be metadata, including filename. I think that the mongofiles tool that ships with mongodb is a good example of this - it allows duplicate file names.

In fact, I've used the Python, Java and nodejs native drivers for MongoDB on projects in the past, involving GridFS, and I don't think any of them imposed the uniqueness of attributes for GridFS, now that I think about it. (Edit: I'm actually unsure of the node-mongo-native lib. I'm trying it out, and it may be doing some sort of uniqueness, too. I'm looking into it.)

In terms of this being an option, personally I'd rather have a decision like that better be left up to logic in my code rather than my database driver's code. It's really not that difficult of a code check to put in a routine that creates a new GridFS document, anyways. It just seems unnecessary to me, personally.

I have asked on IRC, in #mongodb on Freenode, and have not received any response yet. I'll idle there for the next few hours while I code and see if anyone responds.

marcello3d commented 12 years ago

I agree with your reasoning. I'll see what I can do for the next release.

marcello3d / node-mongolian

remove duplicate file name deletion. filename is not a unique identifier. #90