Read video directly from MongoDB gridfs

marcocaggioni commented 6 years ago

Hi, I've been trying to do video analysis reading files directly from a MongoDB gridFS storage using PIMS.

A MongoDB GridFS document can hold very large files together with metadata in JSON format. I use the metadata for things like calibration and sample specification, and I can also store tracking results back into the same JSON document, so that when I retrieve the video the tracking (or any other analysis) is ready to use.

It works with CINE files with a very small modification to the open cine function: if I pass a GridOut object, which is the file-like object returned by a MongoDB query, I just skip the open call and from there everything works. I can ask for a specific frame without downloading the whole video.

A question for you: do you think this could also be done for other formats like MOV? I see that you enable random access via pims, but since the read function uses subprocess, I think the MongoDB file-like object is not accessible to the open function.

I hope this is clear enough for you to tell me whether you have already thought of a solution that would allow keeping files and data in a document-based database and having pims access the frames without downloading the full video.

Thanks again for your libraries and all your nice work. Marco

nkeim commented 6 years ago

Hi! This is a very cool use of pims!! Please consider submitting a pull request with your modification for the cine reader, at least, that lets __init__ accept a file-like object in place of a filename. I think some of our users could come up with some creative uses for that.

As far as I can tell, you are right that ffmpeg and its ilk need to open an actual POSIX file. It is hard to imagine a workaround at the Python level (although at a deeper level, you could use FUSE). The cine and norpix readers are implemented in Python because those are simple formats that are very good for random access; that is generally not true of MOV, etc.

I wonder if you could process a MOV file once and convert it to a more suitable format for distributed computing (HDF5/NetCDF?).

marcocaggioni commented 6 years ago

Hi, thanks for your answer. Basically, the only modification I made to the cine reader is to check in __init__ whether I'm passing a GridOut or a filename:

import gridfs

def __init__(self, filename, process_func=None,
             dtype=None, as_grey=False):

    # Use an already-open GridOut as-is; otherwise open the path
    if isinstance(filename, gridfs.grid_file.GridOut):
        self.f = filename
    else:
        self.f = open(filename, 'rb')

and the rest works as if I were pointing to a file. This is probably not a good approach, because I need to import gridfs just to perform the check, so there may be a better way.

I'm not familiar with pull requests, so let me know how I should proceed.

Your idea of converting the video to a simple format as it is uploaded to the database is interesting. In principle it would not be difficult to write a Python function that writes cine files, but I'm not sure whether the cine format is considered proprietary.

If anyone is interested, I put together an example of how to do particle tracking on video stored in a database in this repository: https://github.com/marcocaggioni/microrheology-test

It works on Binder, but if you want to run it locally it assumes a local MongoDB listening on port 27017. Hopefully the structure of the notebooks is clear.

If you get to the step4 notebook, I also included an example of an analysis called Differential Dynamic Microscopy (DDM). It is complementary to particle tracking: it runs on the same data but uses a different approach, similar to dynamic/static light scattering. I may open another discussion thread; I think it would be nice to include DDM in the Soft-Matter package. I'll be happy to help, but I'm not very experienced in making libraries, so I would need some help.

thanks

nkeim commented 6 years ago

Cool!! Perhaps you could troubleshoot the binder config for https://github.com/soft-matter/trackpy-examples 😃. In the "Step1" notebook, however, you do need to include !mkdir -p data/db

It would be both easiest and best to let the reader use any seekable file-like object, not just from MongoDB:

try:
    _ = filename.tell()
    is_openfile = True
except (IOError, AttributeError):
    is_openfile = False

You should also double-check that the reader seeks to position 0 before reading the file header.
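
Putting those two points together, a minimal sketch of such a check might look like the following. The helper name `as_binary_file` and its `(fileobj, owns_file)` return convention are hypothetical and not part of pims; the in-memory `io.BytesIO` stream only stands in for a GridOut object.

```python
import io


def as_binary_file(source):
    """Return (fileobj, owns_file) for a path or an open file-like object.

    Hypothetical helper, not part of pims. `owns_file` tells the caller
    whether the file was opened here and should be closed by the caller.
    """
    try:
        # Duck-type check: anything that can report its position and
        # rewind is treated as an already-open file.
        source.tell()
        source.seek(0)  # make sure the header is read from the start
        return source, False
    except (AttributeError, OSError):
        # Otherwise fall back to treating it as a path; open() will
        # raise if it is not one.
        return open(source, 'rb'), True


# Usage with an in-memory stream standing in for a GridOut object
stream = io.BytesIO(b'HEADERframe0frame1')
stream.read(6)                      # simulate a partly consumed stream
f, owns_file = as_binary_file(stream)
header = f.read(6)                  # read from position 0 again
```

Tracking `owns_file` is one way to avoid closing a stream that the caller passed in and may still want to use.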

I haven't read the terms in a long time, but I suspect the legal status of a cine writer is about the same as a cine reader. Multipage tiff might be the closest thing we have to a universal scientific movie format, except it has a lot of variations. But you may be able to use those libraries with gridfs. MongoDB also scales to huge numbers of records, so maybe you could store each frame and then write your own pims reader.
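
To illustrate the frame-per-record idea, here is a rough sketch of a lazy, random-access reader. Everything in it is an assumption made for illustration: the class name, the record layout (`shape` and `data` keys), and the plain dict that stands in for a database collection. A real pims reader would more likely subclass `pims.FramesSequence` and return arrays rather than nested lists.

```python
class FramePerRecordReader:
    """Lazy, random-access reader over frames stored one per record.

    Hypothetical sketch: `store` is any object where store[i] returns a
    dict with 'shape' (height, width) and 'data' (raw 8-bit pixels).
    """

    def __init__(self, store, n_frames):
        self._store = store
        self._n_frames = n_frames

    def __len__(self):
        return self._n_frames

    def __getitem__(self, i):
        if not 0 <= i < self._n_frames:
            raise IndexError(i)
        record = self._store[i]  # only this frame's record is fetched
        height, width = record['shape']
        data = record['data']
        # Split the flat byte string into rows of 8-bit pixel values
        return [list(data[r * width:(r + 1) * width]) for r in range(height)]


# Two tiny 2x3 frames keyed by frame number (a dict stands in for the database)
store = {
    0: {'shape': (2, 3), 'data': bytes([0, 1, 2, 3, 4, 5])},
    1: {'shape': (2, 3), 'data': bytes([9, 8, 7, 6, 5, 4])},
}
frames = FramePerRecordReader(store, n_frames=2)
second = frames[1]  # fetches frame 1 without touching frame 0
```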

soft-matter / pims

Read video directly from MongoDB gridfs #297