sciapp / sampledb

Sample and Measurement Metadata Database
https://scientific-it-systems.iffgit.fz-juelich.de/SampleDB/
MIT License

Storing uploaded files in DB #32

Closed NicolasCARPi closed 2 years ago

NicolasCARPi commented 2 years ago

Hello,

From the backup and restore guide: "By default, new files will be stored in the database instead of in the file directory, so the file directory may be empty." And indeed, the files folder stays empty now!

Can I ask why you decided to change this behavior and store uploaded files in the db? Don't you think it might cause issues later on with many big binary blobs in the db? I would be interested to know what pushed you towards this decision :)

Also, wouldn't it make sense to migrate all previously stored files into the db during a version update?

Finally, maybe the code, docs and Dockerfile should be updated to reflect that this directory/mount is no longer used.

FlorianRhiem commented 2 years ago

Hey Nicolas,

Just as SampleDB aims to store the metadata for samples, measurements, experiments, etc., its file support is mostly aimed at small files that aid reproducibility, such as instrument or experiment configuration files, or that provide additional context, such as a preview image of a scan, rather than the actual measurement result files. Those can vary wildly in size depending on the type of measurement, and while it is feasible to upload small result files to SampleDB, I do not think that uploading large files to a web application is the ideal way to store them. Instead, I generally recommend that users pick a storage solution that works best for their type of measurement data and then either explicitly link the location in SampleDB or use an identifier/folder scheme that implicitly creates that link.

The reason I added the database storage type and made it the default is that it is simply nicer to rely on the guarantees PostgreSQL provides, e.g. with regard to atomicity and data consistency, than to worry about the separation between database and file system causing issues, e.g. during a system crash. Storing the files separately works well enough, but after some discussion, and after seeing such issues on a faultily configured development installation, keeping the files in the database appeared to be the best option. I trust that PostgreSQL handles blobs with reasonable efficiency and have not encountered any performance issues with it yet. There is a nice list of pros and cons on this topic in the PostgreSQL wiki.
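To illustrate the atomicity argument above: when file content lives in the same database as its metadata, both are written in one transaction and either succeed or fail together. This is a minimal sketch of that idea, not SampleDB's actual code; it uses Python's built-in sqlite3 only to stay self-contained, whereas SampleDB stores files in PostgreSQL (e.g. as bytea). The table and file names are made up for the example.

```python
import sqlite3

# In-memory database standing in for the real (PostgreSQL) database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, data BLOB)"
)

data = b"example instrument configuration"

# One transaction: the metadata (name) and the blob are committed together,
# so a crash cannot leave a metadata row pointing at a missing file.
with conn:
    conn.execute(
        "INSERT INTO files (name, data) VALUES (?, ?)",
        ("config.ini", data),
    )

# Reading the file back returns the exact bytes that were stored.
row = conn.execute("SELECT name, data FROM files").fetchone()
```

With files on disk instead, the INSERT and the filesystem write are two separate operations, and a crash between them is exactly the inconsistency described above.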

Still, keeping files in the file system works well enough, so instead of moving files over during an upgrade and possibly running into issues if that move failed for some odd reason, I added a script (move_local_files_to_database) that performs the move and can be called if/when the admin wants to.
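A hypothetical sketch of what such a migration has to get right (the schema and file names here are illustrative, not SampleDB's actual implementation): copy each local file into the database first, and delete the local copy only after the transaction has committed, so an interrupted move can be re-run without losing data.

```python
import os
import sqlite3
import tempfile

# In-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, data BLOB)")

# A stand-in for the local file directory with one example file.
file_dir = tempfile.mkdtemp()
with open(os.path.join(file_dir, "scan_preview.png"), "wb") as f:
    f.write(b"fake image bytes")

for entry in os.listdir(file_dir):
    full_path = os.path.join(file_dir, entry)
    with open(full_path, "rb") as f:
        data = f.read()
    # Commit the blob before touching the original file.
    with conn:
        conn.execute(
            "INSERT INTO files (name, data) VALUES (?, ?)", (entry, data)
        )
    # Safe to delete: the content is already persisted in the database.
    os.remove(full_path)
```

If the process dies mid-loop, every file is still either on disk or in the database, which is why running the move as an explicit admin step is safer than doing it implicitly during an upgrade.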

The code and documentation would probably be a little cleaner if database storage became mandatory at some point, but so far I have been quite reluctant to change existing behavior. I would want to warn administrators for at least one version before requiring such a change. A system for automated notifications/warnings about something like that is on my to-do list, just with low priority so far. :)

NicolasCARPi commented 2 years ago

I understand, it's a reasonable decision, although it closes the door on uploading bigger files, which is something users always want at some point. But I agree that SampleDB is not the right place for such files anyway.