hpcflow / hpcflow-new

Mozilla Public License 2.0

Idea: Persist to DB #695

Closed · dkfellows closed this 2 weeks ago

dkfellows commented 3 months ago

As I was going through the types for the persistence code, it struck me that this is the sort of thing that might be worth serializing to a database. Specifically, SQLite has the right sort of data lifetime (file/directory structure rather than institutional) and has some pretty reasonable support for JSON these days that might be helpful.
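For what it's worth, a minimal sketch of what the JSON side might look like (the table and field names are hypothetical; this assumes the bundled SQLite has the JSON1 functions available, as modern builds do):

```python
import json
import sqlite3

# Hypothetical schema: one row per persisted parameter, with the value stored
# as a JSON text column so SQLite's JSON functions can query inside it.
conn = sqlite3.connect("workflow.db")
conn.execute("CREATE TABLE IF NOT EXISTS params (id INTEGER PRIMARY KEY, data TEXT)")
conn.execute(
    "INSERT INTO params (data) VALUES (?)",
    (json.dumps({"name": "mesh", "size": 128}),),
)
conn.commit()

# Filter and project on fields inside the JSON without deserialising every
# row in Python.
rows = conn.execute(
    "SELECT id, json_extract(data, '$.size') FROM params "
    "WHERE json_extract(data, '$.name') = ?",
    ("mesh",),
).fetchall()
print(rows)  # e.g. [(1, 128)]
```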

This is a low-priority idea, but it might be a good fit?


A previous project switched to SQLite from a very large directory full of binary files that were used to record the data streaming out of various recording channels. Big simulations would have millions of recording channels, so the directory ended up with a very large number of files in it and the simulation control software originally wanted to hold all of them open at once. This would severely strain the OS! We tried various workarounds, but dumping everything into a DB worked best, retaining the same speed as simple binary file dumps for small sims and yet going enormously faster for large ones (because there was only a small number of files and file handles). We also got much better metadata than we had before (because filenames are terribly fragile as metadata holders).

aplowman commented 2 months ago

This is an interesting idea, and it could be useful in some scenarios, but I don't think a SQLite storage format is suitable for what we've been using hpcflow for so far (high-concurrency workflows on HPC), so I agree it would be a low-priority idea.

In a previous incarnation of hpcflow, we used a SQLite database to help manage concurrency (e.g. in a job array, multiple array items need the same file to be written, but it only needs to be written once). It worked fine for small workflows but would fail for larger workflows. Things may have changed since I last checked, but I believe the prevailing view is that it is not a good idea to use SQLite on networked file systems (Lustre, NFS).
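For context, the coordination pattern being described is roughly the following (a sketch only; `try_claim`, the table, and the database path are hypothetical, and as noted this is exactly the kind of thing that tends to break down on Lustre/NFS):

```python
import sqlite3

def try_claim(db_path: str, file_key: str) -> bool:
    """Return True iff this array item won the right to write `file_key`."""
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS written (key TEXT PRIMARY KEY)")
        cur = conn.execute(
            "INSERT OR IGNORE INTO written (key) VALUES (?)", (file_key,)
        )
        conn.commit()
        # rowcount is 1 if we inserted the row (we write the file),
        # 0 if another array item already claimed it.
        return cur.rowcount == 1
    finally:
        conn.close()

if try_claim("coordination.db", "shared_input.dat"):
    pass  # only the winning array item writes the shared file
```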

I think for HPC use, Zarr is a really good format. In particular, we need the ability to:

  1. Write to the same array from multiple independent processes without any sort of locking (which can be difficult to do reliably on networked file systems)
  2. Slice arrays easily without loading the whole thing into memory. Eventually, I'd like hpcflow to support reading and writing remote workflows via SFTP (e.g. interacting with an HPC resource) or HTTPS (e.g. reading from Zenodo), so good slicing support becomes crucial (see the sketch after this list).
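As a rough illustration of both points (hypothetical shapes and paths; this assumes the zarr-python v2 API): writes from independent processes can target disjoint chunk-aligned regions, and reads can slice without loading the whole array:

```python
import numpy as np
import zarr

# One chunk per task: writes to distinct chunks touch distinct files on disk,
# so independent processes can write concurrently without locking.
z = zarr.open(
    "workflow.zarr", mode="a", shape=(1000, 100), chunks=(10, 100), dtype="f8"
)

task_idx = 3  # e.g. the job-array index of this process
z[task_idx * 10:(task_idx + 1) * 10, :] = np.random.rand(10, 100)

# Read back an arbitrary slice; only the chunks covering it are fetched,
# which is what makes remote (SFTP/HTTPS) access practical.
block = z[30:40, :10]
print(block.shape)  # (10, 10)
```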

I think where a SQLite backend could be useful is in (perhaps significantly?) improving I/O speed for local (non-HPC) workflows, so let's keep the issue open for now. Thanks for your insight!

dkfellows commented 2 weeks ago

Networked filesystems would be the main sticking point, as locking semantics on NFS and CIFS are ghastly (to the point where SQLite explicitly doesn't support them). And WAL mode doesn't work in that situation either (it's otherwise able to support much higher levels of parallel access with only minor tuning).
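For reference, the tuning in question is roughly this (a sketch, hypothetical file name); on a local disk it's essentially a one-liner, but the WAL's shared-memory index file is exactly what breaks on NFS/CIFS:

```python
import sqlite3

conn = sqlite3.connect("workflow.db", timeout=30)
# WAL lets readers run concurrently with a single writer, which covers most
# local-disk concurrency needs without application-level locking.
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")  # common companion setting for WAL
```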

Slicing blobs is relatively easy now that the incremental blob access API is there. In expanded (row-based) form, OTOH, you'd naturally express everything in terms of slices anyway; that's just how you write queries over result sets (and the full contents would just be a special case: a "slice" of "everything").
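A minimal sketch of that blob-slicing idea (hypothetical table and column names; the `Connection.blobopen()` wrapper needs Python 3.11+, though SQLite's underlying incremental blob I/O has been around much longer):

```python
import sqlite3

conn = sqlite3.connect("workflow.db")
conn.execute("CREATE TABLE IF NOT EXISTS results (id INTEGER PRIMARY KEY, data BLOB)")
# Reserve space up front so the blob can be filled/read incrementally.
conn.execute("INSERT OR REPLACE INTO results (id, data) VALUES (1, zeroblob(1048576))")
conn.commit()

# Read a slice of the stored blob without pulling the whole value into memory.
with conn.blobopen("results", "data", 1) as blob:
    blob.seek(1024)
    chunk = blob.read(4096)
print(len(chunk))  # 4096
```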