jsherfey / dnsim

Dynamic Neural Simulator - a modular modeling tool for large ODE systems.

Consolidate open data output filetype #2

Open · asoplata opened this issue 9 years ago

asoplata commented 9 years ago
 % save spec in a unique specfile in the model output dir
specfile = fullfile(rootoutdir,'model',[prefix '_model-specification.mat']);
save(specfile,'spec','parms');
filenames{end+1}=specfile;
Suggested Options
  1. single JSON file.
    • Pros:
      • Plaintext, and decently human-readable
      • Very widespread adoption of a single standard; as far as I can tell it's the Visa card of hierarchical data formats -- practically everything can process it
    • Cons:
      • Plaintext, so there's no inherent, fancy compression built-in -- it will take up more space
      • Based on my testing, its size also grows faster with increasing data size than the equivalent MATLAB .mat files
      • Depending on the efficiency of the JSONlab library, writing the JSON to disk may take considerable time. In my largest test (see below), the 40,000-time-point simulation ran quickly, but exporting the JSON to disk actually took longer than running the simulation itself. (A sketch of the export call appears after this list.)
      • Testing (all with dt = 0.01, probably in milliseconds, using 1 typical HH cell):
      • with 4,000 time points, the simulation's .mat file is 36.5 MB, but the JSON exported from MATLAB is 55.7 MB.
      • with 40,000 time points (10x, making this about 4 seconds), the simulation's .mat file is 74.4 MB, but the JSON exported from MATLAB is 232.2 MB.
      • if that's a single cell (4 state variables) for 4 seconds, then 200 cells for 20 seconds -- one of the biggest scenarios -- would mean a very, very big JSON. So big, in fact, that it would probably meet the smallest definition of "big data": the data no longer fits in RAM all at once.
      • Apparently dealing with very large JSON data is a thing in web programming, e.g. when consuming large-scale Twitter-style APIs. So there are tools for "streaming" JSON data, a la https://www.opencpu.org/posts/jsonlite-streaming/ . This could complicate calculation of summary statistics, though, since not all the data is held in memory at once (but that's sort of a general problem anyway).
  2. single JSON file + compression
    • I've tried gzip-ing the larger JSON file, but it still doesn't get down to the size of the .mat version. Compression would also add packing/unpacking overhead, and it only helps with storage of the raw data -- I usually go so far as to delete my raw data after processing it, since one should always be able to recreate it if need be. Plus, HDF5 has built-in compression methods anyway. (A short sketch using MATLAB's built-in gzip appears after this list.)
  3. single JSON file for metadata (including filesystem links to "real" data files) + "real" data files (in something like CSV)
    • Pros:
      • If JSON streaming (see above) proved too unwieldy, this could work, since the tools for using CSV files in data-science languages are very powerful. (A sketch appears after this list.)
    • Cons:
      • Strong increase in complexity -- we'd need to implement a naming scheme, keep track of the files, etc. This part in particular seems very ugly.
      • Probably very little/no improvement in data size growth compared to pure JSON.
      • This is essentially a "cheap" (in the bad sense) version of what HDF5 is precisely trying to do.
  4. HDF5 (Hierarchical Data Format version 5)
    • Pros:
      • I think this is the serious contender.
      • Supposedly really fast read/write, somehow. I've seen claims that its components are "treated as part of the file system"; it may simply be that the low-level copy/write functions are simple and perhaps even optimized
      • A well-defined open standard. In fact, MATLAB's -v7.3 .mat filetype is itself HDF5-based, save for undisclosed proprietary differences. This also means it's widely supported in data-science languages (Java, MATLAB/Scilab, Octave, IDL, Python, R, and Julia).
      • Has built-in compression capabilities, though I'm unsure how they compare to MATLAB's.
      • Obviously, has built-in hierarchical structure. This enables
        1. Readymade "SQL-like"/"subsettable" data access: part of why it was built for supercomputers is the ease of loading individual portions of the data separately, e.g. for parallel processing.
        2. While its layout isn't instantly obvious the way JSON or CSV is, for the purposes of our data structure it doesn't seem like it will be very complicated to organize our structure inside it.
      • Supposedly "self-describing", which I think means you can define your own operations? I'm not sure
      • Since it's still a single file, you can readily copy it to other computers -- a plus compared to databases -- while still offering an incredible amount of database-like features and organization (like pointers to data) without that shortcoming.
    • Cons:
      • Would have to learn its "way of doing things", though that seems straightforward on initial inspection
      • Would also have to either
        A. manually translate the current data schema into a specific HDF5 structure (problematic, since every change to the schema would then also have to be made in the translator), or
        B. write a function that does that translation for arbitrary structures, which apparently no one has written and submitted to MATLAB Central. That could be because it's ridiculously difficult, but maybe not: if our data structure can be exported to JSON (and it can), it should be expressible in HDF5 too; there just aren't that many data types involved.
      • Esoteric: the fact that it's by and for serious scientific computing could be intimidating to newcomers/people with little programming background, even though it seems simple. But then again, the core data structure is in MATLAB already, and anyone dealing with the HDF5 file directly is doing so because they want to program their own data analysis, so we can assume they're interested in writing code in the first place. Plus, if the open-source data analysis is good enough, they won't even care what the file type is :)
      • Apparently you can't read from and write to individual "dataset" elements at the same time (https://dpservis.wordpress.com/tag/hdf5/), but I'm not sure we need to
    • Testing:
      • I haven't tested this, since it would require translating/mapping our current data structure into a compatible HDF5 layout. (A minimal sketch of the MATLAB HDF5 calls appears after this list.)
  5. a database? This would be fun to implement but a ton of work, and anyone who writes further analysis code would have to learn SQL. In the far future, though, a centralized database for, say, your own personal projects could be very powerful: you could still delete the raw data and keep only the results, yet query the database for simulations/results you've run before; cross-simulation analysis would probably get much easier; you'd have easy ways to track what changed; you could grab/visualize new and old simulations that all fit some parameter sweep at the same time; and we could build an API on top if we ever had a public data server.
  6. Just keep the data structure VERY simple, but still in MATLAB, so that it easily translates to other languages using their MATLAB-data-file-to-Python or -to-R libraries. (A sketch appears after this list.)
    • Pros:
      • If it works it works
      • Ideally, the end result would have the same logical organization, including similar typing, as the original (low probability, though)
    • Cons:
      • We become dependent on these libraries, on MATLAB's file format not changing, and on compatibility between the two. JSON and HDF5 are standardized enough (or, in HDF5's case, give us enough control) that we have a better idea of what we're getting; here we'd be at the mercy of both MATLAB's and the libraries' design decisions.
      • There may be unresolvable logical differences, e.g. rmatio is not currently able to read the same .mat files that R.matlab is, and R.matlab's read-in honestly preserves horribly little structure.
      • Thus, we may end up having to re-map the libraries' output into a similar data schema in the target language anyway -- and that re-mapping would be language-specific.
  7. Other suggestions?
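
For concreteness, here is a minimal sketch of what option 1 might look like, assuming the JSONlab library (savejson/loadjson) is on the MATLAB path; the variable names (spec, parms, data) just mirror the .mat snippet above, not the actual dnsim schema:

% option 1 sketch: export everything as a single JSON file via JSONlab
out = struct('spec',spec,'parms',parms,'data',data);
jsonfile = fullfile(rootoutdir,'model',[prefix '_model-specification.json']);
savejson('',out,jsonfile);   % plaintext and human-readable, but large
back = loadjson(jsonfile);   % round-trip check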
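
Option 2 would just add MATLAB's built-in gzip/gunzip on top of that:

% option 2 sketch: compress the exported JSON (helps storage only)
gzip(jsonfile);              % writes [jsonfile '.gz'] alongside
delete(jsonfile);            % keep only the compressed copy
gunzip([jsonfile '.gz']);    % must unpack again before loading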
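
Option 3 might look like the following; the naming scheme and the data.E_v field are made up purely for illustration:

% option 3 sketch: JSON metadata file pointing at "real" CSV data files
csvfile = fullfile(rootoutdir,'data',[prefix '_E_voltage.csv']);
csvwrite(csvfile,data.E_v);  % time points x cells matrix
meta = struct('spec',spec,'parms',parms,'datafiles',{{csvfile}});
savejson('',meta,fullfile(rootoutdir,'model',[prefix '_meta.json']));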
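
For option 4, a minimal sketch using MATLAB's high-level HDF5 interface; the dataset name, the dt attribute, and the assumption that data is a time-points-by-cells matrix are all illustrative:

% option 4 sketch: one dataset per population, with built-in compression
h5file = fullfile(rootoutdir,'model',[prefix '_data.h5']);
h5create(h5file,'/E/voltage',size(data), ...
    'ChunkSize',[min(1024,size(data,1)) size(data,2)],'Deflate',5);
h5write(h5file,'/E/voltage',data);
h5writeatt(h5file,'/','dt',0.01);  % metadata travels inside the file
% the "subsettable" part: load only the first 1000 time points of cell 1
chunk = h5read(h5file,'/E/voltage',[1 1],[1000 1]);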
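
Finally, option 6 might amount to saving a flat struct of plain numeric matrices with -v7 (not -v7.3), since that's the format readers like scipy.io.loadmat and R.matlab handle; the field names here are hypothetical:

% option 6 sketch: keep the saved structure flat and simple
out = struct();
out.t     = t;         % time vector
out.E_v   = data.E_v;  % plain numeric matrices, no nested cell arrays
out.parms = parms;
save(fullfile(rootoutdir,'model',[prefix '_simple.mat']),'out','-v7');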