neuropoly / data-management

Repo that deals with datalad aspects for internal use

data management journal club #150

Open kousu opened 2 years ago

kousu commented 2 years ago

https://github.com/matthew-brett/czi-nibabel got a grant to make a new neuroimaging data format. Or maybe just spec out existing ones.

They were thinking about HDF5 but have since developed reservations about it.

They are organizing a journal club to talk about it, starting with this paper on ASDF:

P. Greenfield, M. Droettboom, E. Bray, ASDF: A new data format for astronomy, Astronomy and Computing, Volume 12, 2015, Pages 240-251, ISSN 2213-1337

We should get involved.

Tagging @naga-karthik @uzaymacar @andreanne-lemay @charleygros @sandrinebedard @alexfoias @dpapp86 @taowa

kousu commented 2 years ago

Well, I know I have an immediate comment to make at Journal Club. That first paper's abstract says

Advanced Scientific Data Format (ASDF) and is based on an existing text format, YAML

YAML has a lot of problems of its own. Its types are extendable, but in practice only when combined with Python + pickle, which brings code-injection vulnerabilities along for the ride. It's easy to break a multiline string without noticing (we did this last year and silently broke our CI). It has barewords (like Perl), so a value like "no" is implicitly False (when it could just as well mean "Norway", "Navigation Order", etc.). Like JSON, it doesn't have a canonical form, so you can't hash it safely. And it's a recursive language, which makes it expensive to parse: you can't just run grep or sed over it safely, you need to load the entire thing into memory with a real YAML parser.
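To make the hashing point concrete, here is a minimal sketch using only Python's standard library. It uses JSON as a stand-in, since it shares the problem, and the same reasoning should carry over to YAML. The sample documents and the helper names `raw_digest` and `normalized_digest` are made up for illustration, and the sorted-keys normalization is just one ad hoc convention, not a standard canonical form.

```python
import hashlib
import json

# Two hypothetical metadata documents holding the same data,
# differing only in key order and whitespace.
doc_a = '{"subject": "sub-01", "tr": 2.0}'
doc_b = '{ "tr": 2.0,\n  "subject": "sub-01" }'


def raw_digest(text: str) -> str:
    """Hash the serialized bytes exactly as written."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def normalized_digest(text: str) -> str:
    """Parse, re-serialize with one fixed convention, then hash.

    Sorting keys and fixing separators is an ad hoc convention chosen
    for this sketch; both sides have to agree on it out of band.
    """
    data = json.loads(text)
    normalized = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# The parsed data is identical...
same_data = json.loads(doc_a) == json.loads(doc_b)
# ...but hashes of the raw text disagree...
raw_match = raw_digest(doc_a) == raw_digest(doc_b)
# ...and only agree once both are forced through the same convention.
normalized_match = normalized_digest(doc_a) == normalized_digest(doc_b)

print(same_data, raw_match, normalized_match)
```

So a checksum over the file only tells you the bytes changed, not whether the data did, unless everyone writing the file agrees on one serialization convention.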

I bet googling would find a lot of other problems with YAML.

Its one big advantage over JSON is that it allows comments. But for long-term scientific data? I dunno.

kousu commented 2 years ago

Meeting minutes: https://docs.google.com/document/d/1dAbHrDLXxJJJu0A-WxsQTomD738_6JvQ-wGIx_vwR5g/edit#