address JSON format issues

mitzimorris commented 3 years ago

while I appreciate the use of JSON as a data format, there are a few concerns:

~for large input datasets which are essentially tabular in shape, JSON is a very verbose format.~
JSON doesn't have enough structure for arrays and matrices:
- cannot represent an empty N-dim array in JSON - all empty arrays are "[ ]"
- doesn't care about array shape
concepts of NaN, neg inf, inf, and missing data have varying representations across different platforms and libraries.

suggestions:

make explicit allowed encodings for NaN and friends
~for strictly tabular datasets, allow CSV file format as well~
for multi-dim numeric data, allow appropriate encoding

update: CSV and tabular data is not that relevant for the kinds of data needed for posteriordb models - striking this point.

MansMeg commented 3 years ago

I agree with you about the problems. I just find it to be the least bad alternative. The problem you point to is good to address. I have actually had this as a problem with the covid19 model. Although, I cant come up with a better alternative. The focus has been on making the use as simple as possible.

Regarding your suggestions: 1) I agree. this should have been done long ago. 2) The problem with CSV is that it will not allow any nesting so it would only work for one simple matrix? Adding that will most likely increase the maintenance burden and reduce the simplicity of use. I would like to avoid this. Is there any benefit except verbose of storing like this? 3) I do not really understand what you mean? A specific encoding for dimensions?

Thanks!

mitzimorris commented 3 years ago

for #3, I was more thinking about other encoding formats - NetCDF? (never used it myself - it's what Arviz is using)

for #2, not a fan of CSV by any means, and the more formats supported, the greater the maintenance burden. but for the benchmarking project, big models require big data - for large regression datasets, if the data already exists in CSV format, then it would be better to offer conversion in a way that is known and controlled instead of putting this burden on the submitter.

mitzimorris commented 3 years ago

actually, take that back - JSON format for data is good for CmdStan, so JSON is the only appropriate choice for input data.

JSON format defined here: https://mc-stan.org/docs/2_25/cmdstan-guide/json.html

ahartikainen commented 3 years ago

I would suggest that every json should have also metadata for variable dimensions and datatype (int vs float).

I think there are ways to circumvent nan and inf issues.

Also, every json should be explicitly be encoded in utf-8 or utf-16.

mitzimorris commented 3 years ago

most JSON libraries can handle "nan", "inf", "+inf", and "-inf", but documentation should be explicit on strings and interpretation.

+1 for metadata - note the JSON doesn't differentiate between int and float - need metadata for both numeric type as well as array shape.

MansMeg commented 3 years ago

Alright. So we have converged toward JSON as an independent format for Stan data with these additions. Should we just add these things and create a JSON schema for this? Or are there already JSON formats that would solve this for us with known schemas?

MansMeg commented 3 years ago

So the question is how we continue with this JSON issue? I guess we would like to use just one format that is used both by RcmdStan and posteriordb. I could just run everything through the write_stan_json(). Would that solve your issue @mitzimorris ? Or should we add the metadata as well first?

Ideally we should have a schema for this so also other outside stan could write and read this format.

ahartikainen commented 3 years ago

Does R force the encoding or is it using the system defaults?

Also, how many digits is needed?

rok-cesnovar commented 3 years ago

Catching up on the discussion here.

write_stan_json() does not do anything all that special wrt to the JSON format. It does the following:

TRUE/FALSE is converted to 1/0
"unboxes scalars": N = 10 is stored as N : 10, not N: [10] (the default for the R package we use for writing is N: [10].
handles R lists of matrices. This is a rarity as far as I know and not something posteriordb needs to worry about. I think.

The NaN and Inf representation is a pain, but I think as previously mentioned, just pick a format, suggest a R/Python package that supports that formatting, and be done with that. Standardization here will not happen any time soon and if you just suggest a JSON reader that supports the representation used here, I think that is more than enough. NaN/Inf in posterior distributions or input data is going to be rare I imagine anyways?

We had a discussion on JSON and representing special numbers in a design doc in January (sidenote: I seriously thought this discussion happened 2 years ago :) ). Maybe skim that so you dont need to rediscover stuff I researched then.

I am not sure how big of a problem a lack of dimensions is. Any JSON reader will obtain the dimensions by reading the JSON file. The only issue with dimensions is representing a 0xN matrix or 2D array or any 0xNx...xM K-dimensional array.

[[],[],[],[],[]]

is a 4x0 2d array (or matrix) but you cant write out a 0x4 array.

My question here is if it's worth it to write out metadata just because of this case?

I don't think there is a problem with vector and row_vector distinction. The first question here is if it matters for a posterior distribution if the result is a vector or a row vector? If the dimensionality is important, we can then represent that as either

[[1],[2],[3],[4],[5]]

which is a 5x1 structure or

[1,2,3,4,5]

which is a 1x5 structure.

Or are there other issues here i am missing?

Also, every json should be explicitly be encoded in utf-8 or utf-16.

+1 on that!

mitzimorris commented 3 years ago

The only issue with dimensions is representing a 0xN matrix or 2D array or any 0xNx...xM K-dimensional array.

we ran into an issue with input data where users supplied empty multi-dim arrays - we had to hack Stan's var_context reader to address this - https://github.com/stan-dev/stan/issues/2867

don't know if this is a concern for posteriordb - perhaps?

jgabry commented 3 years ago

don't know if this is a concern for posteriordb - perhaps?

Maybe not much of a concern. I don't think it's necessary for posteriordb to have models that use empty data arrays (I can't think of a case where that's essential for what posteriordb is intended for). But I could be wrong!

MansMeg commented 3 years ago

I think I have run into the problem of the difference between a 111 array and an ordinary scalar in R for some model, but I do not remember how I solved that.

I suggest that I, for now, add that the JSON files should be in UTF-8 format and that we open up to have an __info slot in data JSON to define specific information if that comes up further down the line. Does that sound like a solution for now?

ahartikainen commented 3 years ago

Sounds good.

MansMeg commented 2 years ago

I think everyone agreed to the proposed solution so I close this for now.

stan-dev / posteriordb

address JSON format issues #208