JSON un-nesting - Githubissues

How do we handle nested JSON objects as column values?

TL;DR: Unnest them, create a column for each of the unnested keys, and proceed as usual.

Training interface

Use pandas.json_normalize (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) on the column.

unnest_constant --> Drop every new column whose fraction of null values is greater than unnest_constant, in order to ignore any "rare" fields.

Example:

0.4

Default: 0.5
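
As a rough sketch of the rule above, assuming the nested values arrive as Python dicts in a pandas DataFrame: the helper name `unnest_column` and the sample data are made up for illustration and are not part of mindsdb_native.

```python
import pandas as pd

def unnest_column(df, column, unnest_constant=0.5):
    """Hypothetical helper: unnest one column of dicts and drop rare fields."""
    # Treat missing or non-dict cells as empty objects so every row yields one record.
    records = [v if isinstance(v, dict) else {} for v in df[column]]
    flat = pd.json_normalize(records)  # nested keys are joined with "."
    flat.index = df.index
    flat = flat.add_prefix(f"{column}.")
    # Fraction of nulls per new column; drop columns above the threshold.
    null_fraction = flat.isna().mean()
    keep = null_fraction[null_fraction <= unnest_constant].index
    return df.drop(columns=[column]).join(flat[keep])

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "a": [
        {"b": 1, "c": {"d": "x"}},
        {"b": 2, "c": {"d": "y"}},
        {"b": 3},
        {"b": 4, "rare": True},
    ],
})
result = unnest_column(df, "a", unnest_constant=0.4)
print(list(result.columns))  # ['id', 'a.b']
```

Here a.c.d is null in 50% of the rows and a.rare in 75%, so both exceed 0.4 and are dropped. With the default of 0.5, a.c.d would be kept.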

unnested_fields --> JSON/dictionary that specifies how to unnest a JSON column. Example:
```
{
    "key1": 1,  # will be used
    "key2": 0,  # will not be used
    "key3": {
        "key4": 0.7,  # will not be used unless it occurs in at least 70% of the rows
        "key5": 0  # will not be used
    }
}
```
Default: {}

Note: if a key hierarchy is not present in the above, it will be treated based on its frequency of occurrence.
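
One possible reading of these semantics, sketched in plain Python. The function names `flatten_spec` and `select_fields` are hypothetical. The interpretation is also an assumption: 0 means never, 1 means always, a value in between is a minimum occurrence fraction, and keys missing from the spec fall back to unnest_constant.

```python
def flatten_spec(spec, prefix=""):
    """Turn a nested unnested_fields spec into {dotted_key: threshold}."""
    flat = {}
    for key, value in spec.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_spec(value, prefix=f"{path}."))
        else:
            flat[path] = float(value)
    return flat

def select_fields(occurrence, unnested_fields=None, unnest_constant=0.5):
    """Decide which unnested keys to keep.

    occurrence maps each dotted key to the fraction of rows in which it is present.
    """
    thresholds = flatten_spec(unnested_fields or {})
    kept = []
    for key, fraction in occurrence.items():
        if key in thresholds:
            t = thresholds[key]
            # Assumed semantics: 1 = always use, 0 = never use,
            # otherwise t is the minimum occurrence fraction.
            use = t == 1 or (0 < t < 1 and fraction >= t)
        else:
            # Assumed fallback: keep the key unless its null fraction
            # exceeds unnest_constant.
            use = (1 - fraction) <= unnest_constant
        if use:
            kept.append(key)
    return kept

spec = {"key1": 1, "key2": 0, "key3": {"key4": 0.7, "key5": 0}}
occurrence = {
    "key1": 0.1, "key2": 0.9,
    "key3.key4": 0.75, "key3.key5": 1.0,
    "key6": 0.9, "key7": 0.2,  # not in the spec
}
kept = select_fields(occurrence, spec)
print(kept)  # ['key1', 'key3.key4', 'key6']
```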

Predict interface

We use dot notation for the unnested columns, so a column a = { b: 43, c: { d: 'fsddsd', e: 55 } } becomes a.b, a.c.d, a.c.e.

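This will allow us to query the predictor from a Mongo client using the unnested columns, since Mongo also has . notation. E.g. the two statements below are both valid Mongo syntax. They match the same documents when size contains only uom (the second form is an exact match on the whole embedded document, while the first matches on the nested field alone):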

db.inventory.find( { "size.uom": "cm" } )

db.inventory.find( { size: {  uom: "cm" } } )

So we should be able to query the predictor from Mongo using the original a column as JSON, which will then get unnested. We should also be able to query using the first notation above and specify the unnested columns directly: since we use . in the column names, it's the same as Mongo's . notation.

Similarly, with the HTTP and Native interfaces, we'll be able to use either the raw JSON, which will get unnested, or the unnested columns.
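
A minimal sketch of how predict-time input could be normalized so that both forms end up identical. The names `flatten` and `normalize_query` are illustrative only and are not existing mindsdb_native functions.

```python
def flatten(value, prefix):
    """Flatten a nested dict into {dotted_name: leaf_value}."""
    out = {}
    for key, child in value.items():
        name = f"{prefix}.{key}"
        if isinstance(child, dict):
            out.update(flatten(child, name))
        else:
            out[name] = child
    return out

def normalize_query(query):
    """Accept raw JSON values, dotted column names, or a mix of both."""
    row = {}
    for column, value in query.items():
        if isinstance(value, dict):
            # Raw JSON for a nested column: unnest it into dotted columns.
            row.update(flatten(value, column))
        else:
            # Already a flat (possibly dotted) column: pass through unchanged.
            row[column] = value
    return row

raw = {"a": {"b": 43, "c": {"d": "fsddsd", "e": 55}}}
dotted = {"a.b": 43, "a.c.d": "fsddsd", "a.c.e": 55}
same = normalize_query(raw) == normalize_query(dotted) == dotted
```

With this shape, the interfaces that only accept a column list would simply pass the dotted form, which goes through untouched.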

In Scout and in the SQL API, which both require the column list, we'll only be able to use the . notation.

Where do we put this code?

Probably in the DataExtractor. I'm not sure, but I think that's the best place.

mindsdb / mindsdb_native

JSON un-nesting #420
