unnest_constant --> Drop every new column whose fraction of null values is greater than unnest_constant, so that any "rare" fields are ignored.
Example:
0.4
Default: 0.5
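A minimal sketch of the unnest_constant rule, assuming the unnested rows are available as plain dictionaries keyed by dotted column name and that a missing key counts as null. The function name drop_rare_columns is hypothetical and only used for illustration:

```python
def drop_rare_columns(rows, unnest_constant=0.5):
    """Drop columns whose fraction of null values exceeds unnest_constant.

    `rows` is a list of flat dicts; a missing key or a None value is null.
    """
    if not rows:
        return []

    # Collect every column name, preserving first-seen order.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    total = len(rows)
    kept = [
        col for col in columns
        if sum(1 for row in rows if row.get(col) is None) / total <= unnest_constant
    ]
    return [{col: row.get(col) for col in kept} for row in rows]


rows = [
    {"a.b": 1, "a.c": "x"},
    {"a.b": 2},
    {"a.b": 3},
    {"a.b": 4, "a.c": "y"},
    {"a.b": 5},
]
# "a.c" is null in 3 of 5 rows, which is above 0.4, so it is dropped.
cleaned = drop_rare_columns(rows, unnest_constant=0.4)
```

With unnest_constant=0.4 only a.b survives here; raising the constant to 0.7 would keep a.c as well.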
unnested_fields --> JSON/Dictionary that specifies how to unnest a json
Example:
{
"key1": 1, # will be used
"key2": 0, # will not be used
"key3": {
"key4": 0.7, # will be used only if it occurs in at least 70% of the rows
"key5": 0 # will not be used
}
}
Default: {}
Note: if a key hierarchy is not present in unnested_fields, it will be kept or dropped based on its frequency of occurrence.
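One way to read the unnested_fields rules, as a sketch only. It assumes that 1 means "always keep", 0 means "never keep", a value in between is the minimum fraction of rows the key must occur in, and that anything not listed falls back to unnest_constant. The helper names lookup_rule and keep_column are hypothetical:

```python
def lookup_rule(unnested_fields, dotted_name):
    """Walk the spec along a dotted column name; return None if not listed."""
    node = unnested_fields
    for part in dotted_name.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    # A dict here means only deeper keys are specified, not this one.
    return None if isinstance(node, dict) else node


def keep_column(dotted_name, occurrence, unnested_fields, unnest_constant=0.5):
    """Decide whether an unnested column is kept.

    `occurrence` is the fraction of rows in which the key is non-null.
    """
    rule = lookup_rule(unnested_fields, dotted_name)
    if rule is None:
        # Not listed: fall back to the null-fraction rule (assumption).
        return (1 - occurrence) <= unnest_constant
    if rule == 1:
        return True
    if rule == 0:
        return False
    return occurrence >= rule


spec = {"key1": 1, "key2": 0, "key3": {"key4": 0.7, "key5": 0}}
decisions = {
    "key1": keep_column("key1", 0.01, spec),            # forced on
    "key2": keep_column("key2", 1.0, spec),             # forced off
    "key3.key4": keep_column("key3.key4", 0.6, spec),   # below 70%
    "unlisted": keep_column("unlisted", 0.9, spec),     # frequency fallback
}
```

Whether the keys in unnested_fields are relative to the JSON column or include its name is left open above; this sketch simply matches whatever dotted name it is given.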
Predict interface
We use dot notation for the unnested columns, so col a = { b: 43, c: { d: 'fsddsd', e: 55 } } becomes a.b, a.c.d, a.c.e.
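This will allow us to query the predictor from the mongo client using the unnested columns, since mongo has . notation. E.g. the two statements below are both valid mongo syntax and address the same nested field (strictly, in mongo the second form is an exact match on the whole embedded document, so the two only return the same rows when size has no other keys):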
db.inventory.find( { "size.uom": "cm" } )
db.inventory.find( { size: { uom: "cm" } } )
So we should be able to query the predictor from mongo using the original a column as a JSON, which will then get unnested. We should also be able to query using the first notation above and specify the unnested columns directly: since we use . in the column names, it's the same as mongo's . notation.
Similarly, with the HTTP+Native interfaces, we'll be able to use either the raw JSON, which will get unnested, or the unnested columns.
In Scout and in the SQL API, which both require the column list, we'll only be able to use the . notation.
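To illustrate why both input styles can be accepted, here is a small sketch of the flattening step in plain Python. The function name flatten is hypothetical; the point is that a raw nested JSON and an already-dotted input end up as the same set of columns:

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single dict with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Raw JSON for column a, as it could arrive from mongo or HTTP.
raw = {"a": {"b": 43, "c": {"d": "fsddsd", "e": 55}}}
# The same input given directly with the unnested column names.
dotted = {"a.b": 43, "a.c.d": "fsddsd", "a.c.e": 55}

from_raw = flatten(raw)
from_dotted = flatten(dotted)  # already flat, so it passes through unchanged
```

Running every predict input through such a step means the predictor itself only ever sees the dotted column names.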
Where do we put this code?
Probably in the DataExtractor. I'm not sure, but I think that's the best place.
How do we handle nested JSONs as values for columns?
TL;DR: Unnest them, create columns for each of the unnested keys, proceed as usual.
Training interface
Use pandas.json_normalize (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) on the column.
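A rough sketch of what that could look like, assuming the training data is a pandas DataFrame and the nested values are Python dicts. The helper name unnest_column is hypothetical, and treating non-dict values as empty is an assumption rather than a settled decision:

```python
import pandas as pd


def unnest_column(df, column):
    """Replace a column of nested dicts with one dotted column per leaf key."""
    # Treat anything that is not a dict (None, NaN, ...) as an empty record.
    records = [value if isinstance(value, dict) else {} for value in df[column]]
    flat = pd.json_normalize(records)  # nested keys are joined with "."
    flat.columns = [f"{column}.{name}" for name in flat.columns]
    flat.index = df.index  # align with the original rows before joining
    return df.drop(columns=[column]).join(flat)


df = pd.DataFrame({
    "id": [1, 2],
    "a": [{"b": 43, "c": {"d": "fsddsd", "e": 55}}, {"b": 7}],
})
unnested = unnest_column(df, "a")
# Columns are now id, a.b, a.c.d and a.c.e; keys missing in a row become NaN.
```

The NaN values produced for missing keys are what the unnest_constant check would then count as nulls.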