Closed: rgieseke closed 4 years ago
I also have the problem that values such as 'n/a' are not interpreted as NaNs in my data file. When reading a CSV file, the na_values argument is explicitly set to "" to disable interpreting, but the allowed values should come from the "missing_values" key.
PS. Thanks for the nice package anyway!
Probably requires a combination here:
Pandas Docs:
na_values scalar, str, list-like, or dict, optional
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
keep_default_na bool, default True
Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows: If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing. If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing. If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing. If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN. Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.
Frictionless Spec:
missingValues dictates which string values should be treated as null values. This conversion to null is done before any other attempted type-specific string conversion. The default value [ "" ] means that empty strings will be converted to null before any other processing takes place. Providing the empty list [] means that no conversion to null will be done, on any value.
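One way the two descriptions could be combined is sketched below: pass the Frictionless missingValues list as na_values and switch off the pandas defaults with keep_default_na=False. The helper name read_with_missing_values and the sample data are made up for illustration, and this is not necessarily how the package does it.

```python
import io

import pandas as pd

# Hypothetical sample data: one 'n/a', one empty field, one '-'.
CSV_TEXT = "id,value\n1,n/a\n2,\n3,-\n"


def read_with_missing_values(text, missing_values):
    """Read CSV text, treating only the given strings as NaN.

    keep_default_na=False disables the built-in pandas NaN strings,
    so only the entries of `missing_values` become NaN. An empty
    list means no string is converted at all.
    """
    return pd.read_csv(
        io.StringIO(text),
        na_values=list(missing_values),
        keep_default_na=False,
    )


# Frictionless default [""]: only the empty field becomes NaN.
default_df = read_with_missing_values(CSV_TEXT, [""])

# Custom list: 'n/a' and '-' become NaN, the empty field stays a string.
custom_df = read_with_missing_values(CSV_TEXT, ["n/a", "-"])

# Empty list: nothing is converted.
none_df = read_with_missing_values(CSV_TEXT, [])
```

With this approach 'n/a' is only read as NaN when the data package lists it in missingValues, which seems to match the spec wording quoted above.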
Can you try this one (just released)?
Thanks @rgieseke! However, I noticed that the missingValues key is now looked up in the wrong place. According to the Frictionless specs it should be under the table schema, but it is currently looked up among the top-level resource keys:
if "missingValues" in resource.keys():
    missing_values = resource["missingValues"]
else:
    missing_values = ['']
The right place should be resource["schema"]["missingValues"].
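A minimal sketch of the lookup under the schema, assuming the resource descriptor is a plain dict. The function name get_missing_values is hypothetical, and the fallback to [""] follows the default given in the Frictionless spec.

```python
def get_missing_values(resource):
    """Return the missingValues list from the resource's table schema.

    Falls back to the Frictionless default [""] when the schema or
    the key is absent. A schema given as a path or URL string is not
    resolved here (assumption: only inline dict schemas are handled).
    """
    schema = resource.get("schema")
    if not isinstance(schema, dict):
        return [""]
    return schema.get("missingValues", [""])


# Key under the schema is picked up.
inline = get_missing_values({"schema": {"missingValues": ["n/a", "-"]}})

# A top-level key is ignored and the default is used.
top_level = get_missing_values({"missingValues": ["n/a"], "schema": {"fields": []}})

# An explicit empty list is kept, meaning no conversion to null.
empty = get_missing_values({"schema": {"missingValues": []}})
```

Using .get() with a default keeps the explicit empty list [] distinct from a missing key, which matters because the spec gives the two cases different meanings.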
Made a release: https://pypi.org/project/pandas-datapackage-reader/0.16.0/
https://specs.frictionlessdata.io/table-schema/#other-properties