rgieseke / pandas-datapackage-reader

Data Package reader for Pandas

Read "missingValues" field #2

Closed rgieseke closed 4 years ago

rgieseke commented 7 years ago

https://specs.frictionlessdata.io/table-schema/#other-properties

ghost commented 4 years ago

I also have the problem that values such as 'n/a' are not interpreted as NaN in my data file. When reading a CSV file, the na_values argument is explicitly set to "" to disable the default interpretation, but the allowed values should come from the "missingValues" key.

PS. Thanks for the nice package anyway!

rgieseke commented 4 years ago

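This probably requires combining the pandas na_values and keep_default_na options so that they match the missingValues semantics of the spec: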

Pandas Docs:

na_values scalar, str, list-like, or dict, optional

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

keep_default_na bool, default True

Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows:

    If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.

    If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.

    If keep_default_na is False, and na_values are specified, only the NaN values specified in na_values are used for parsing.

    If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.

Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.
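As a rough illustration of the four combinations quoted above, here is a minimal sketch. The sample CSV and the helper name `na_mask` are made up for this example; only `pandas.read_csv` and its `na_values` and `keep_default_na` parameters come from the docs.

```python
import io

import pandas as pd

# Hypothetical sample data: 'n/a' and the empty string are pandas default
# NA strings, while '-' is not.
CSV = "name,value\na,1\nb,n/a\nc,-\nd,\n"


def na_mask(**kwargs):
    """Read the sample CSV and report which 'value' cells were parsed as NaN."""
    df = pd.read_csv(io.StringIO(CSV), **kwargs)
    return df["value"].isna().tolist()


# keep_default_na=True, no na_values: only the defaults ('n/a', '') become NaN.
defaults_only = na_mask()

# keep_default_na=True plus na_values: '-' is added to the defaults.
defaults_plus_dash = na_mask(na_values=["-"])

# keep_default_na=False plus na_values: only '-' becomes NaN.
dash_only = na_mask(na_values=["-"], keep_default_na=False)

# keep_default_na=False, no na_values: no string is parsed as NaN.
nothing = na_mask(keep_default_na=False)
```

Each result is a list of booleans, one per data row, so the effect of every combination on the rows `1`, `n/a`, `-` and the empty cell can be compared directly.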

Frictionless Spec:

missingValues dictates which string values should be treated as null values. This conversion to null is done before any other attempted type-specific string conversion. The default value [ "" ] means that empty strings will be converted to null before any other processing takes place. Providing the empty list [] means that no conversion to null will be done, on any value.
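One possible way to map these spec semantics onto the pandas options is sketched below. The function name `read_csv_na_kwargs` is hypothetical, and this is an assumption about a reasonable mapping, not necessarily what the package does.

```python
def read_csv_na_kwargs(missing_values=None):
    """Translate a Table Schema missingValues list into read_csv keyword arguments.

    Hypothetical helper: assumes the spec default [""] when nothing is given.
    """
    if missing_values is None:
        # Spec default: only empty strings are converted to null.
        missing_values = [""]
    if not missing_values:
        # Spec: the empty list means no conversion to null on any value.
        return {"na_filter": False}
    # Use exactly the listed strings and drop the pandas defaults.
    return {"na_values": list(missing_values), "keep_default_na": False}
```

The returned dict could then be passed along as `pd.read_csv(path, **read_csv_na_kwargs(values))`. Setting `keep_default_na=False` matters here, because otherwise strings like 'NA' or 'null' would still be turned into NaN even though the schema does not list them.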

rgieseke commented 4 years ago

Can you try this one (just released)?

https://pypi.org/project/pandas-datapackage-reader/0.15.0/

ghost commented 4 years ago

Thanks @rgieseke! However, I noticed that the missingValues key is now looked up in the wrong place. According to the Frictionless specs it belongs in the table schema, but the code currently checks the top-level resource keys:

        if "missingValues" in resource.keys():
            missing_values = resource["missingValues"]
        else:
            missing_values = ['']

The right place is resource["schema"]["missingValues"].
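A minimal sketch of the lookup under the schema could look like this. The helper name `get_missing_values` is hypothetical, the resource is assumed to be a plain dict, and this is not necessarily the code that ended up in the package.

```python
def get_missing_values(resource):
    """Return missingValues from the resource's table schema.

    Falls back to the spec default [""] when the key is absent or when
    the schema is not an inline dict (an assumption made for this sketch).
    """
    schema = resource.get("schema")
    if isinstance(schema, dict) and "missingValues" in schema:
        return schema["missingValues"]
    return [""]


# Example resource descriptors (made up for illustration).
with_schema = {"name": "data", "schema": {"fields": [], "missingValues": ["n/a", ""]}}
top_level_only = {"name": "data", "missingValues": ["n/a"], "schema": {"fields": []}}

found = get_missing_values(with_schema)
ignored = get_missing_values(top_level_only)
```

With this lookup a missingValues key at the top level of the resource is ignored, so `ignored` falls back to the default while `found` picks up the list from the schema.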

rgieseke commented 4 years ago

Made a release: https://pypi.org/project/pandas-datapackage-reader/0.16.0/