javascriptdata / danfojs

Danfo.js is an open-source JavaScript library providing high-performance, intuitive, and easy-to-use data structures for manipulating and processing structured data.
https://danfo.jsdata.org/
MIT License
4.79k stars, 209 forks

Can't specify dtype of csv column #177

Closed spleshakov closed 3 years ago

spleshakov commented 3 years ago

The page https://danfo.jsdata.org/api-reference/input-output/danfo.read_csv says I can specify any option supported by TensorFlow.js ("csvConfigs: other supported Tensorflow csvConfig parameters").

Their documentation says csvConfig.columnConfigs[columnHeader].dtype can be any of int32, float32, bool, or string (https://js.tensorflow.org/api/latest/#data.csv).

However, running the code below doesn't convert the column's values into strings.

service_areas.csv

SERVICE_AREA_ID|ZIPCODE|STATE_CODE|COUNTY_CODE|COUNTY_NAME|SERVICE_AREA_NAME|PLAN_YEAR
"North"|56762|"MN"|"24440"|"Marshall"|"North"|2021
"PR North"|56762|"MN"|"24440"|"Marshall"|"PR North"|2021

javascript

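// Read the pipe-delimited file, asking for ZIPCODE to be parsed as a string.
const service_areas_data = await danfojs.read_csv(
    "zpr_service_areas.txt",
    {
        delimiter: "|",
        columnConfigs: {
            "ZIPCODE": {
                dtype: "string"
            }
        }
    }
)
// Print the dtype that was assigned to each column.
service_areas_data.ctypes.print()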

This results in the following output:

╔═══════════════════╤══════════════════════╗
║                   │ 0                    ║
╟───────────────────┼──────────────────────╢
║ SERVICE_AREA_ID   │ string               ║
╟───────────────────┼──────────────────────╢
║ ZIPCODE           │ int32                ║
╟───────────────────┼──────────────────────╢
║ STATE_CODE        │ string               ║
╟───────────────────┼──────────────────────╢
║ COUNTY_CODE       │ int32                ║
╟───────────────────┼──────────────────────╢
║ COUNTY_NAME       │ string               ║
╟───────────────────┼──────────────────────╢
║ SERVICE_AREA_NAME │ string               ║
╟───────────────────┼──────────────────────╢
║ PLAN_YEAR         │ int32                ║
╚═══════════════════╧══════════════════════╝

This is likely a bug on the TensorFlow.js side: the switch statement at https://github.com/tensorflow/tfjs/blob/623da7ecbada115425888c62bd65df685e2bdd75/tfjs-data/src/datasets/csv_dataset.ts#L253 has a case for every documented dtype except string, so a string column falls through to the default branch, which is parsedValue = valueAsNum;
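To make the suspected fall-through concrete, here is a simplified sketch of that kind of dtype dispatch. It is not the actual TensorFlow.js source: the function names are hypothetical, the bool handling is a simplification, and the only behavior taken from the report above is that "string" has no case of its own and ends up in the numeric default branch.

```javascript
// Simplified sketch of the dispatch described in the report above.
// NOT the actual TensorFlow.js code; both function names are hypothetical.
function parseCellAsReported(value, dtype) {
  const valueAsNum = Number(value);
  if (Number.isNaN(valueAsNum)) {
    // Text that is not numeric stays a string regardless of dtype.
    return value;
  }
  switch (dtype) {
    case "float32":
      return valueAsNum;
    case "int32":
      return Math.floor(valueAsNum);
    case "bool":
      return Boolean(valueAsNum); // simplified for this sketch
    default:
      // "string" has no case, so numeric-looking text lands here.
      return valueAsNum;
  }
}

// One possible shape of a fix (an assumption): honor "string" before
// any numeric conversion happens.
function parseCellWithStringCase(value, dtype) {
  if (dtype === "string") {
    return value; // keep the raw text, e.g. a ZIP code
  }
  return parseCellAsReported(value, dtype);
}

console.log(typeof parseCellAsReported("56762", "string"));     // number
console.log(typeof parseCellWithStringCase("56762", "string")); // string
```

With a dispatch like the first function, a numeric-looking ZIPCODE value is converted to a number even when dtype is "string", which would match the int32 shown in the output above.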

I did open an issue with them (https://github.com/tensorflow/tfjs/issues/4962), but the code snippet using Danfo doesn't work for them, and since I'm not familiar with TensorFlow.js, I can't provide code that reproduces the issue.

risenW commented 3 years ago

@spleshakov Have you seen the astype function (https://danfo.jsdata.org/api-reference/dataframe/dataframe.astype)? It might be what you are looking for, unless you need to set the type explicitly on read.
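As a rough illustration of what a post-read cast amounts to, the sketch below converts one column to strings on plain row objects. It is dependency-free JavaScript, not the Danfo.js astype API; castColumnToString is a hypothetical helper, and the rows reuse values from the sample file above.

```javascript
// Hypothetical helper in plain JavaScript; this is NOT the Danfo.js API.
// It returns new row objects with the given column converted to strings.
function castColumnToString(rows, column) {
  return rows.map((row) => ({ ...row, [column]: String(row[column]) }));
}

// Rows as they might look after a read that parsed ZIPCODE as a number.
const rows = [
  { SERVICE_AREA_ID: "North", ZIPCODE: 56762, PLAN_YEAR: 2021 },
  { SERVICE_AREA_ID: "PR North", ZIPCODE: 56762, PLAN_YEAR: 2021 },
];

const converted = castColumnToString(rows, "ZIPCODE");
console.log(typeof converted[0].ZIPCODE); // string
```

The trade-off is the one raised in this thread: the cast is a second pass over data that has already been parsed as numbers, so anything lost during that first parse (a leading zero, for example) cannot be recovered afterwards.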

spleshakov commented 3 years ago

It is not ideal, since it is an extra line of code and an extra operation, but it works for me. Thank you.