holgerbrandl / krangl

krangl is a {K}otlin DSL for data w{rangl}ing
MIT License

Consider adding option to specify custom NA values while reading a CSV file by passing a list/array of strings #120

Open harshit3610 opened 3 years ago

harshit3610 commented 3 years ago

As per the API docs, the CSVFormat class from apache commons is used to set custom null values while reading files. I am new to krangl, so I may not know of any simple workarounds. Would it be possible to add an option similar to na_values in pandas to make the read operation a little simpler?

holgerbrandl commented 3 years ago

What about

DataFrame.readCSV(File("foo.csv"), CSVFormat.DEFAULT.withNullString("MISSING"))

?

Technically a dedicated argument could be added, but I'm not sure if this would bloat the method signatures in the long run.

A known limitation of the underlying apache commons API is that you can only provide a single null string and not a collection.

harshit3610 commented 3 years ago

Is there a way to remove apache commons as a requirement? Can we provide a mechanism to replace all occurrences of a custom null string with an "NA" value while reading the file? Accepting only a single null string will be a serious limitation in the long run and may affect adoption of the library by enthusiasts. Are there any technical reasons why apache commons must be used?

holgerbrandl commented 3 years ago

Why would we want to replace apache-commons-csv? What would be a better alternative?

I've chosen apache-commons-csv initially here because I could not find any better alternatives.

I see the point that having just a single NA string is limiting, but I don't think it's a major problem.

harshit3610 commented 3 years ago

In the pandas API, read_csv allows passing multiple custom NA values in the following way:

pd.read_csv("data.txt", na_values=['na', 'Not available', "", "-"])

Many data sets are not clean and use several different strings to mark missing data. I was hoping there would be a way to add such an argument (na_values) to the read functions of the krangl API, with the option of passing an array or a list of strings, similar to how pandas does it. Pulling in apache commons as a dependency just to get a single NA string option is too much effort, in my opinion.
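
To make the request concrete, here is a minimal sketch of what a multi-value NA option might do, written in plain Kotlin with no krangl or apache commons calls. The function name replaceNaValues and its signature are hypothetical and only for illustration. It assumes the cells have already been parsed into strings, and it matches the NA markers exactly, without trimming whitespace.

```kotlin
// Hypothetical helper, NOT part of krangl or apache-commons-csv.
// Replaces every cell that exactly equals one of the NA markers with null.
fun replaceNaValues(
    rows: List<List<String>>,
    naValues: Set<String>
): List<List<String?>> =
    rows.map { row ->
        row.map { cell -> if (cell in naValues) null else cell }
    }

fun main() {
    // Same markers as in the pandas call above.
    val naValues = setOf("na", "Not available", "", "-")

    // Already-parsed data cells; CSV parsing itself is out of scope here.
    val rows = listOf(
        listOf("alice", "42", "-"),
        listOf("bob", "na", "Berlin"),
        listOf("carol", "", "Not available")
    )

    // Each row is printed with its NA cells replaced by null.
    replaceNaValues(rows, naValues).forEach(::println)
}
```

Whether such a step would belong inside the read functions or in a separate pass after reading is a design question the thread leaves open.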