Open harshit3610 opened 3 years ago
What about
DataFrame.readCSV(File("foo.csv"), CSVFormat.DEFAULT.withNullString("MISSING"))
?
Technically a dedicated argument could be added, but I'm not sure if this would bloat the method signatures in the long run.
A known limitation of the underlying Apache Commons CSV API is that you can only provide a single null string, not a collection.
Is there a way to remove Apache Commons as a requirement? Can we provide a mechanism to replace all occurrences of a custom null string with an "NA" value while reading the file? Accepting only a single null string will be a significant limitation in the long run and may affect adoption of the library by enthusiasts. Are there any technical reasons why Apache Commons must be used?
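One possible workaround for the single-null-string limit is to normalize the file before handing it to the reader. The sketch below is plain Python using only the standard `csv` module, not krangl or Apache Commons code; `normalize_nulls` and its parameters are hypothetical names chosen for illustration.

```python
import csv
import io


def normalize_nulls(text, null_markers, canonical="NA"):
    """Return CSV text in which every cell equal to a marker becomes `canonical`.

    Hypothetical helper: note that it rewrites header cells too if they
    happen to match one of the markers.
    """
    markers = set(null_markers)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([canonical if cell in markers else cell for cell in row])
    return out.getvalue()


# Made-up sample data with two different missing-value markers.
raw = "id,score\n1,MISSING\n2,-\n3,7\n"
cleaned = normalize_nulls(raw, ["MISSING", "-"])
print(cleaned)
```

The normalized text contains only one missing-value marker, so it could then be read by a parser that accepts a single null string.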
Why would we want to replace apache-commons-csv? What would be a better alternative?
I initially chose apache-commons-csv because I could not find a better alternative.
I see the point that having just a single NA string is limiting, but I don't think it's a major problem.
In the pandas API, read_csv accepts multiple custom NA values in the following way:
pd.read_csv("data.txt", na_values=['na', 'Not available', "", "-"])
In many data sets, the data is messy and several different strings are used to mark missing values. I was hoping there would be a way to add such an argument (na_values) to krangl's read functions, accepting an array or a list of strings, similar to how pandas does it. Having to add Apache Commons as a dependency just to get a single NA string is too much effort, in my opinion.
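To make the request concrete, here is a minimal sketch of what a reader with a list-valued `na_values` argument could look like. It uses only Python's standard `csv` module; `read_csv_with_na` is a hypothetical function for illustration and is not part of krangl, pandas, or Apache Commons.

```python
import csv
import io


def read_csv_with_na(source, na_values=("NA",)):
    """Read CSV rows as dicts, mapping any cell listed in `na_values` to None.

    Hypothetical helper; `source` is any iterable of lines, such as an open file.
    """
    na = set(na_values)
    rows = []
    for row in csv.DictReader(source):
        # Only string cells are compared, so values that are already None stay None.
        rows.append({
            key: (None if isinstance(value, str) and value in na else value)
            for key, value in row.items()
        })
    return rows


# Made-up sample using the same markers as the pandas call above.
data = io.StringIO("name,city\nAda,na\nBob,Not available\nCy,-\nDee,Oslo\n")
rows = read_csv_with_na(data, na_values=["na", "Not available", "", "-"])
for row in rows:
    print(row)
```

Keeping the markers in a set makes the per-cell lookup cheap, and the default of a single "NA" marker preserves the one-null-string behavior for callers who do not pass a list.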
As per the API docs, the CSVFormat class from Apache Commons is used to set a custom null value when reading files. I am new to krangl, so I may not know of any simple workarounds. Is it possible to add an option similar to pandas' na_values to make the read operation a little simpler?