h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0
1.82k stars 157 forks source link

Add parameter `na_string=` to .to_csv() #2166

Open junghoocho opened 4 years ago

junghoocho commented 4 years ago

I encountered an unexpected behavior of to_csv() and fread() regarding its handling of 'NA' string.

When I ran the following code,

import datatable as dt

data = dt.Frame(['a', 'NA'])
print(data[1,0], type(data[1,0]))
data.to_csv("NA-test.csv")
data = dt.fread("NA-test.csv")
print(data[1,0], type(data[1,0]))

I had expected that I would get the same string 'NA <class str>' and 'NA <class str>' from both print statements, because, after all, all I did was just to serialize data to a file and read it back. Unfortunately, the actual result is 'NA <class str>' and 'None <class NoneType>'.

The current behavior is "understandable" given that 'NA' is one of the default strings to represent an NA value. But the combined default behaviors of to_csv() and fread() led to a surprise that I did not expect at all.

I can work around this problem by adding the parameter na_strings=[""] to fread() like the following:

import datatable as dt

data = dt.Frame(['a', 'NA'])
print(data[1,0], type(data[1,0]))
data.to_csv("NA-test.csv")
data = dt.fread("NA-test.csv", na_strings=[""])
print(data[1,0], type(data[1,0]))

But this workaround won't work if I have a true NA value in another column.

Will it be possible to "escape" the output of the string 'NA' in to_csv() by default somehow, so that when it is read back, it remains to be the string 'NA', not the NA value?

st-pasha commented 4 years ago

The limitation of the CSV format is that it is not type-preserving. For example, if you have frame dt.Frame(["1.5", "2.0"]) of type str32, then serializing it to CSV and reading back will produce a column of type float64. There are even no NAs involved here.

The problem is that there is no notion of "type" in a CSV file; all fields are essentially strings. However, this is not very practical from user's perspective -- which is why we try to guess column types based on their content. The guess is usually good enough, but cannot be always perfect.

And even though I agree that having a round-trip datatable->CSV->datatable produce the same result as the original is a very desirable property to have, I do not think it is possible in practice.

There may be a solution in writing to a modified CSV format -- say, to ARFF or CSVY, or even JSON -- but we don't have those writers implemented yet (however, it's probably not that hard).

oleksiyskononenko commented 4 years ago

@junghoocho saving your dataset to CSV is a must or you can also consider saving it to a binary format? In the latter case you won't have any effects like you describe, because it will save an internal representation of a frame.

junghoocho commented 4 years ago

For my current need, na_strings = [""], is a good enough workaround because I don't have any NA value in any column. I was quite surprised with this issue when I was writing some data manipulation code, so I just wanted to report what I observed.

With my current needs aside, the main limitation that I perceive is that currently to_csv() does not allow me to specify the string representation of an NA value. If to_csv() takes an optional parameter like na_string="#NAVALUE", I expect it will handle most people's need, since we can pick a particular string that does not appear in our data when we call to_csv() and fread().

st-pasha commented 4 years ago

Adding the na_string=... parameter into .to_csv() sounds like a good idea.