Open junghoocho opened 4 years ago
The limitation of the CSV format is that it is not type-preserving. For example, if you have frame dt.Frame(["1.5", "2.0"])
of type str32, then serializing it to CSV and reading back will produce a column of type float64. There are even no NAs involved here.
The problem is that there is no notion of "type" in a CSV file; all fields are essentially strings. However, this is not very practical from user's perspective -- which is why we try to guess column types based on their content. The guess is usually good enough, but cannot be always perfect.
And even though I agree that having a round-trip datatable->CSV->datatable produce the same result as the original is a very desirable property to have, I do not think it is possible in practice.
There may be a solution in writing to a modified CSV format -- say, to ARFF or CSVY, or even JSON -- but we don't have those writers implemented yet (however, it's probably not that hard).
@junghoocho saving your dataset to CSV is a must or you can also consider saving it to a binary format? In the latter case you won't have any effects like you describe, because it will save an internal representation of a frame.
For my current need, na_strings = [""]
, is a good enough workaround because I don't have any NA value in any column. I was quite surprised with this issue when I was writing some data manipulation code, so I just wanted to report what I observed.
With my current needs aside, the main limitation that I perceive is that currently to_csv()
does not allow me to specify the string representation of an NA value. If to_csv()
takes an optional parameter like na_string="#NAVALUE"
, I expect it will handle most people's need, since we can pick a particular string that does not appear in our data when we call to_csv()
and fread()
.
Adding the na_string=...
parameter into .to_csv()
sounds like a good idea.
I encountered an unexpected behavior of
to_csv()
andfread()
regarding its handling of 'NA' string.When I ran the following code,
I had expected that I would get the same string 'NA <class str>' and 'NA <class str>' from both print statements, because, after all, all I did was just to serialize data to a file and read it back. Unfortunately, the actual result is 'NA <class str>' and 'None <class NoneType>'.
The current behavior is "understandable" given that 'NA' is one of the default strings to represent an NA value. But the combined default behaviors of
to_csv()
andfread()
led to a surprise that I did not expect at all.I can work around this problem by adding the parameter
na_strings=[""]
tofread()
like the following:But this workaround won't work if I have a true NA value in another column.
Will it be possible to "escape" the output of the string 'NA' in
to_csv()
by default somehow, so that when it is read back, it remains to be the string 'NA', not the NA value?