jtablesaw / tablesaw

Java dataframe and visualization library
https://jtablesaw.github.io/tablesaw/
Apache License 2.0
3.56k stars 645 forks source link

Asymmetric handling of null String values in TextColumn and StringColumn #1236

Open maxhillaert opened 1 year ago

maxhillaert commented 1 year ago

Consider the java test below.

    @Test
    public void assymetricNullHandling() {
        Table t1 = Table.create(TextColumn.create("str1"), TextColumn.create("str2"));
        Row row = t1.appendRow();
        row.setString("str1", null);
        row.setString("str2", "");
        StringWriter w = new StringWriter();
        t1.write().csv(w);
        String csv = w.toString();
        Table t2 = Table.read().csv(CsvReadOptions.builder(new StringReader(csv)).columnTypesToDetect(List.of(ColumnType.TEXT)).build());
        assertThat(t2.getString(0, "str1"), equalTo(null)); // fails as it's empty string ""
        assertThat(t2.getString(0, "str2"), equalTo("")); // succeed
    }

Null values are written as "" and read back as empty string "" , which is the missingValueIndicator for the StringParser. This kind of asymmetry where both null and "" are written as "" destroys the semantical difference between an empty value or a non existent value. Semantically, empty string is not a missing value per se. Empty string has to be distinguished from null.

The missingValueIndicator in StringParser is always "", so i don't see a way around this without just doing this handling myself by using "" or something in my own converters.

Am I missing something?

ccleva commented 11 months ago

Hi @maxhillaert. You are not missing anything.

If your use case requires to distinguish between empty strings and null strings, you have to manage this on your side by using a marker string for null values and workaround it.

Watch out as some common null marker strings are treated as empty values so you would run into the same issue. See #1244 for more information.