jtablesaw / tablesaw

Java dataframe and visualization library
https://jtablesaw.github.io/tablesaw/
Apache License 2.0
3.55k stars 643 forks source link

Column type detection fails in XlsxReader #751

Closed Nadavbi87 closed 3 years ago

Nadavbi87 commented 4 years ago

Hi, First of all, thanks for this awesome package.

I'm using the XlsxReader to import excel file, in one of my tests I came across the following issue: When the reader auto-detect the column types and the first column value is empty, it automatically detects it as a String Column and creates a String typed column, till here everything is fine. When any other value in that column is a Numeric type(in my case all of them) it gets ignored and instead, it returns an empty value. So, as a result, I get a table with a column that all of the values are empty. I would except that I get all the values at least as a string and not empty value.

The logic is in the XlsxReader.private Column<?> appendValue(Column<?> column, Cell cell).

I try to solve it by using the "specifying the datatypes for each column" approach I did the following :

Source source1 = new Source(new ByteArrayInputStream(byteArrayOutputStream.toByteArray()));
List<ColumnType> typesTotalData = Arrays.asList(new ColumnType[] {ColumnType.STRING, ColumnType.STRING, ColumnType.STRING, ColumnType.STRING, ColumnType.INTEGER, ColumnType.INTEGER});
Table table = Table.read().usingOptions(XlsxReadOptions.builder(source1).sheetIndex(0).columnTypesToDetect(typesTotalData));

Then I realize that there is no use in the columnTypedToDetect in the XlsxReader.

*Another suggestion, it will be good if I'm using multiple sheets and using public List<Table> readMultiple(XlsxReadOptions options) throws IOException That I can pass a list of options that each one of them is bounded to a specific sheet, Because in the current situation if I want for example different name ( or different column type list) for each sheet/table I need to perform a multiple reads ( as the number of the sheets ) and set the options for each one even though under the hood for each call it goes over all the sheets and return the sheet that correspond the index the user pass in the options.

I 'm using the latest version 0.37.3.

Thanks, Nadav

lwhite1 commented 4 years ago

Hi, @Nadavbi87, Sorry you're having trouble with the library.

Is it possible to see if you can recreate this by first exporting the Excel file to CSV and see if it has the same problem on import?

ccleva commented 4 years ago

Hi all, I stumbled on this issue a couple of days ago and took the opportunity to dig a bit further:

I could reproduce the problem always by using one of the empty string indicators (TypeUtils.MISSING_INDICATORS) as the first value in the column (after the header). In this case the column is identified as a StringColumn and the following numeric values are ignored (the column is filled with empty values).

The problem does not occur with csv as the column type identification is done in a different way than with the excel reader (which rely on the cellType).

I think there are 2 different issues there:

  1. Creating a StringColumn despite a missing value indicator. Creation could be delayed until a non-missing value is read
  2. Columns with mixed-content are not handled properly: StringColumn are not filled with numerical values as String, and similarly numeric columns will not be converted to StringColumn if non numeric values are found (I have not checked how this is handled with csv).

I can provide a couple of (failing) tests and try to implement a solution to the 1st issue if you want

benmccann commented 4 years ago

We have a ColumnTypeDetector that is used by CsvReader, FixedWidthReader, and HtmlReader and handles both issues that you mention. I think that perhaps XlsxReader was written before ColumnTypeDetector existed.

The difficulty of using ColumnTypeDetector is that it takes Iterator<String[]> rows and in XlsxReader we have a org.apache.poi.ss.usermodel.Row instead of String[] for each row. I wonder if we could make ColumnTypeDetector a little more generic and subclass it for XlsxReader. There's a lot of logic in ColumnTypeDetector that'd be nice to share so that things work consistently across all the readers (e.g. when to use a StringColumn vs TextColumn, sampling, handling of mixed-content columns, etc.)

lujop commented 3 years ago

@benmccann Current type detection isn't based on the values, it's based on cell type on excel. I think that this is not bad, but would be needed to check all the values and select the column type that can accommodate all. In that case, I think that using ColumnTypeDetector isn't needed, isn't it?

benmccann commented 3 years ago

Ah, I wasn't aware Excel provided a cell type