jtablesaw / tablesaw

Java dataframe and visualization library
https://jtablesaw.github.io/tablesaw/
Apache License 2.0
3.55k stars 643 forks source link

XlsxReader casting integers cells to blanks #882

Open Afollet opened 3 years ago

Afollet commented 3 years ago

Hello devs,

Thanks for making/maintaining this library. I am currently in a microservice that validates the data in excel spreadsheets.

I'm reading an excel workbook with multiple sheets in tablesaw-excel.

public static List<Table> readXlsx(InputStream inputStream) throws IOException {
    Source source = new Source(inputStream);
    XlsxReadOptions options = XlsxReadOptions.builder(source).build();
    XlsxReader reader = new XlsxReader();
    return reader.readMultiple(options);
  }

I have cell types in Excel of both numeric and text in a column. When the table is read in, the numeric cell types become blank cells. Therefore, when I try to pull up missing values I get the rows where the numeric type cells are at

Table missing = table.where(table.column("myColumn").isMissing());

Is there some way I can change the behavior of this using XlsxReadOptions? Or is there another method I can use to avoid this behavior?

Thanks!

Afollet commented 3 years ago

So, I made some changes to the XlsxReader to change the behavior as I needed ( I also made some changes so it's easier to use the code base in our project): https://github.com/PDXFinder/pdx-validator/blob/dev/src/main/java/org/pdxfinder/validator/ValidatorXlsxReader.java

I made two changes:

  1. Change the method "appendValue" to return a string representation of a numeric cell if the column type is String.
  2. Change "isBlank" to not count numeric cells with a value of 0 as blank

I added a basic test for this behavior: https://github.com/PDXFinder/pdx-validator/blob/dev/src/test/java/org/pdxfinder/ValidatorXlsxReaderTest.java

I would be willing to make a PR for these changes on this repo. I was not sure if the maintainers are interested in these changes, so let me know.

lujop commented 3 years ago

I think that it will be solved with #909 if we also do an improvement in type detection like commented on #751

With #909 the reader it's able to read number values in String columns And if #751 it's solved will be a more consistent type detection that doesn't depend on the first cell type

lujop commented 3 years ago

@Afollet If you have time to check and can test with the current master version I think that this issue must be solved

Afollet commented 3 years ago

@lujop Can I pull down the snapshot of this version of master? I think that would be easiest for me to test it, but I'm was not able to pull it on https://oss.sonatype.org/

Could you advise me on how to use maven to test these changes?

lujop commented 3 years ago

@Afollet you can just clone the repo and do a mvn install and it will build and install the tablesaw jar in your local maven repository. Then use the dependency with the new version as normal and while in the same machine maven will take the dependency from your local repository.

Afollet commented 3 years ago

@lujop Yes, I have tested your changes and it has resolved this issue.

Thanks for the great work @lujop . Dev's like you make the world go round.