kbss-cvut / s-pipes

Tool for execution of RDF-based pipelines.
GNU Lesser General Public License v3.0
4 stars 5 forks source link

[#228] Skeleton code to implement unified tabular processing #229

Open blcham opened 10 months ago

blcham commented 10 months ago

Implements #228

rodionnv commented 7 months ago

@blcham Please have a look at the current implementation of the #228 . There are also two test fails with merged XLS and XLSX files. The reason is that current implementation also produces statements for the merged cells, like here:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" ;
        <http://onto.fel.cvut.cz/data/bb>
                "" ;
        <http://onto.fel.cvut.cz/data/cc>
                "" .

while the expected output is:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" .

This can be fixed with a control if(cell.getCellType() == CellType.BLANK)continue; But, it will ignore not only merged cells, but also cells that are empty, but filled with color, or contains table border for example. If it's not acceptable, then I believe it will require to store the information about every cell if it's merged or not, which will require more memory. Please share your opinion about this.

blcham commented 6 months ago

Adding missing context:

blcham commented 6 months ago

@blcham Please have a look at the current implementation of the #228 . There are also two test fails with merged XLS and XLSX files. The reason is that current implementation also produces statements for the merged cells, like here:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" ;
        <http://onto.fel.cvut.cz/data/bb>
                "" ;
        <http://onto.fel.cvut.cz/data/cc>
                "" .

while the expected output is:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" .

This can be fixed with a control if(cell.getCellType() == CellType.BLANK)continue; But, it will ignore not only merged cells, but also cells that are empty, but filled with color, or contains table border for example. If it's not acceptable, then I believe it will require to store the information about every cell if it's merged or not, which will require more memory. Please share your opinion about this.

@rodionnv : 1) I believe the expected output now is correct 2) i suggest removing bb from the example so we have also an example of an empty cell ... could you run it on main branch to find out how it is serialized to csv?

rodionnv commented 6 months ago

@blcham Please have a look at the current implementation of the #228 . There are also two test fails with merged XLS and XLSX files. The reason is that current implementation also produces statements for the merged cells, like here:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" ;
        <http://onto.fel.cvut.cz/data/bb>
                "" ;
        <http://onto.fel.cvut.cz/data/cc>
                "" .

while the expected output is:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" .

This can be fixed with a control if(cell.getCellType() == CellType.BLANK)continue; But, it will ignore not only merged cells, but also cells that are empty, but filled with color, or contains table border for example. If it's not acceptable, then I believe it will require to store the information about every cell if it's merged or not, which will require more memory. Please share your opinion about this.

@rodionnv :

  1. I believe the expected output now is correct
  2. i suggest removing bb from the example so we have also an example of an empty cell ... could you run it on main branch to find out how it is serialized to csv?

@blcham In main it seems that it's impossible to have empty cells. I have changed input.xls so it looks like this image And got this error: image

Now, in the current implementation in this PR, empty cells are ignored in the same way both in excel and html files, like here: image Output:

<http://test-file#row-3>
        <http://onto.fel.cvut.cz/data/aa>
                "merged rows" ;
        <http://onto.fel.cvut.cz/data/cc>
                "ee" .
blcham commented 6 months ago

Now, in the current implementation in this PR, empty cells are ignored in the same way both in excel and html files, like here:

Yes, it makes sense like that, and for the first non-header row, we should generate:

<http://test-file#row-2>
        <http://onto.fel.cvut.cz/data/aa>
                "merged columns" .
blcham commented 6 months ago

There is only way to continue on this. Make small PRs that can be merged immediately without breaking existing implementation. It should be easy to review (at most 15 mins for me).