mattcasters / pentaho-pdi-dataset

Set of PDI plugins to more easily work with data sets. We also want to provide unit testing capabilities through input data sets and golden data sets.
Apache License 2.0

csv input data - sorting impossible with not used columns in mapping #50

Open peterborkuti opened 5 years ago

peterborkuti commented 5 years ago

Dear Matt,

When I use a CSV file as input for a unit test, and the file contains two columns (for example "id" and "a"), but I use only one of them in the mapping (for example "a") and choose the other ("id") for sorting, an exception occurs:

2019/02/28 15:07:40 - Spoon - Caused by: org.pentaho.di.core.exception.KettleException: 
2019/02/28 15:07:40 - Spoon - Unable to get all rows for database data set 'addnumbers as text'
2019/02/28 15:07:40 - Spoon - -1
2019/02/28 15:07:40 - Spoon - 
2019/02/28 15:07:40 - Spoon -   at org.pentaho.di.dataset.DataSetCsvGroup.getAllRows(DataSetCsvGroup.java:226)
2019/02/28 15:07:40 - Spoon -   at org.pentaho.di.dataset.DataSetGroup.getAllRows(DataSetGroup.java:133)
2019/02/28 15:07:40 - Spoon -   at org.pentaho.di.dataset.DataSet.getAllRows(DataSet.java:140)
2019/02/28 15:07:40 - Spoon -   at org.pentaho.di.dataset.spoon.xtpoint.InjectDataSetIntoTransExtensionPoint.injectDataSetIntoStep(InjectDataSetIntoTransExtensionPoint.java:198)
2019/02/28 15:07:40 - Spoon -   at org.pentaho.di.dataset.spoon.xtpoint.InjectDataSetIntoTransExtensionPoint.callExtensionPoint(InjectDataSetIntoTransExtensionPoint.java:126)
2019/02/28 15:07:40 - Spoon -   ... 8 more
2019/02/28 15:07:40 - Spoon - Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
2019/02/28 15:07:40 - Spoon -   at org.pentaho.di.core.row.RowMeta.compare(RowMeta.java:915)
2019/02/28 15:07:40 - Spoon -   at org.pentaho.di.dataset.DataSetCsvGroup$1.compare(DataSetCsvGroup.java:214)
2019/02/28 15:07:40 - Spoon -   at org.pentaho.di.dataset.DataSetCsvGroup$1.compare(DataSetCsvGroup.java:211)
2019/02/28 15:07:40 - Spoon -   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
2019/02/28 15:07:40 - Spoon -   at java.util.TimSort.sort(TimSort.java:220)
2019/02/28 15:07:40 - Spoon -   at java.util.Arrays.sort(Arrays.java:1512)
2019/02/28 15:07:40 - Spoon -   at java.util.ArrayList.sort(ArrayList.java:1462)
2019/02/28 15:07:40 - Spoon -   at java.util.Collections.sort(Collections.java:175)
2019/02/28 15:07:40 - Spoon -   at org.pentaho.di.dataset.DataSetCsvGroup.getAllRows(DataSetCsvGroup.java:211)
2019/02/28 15:07:40 - Spoon -   ... 12 more

I debugged it, and I think this is the spot in the code (DataSetCsvGroup.java, from line 200):

      // Which fields are we sorting on (if any)
      //
      int[] sortIndexes = new int[ sortFields.size() ];
      for ( int i = 0; i < sortIndexes.length; i++ ) {
        sortIndexes[ i ] = outputRowMeta.indexOfValue( sortFields.get( i ) );
      }

      if ( !sortFields.isEmpty() ) {

        // Sort the rows...
        //
        Collections.sort( rows, new Comparator<Object[]>() {
          @Override public int compare( Object[] o1, Object[] o2 ) {
            try {
              return outputRowMeta.compare( o1, o2, sortIndexes );
            } catch ( KettleValueException e ) {
              throw new RuntimeException( "Unable to compare 2 rows", e );
            }
          }
        } );
      }

sortIndexes will not be empty, but sortIndexes[0] will be -1 because "id" is not part of outputRowMeta, and this causes an ArrayIndexOutOfBoundsException in outputRowMeta.compare.
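
The failure mode can be illustrated without Kettle. The following is a minimal sketch, not the plugin's code: `SortIndexLookup`, `indexOfField`, and `resolveSortIndexes` are hypothetical names, and a plain `List<String>` stands in for outputRowMeta. It resolves the sort fields to indexes and fails early with a descriptive message when a sort field is missing from the mapped fields, instead of letting -1 reach the comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the index lookup quoted above; not the plugin's code.
class SortIndexLookup {

  // Returns the position of the field, or -1 when the field is not mapped.
  static int indexOfField(List<String> outputFields, String name) {
    return outputFields.indexOf(name);
  }

  // Resolves every sort field to its index and rejects fields that are not mapped.
  static int[] resolveSortIndexes(List<String> outputFields, List<String> sortFields) {
    int[] sortIndexes = new int[sortFields.size()];
    List<String> missing = new ArrayList<>();
    for (int i = 0; i < sortIndexes.length; i++) {
      sortIndexes[i] = indexOfField(outputFields, sortFields.get(i));
      if (sortIndexes[i] < 0) {
        missing.add(sortFields.get(i));
      }
    }
    if (!missing.isEmpty()) {
      throw new IllegalArgumentException(
          "Sort fields not present in the mapping: " + missing);
    }
    return sortIndexes;
  }

  public static void main(String[] args) {
    List<String> mapped = Arrays.asList("a", "b");
    // A mapped sort field resolves to a valid index.
    System.out.println(Arrays.toString(resolveSortIndexes(mapped, Arrays.asList("a"))));
    // An unmapped sort field is reported by name instead of surfacing as -1 later.
    try {
      resolveSortIndexes(mapped, Arrays.asList("id"));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

A check like this only improves the error message; it does not make sorting on an unmapped column possible.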

You may ask why I want to sort the CSV file based on a field that is not in the mapping, but it seemed like a normal use case to me. For example, I wanted to test a transformation that adds two numbers together:

id a b c
1 0 0 0
2 1 0 1

The input mapping would be the columns "a" and "b", sorted by "id". The golden mapping would be the columns "a", "b" and "c", sorted by "id".

I put all the files to reproduce this here: https://github.com/peterborkuti/pentaho-pdi-dataset-bug-01

Thank you for your wonderful plugin.

Péter

mattcasters commented 5 years ago

Hi Péter,

Thank you very much for the use case. It's true that I hadn't considered it yet. I think we'll need to do something new here, such as adding the sort columns temporarily, sorting, and then removing them again, to make sure those columns don't end up in the test transformation.

Cheers, Matt
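
One way to read that idea is "sort on the full row, then project". The sketch below is only an illustration under stated assumptions, not the plugin's implementation: `SortThenProject` and its methods are hypothetical names, rows are plain `Object[]` arrays, field names are a `List<String>`, and the sort values are assumed to be non-null and mutually comparable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of "keep the sort columns until after sorting"; not the plugin's code.
class SortThenProject {

  // Looks up each name in allFields and rejects names that are not in the data set.
  static int[] indexesOf(List<String> allFields, List<String> names) {
    int[] indexes = new int[names.size()];
    for (int i = 0; i < indexes.length; i++) {
      indexes[i] = allFields.indexOf(names.get(i));
      if (indexes[i] < 0) {
        throw new IllegalArgumentException("Unknown field: " + names.get(i));
      }
    }
    return indexes;
  }

  // Compares two full rows on the given column indexes (values assumed non-null).
  @SuppressWarnings("unchecked")
  static Comparator<Object[]> comparatorFor(int[] sortIndexes) {
    return (r1, r2) -> {
      for (int index : sortIndexes) {
        int cmp = ((Comparable<Object>) r1[index]).compareTo(r2[index]);
        if (cmp != 0) {
          return cmp;
        }
      }
      return 0;
    };
  }

  // Sorts the full rows first, then keeps only the mapped columns.
  static List<Object[]> sortThenProject(List<String> allFields, List<Object[]> rows,
      List<String> sortFields, List<String> mappedFields) {
    int[] sortIndexes = indexesOf(allFields, sortFields);
    int[] keepIndexes = indexesOf(allFields, mappedFields);

    List<Object[]> sorted = new ArrayList<>(rows);
    sorted.sort(comparatorFor(sortIndexes));

    List<Object[]> projected = new ArrayList<>();
    for (Object[] row : sorted) {
      Object[] out = new Object[keepIndexes.length];
      for (int i = 0; i < keepIndexes.length; i++) {
        out[i] = row[keepIndexes[i]];
      }
      projected.add(out);
    }
    return projected;
  }

  public static void main(String[] args) {
    List<String> fields = Arrays.asList("id", "a", "b", "c");
    List<Object[]> rows = Arrays.asList(
        new Object[] { 2, 1, 0, 1 },
        new Object[] { 1, 0, 0, 0 });
    // Sort by "id" but hand only "a" and "b" to the transformation under test.
    for (Object[] row : sortThenProject(fields, rows,
        Arrays.asList("id"), Arrays.asList("a", "b"))) {
      System.out.println(Arrays.toString(row));
    }
  }
}
```

Because the projection happens after the sort, the "id" column orders the rows but never reaches the transformation under test.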

JenniferJohnson89 commented 4 years ago

I noticed that there is a similar problem at https://github.com/mattcasters/pentaho-pdi-dataset. Perhaps we can refer to this issue to find more context about the bug.