mattcasters / pentaho-pdi-dataset

Set of PDI plugins to more easily work with data sets. We also want to provide unit testing capabilities through input data sets and golden data sets.
Apache License 2.0

Linefeeds in a single field will trick the CSV being imported for the dataset. #40

Closed: usbrandon closed this issue 5 years ago

usbrandon commented 5 years ago

Plugin Version 3.4.2

In the ktr attached to the JIRA below, there is a Data Grid step where one input row has carriage returns within one of its columns. This appears to confuse the CSV reader used for the unit test, which gives up and refuses to import the data. Removing the offending row works around the issue.

https://jira.pentaho.com/browse/PDI-17034

Console output of the CSV file used for input.

cod_unico_old,id_unico_old,id_cliente_fn,id_cliente_fc,ID_UNICO,COD_UNICO
2525114547,N/A335,335,N/A,N/A335,507F4FFD5322035F7C7A0466FD66
2525114547,N/A1035,1035,N/A,N/A1035,355EFD4DFD37CAFD7AFDFD481A27FD
2525114547,N/A1062,1062,N/A,N/A1062,FD37FD2DFD0F74603EFD5E3CFDFDEE
2525114547,N/A1227,1227,N/A,N/A1227,0C1FFDFDFDFD473DFD206C27FDFDFD1C
2525114547,N/A1560
N/A1634
N/A5074
N/A5023,1560,N/A,N/A1560,FDFD13EEFD2FFD3F5EFD1F265658FD
2525114547,N/A1634,1634,N/A,N/A1634,7922FD100C776D762D45FD57132EFD
2525114547,N/A5074,5074,N/A,N/A5074,2D28FD6C5979FD02FD73FD4B4D0FFD5C
2525114547,N/A5023,5023,N/A,N/A5023,FD6946A15BFD61FDFD373926FDFD09
2525114547,N/A5030,5030,N/A,N/A5030,FDFDFD5651FD570EFDFD7DFD48FDFD09

mattcasters commented 5 years ago

We should try with double quotes around the offending data. There is no formal CSV file standard, but if there were one, the Commons CSV code would be very close to it.
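
To illustrate why double quotes matter here, the following is a minimal sketch using Python's standard `csv` module as a stand-in for Commons CSV (which the plugin itself uses). The rows are hypothetical and only modelled on the sample in this issue. The sketch assumes the common convention that a quoted field may span several physical lines.

```python
import csv
import io

# Hypothetical rows modelled on the sample in this issue; the second field
# of the last row contains embedded line feeds.
rows = [
    ["2525114547", "N/A335", "335", "N/A"],
    ["2525114547", "N/A1560\nN/A1634\nN/A5074\nN/A5023", "1560", "N/A"],
]

# The default QUOTE_MINIMAL mode quotes only the fields that need it,
# which includes any field containing the line terminator.
buffer = io.StringIO(newline="")
csv.writer(buffer, lineterminator="\n").writerows(rows)
quoted_text = buffer.getvalue()

# Reading the quoted text back yields the original two records.
round_tripped = list(csv.reader(io.StringIO(quoted_text, newline="")))

# The same data written without quotes falls apart into five short records.
unquoted_text = "\n".join(",".join(fields) for fields in rows) + "\n"
broken = list(csv.reader(io.StringIO(unquoted_text, newline="")))

print(len(round_tripped), [len(record) for record in broken])
```

With quoting, the reader sees two records of four fields each; without it, the multi-line field is split into separate records with too few fields.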

mattcasters commented 5 years ago

It works with the latest patch; give it a shot.

2525114547,N/A335,335,N/A
2525114547,N/A1035,1035,N/A
2525114547,N/A1062,1062,N/A
2525114547,N/A1227,1227,N/A
2525114547,"N/A1560
N/A1634
N/A5074
N/A5023",1560,N/A
2525114547,N/A1634,1634,N/A
2525114547,N/A5074,5074,N/A
2525114547,N/A5023,5023,N/A
2525114547,N/A5030,5030,N/A
2525114547,N/A5037,5037,N/A
2525114547,N/A5043,5043,N/A
2525114547,N/A5045,5045,N/A
2525114547,N/A5047,5047,N/A
2525114547,N/A5013,5013,N/A
2525114547,N/A5014,5014,N/A
2525114547,N/A5015,5015,N/A
2525114547,N/A5017,5017,N/A

usbrandon commented 5 years ago

Not fixed yet. Please try with this CSV. I had to rename it to .txt to make GitHub happy. PDI-17034_checksum_case-Input.txt

I get an exception with the message "2" (an ArrayIndexOutOfBoundsException, per the trace below) when trying to view or run this dataset.

org.pentaho.di.core.exception.KettleException: Unable to get all rows for CSV data set 'PDI-17034_checksum_case-input' 2

at org.pentaho.di.dataset.DataSetCsvGroup.getAllRows(DataSetCsvGroup.java:131)
at org.pentaho.di.dataset.DataSetGroup.getAllRows(DataSetGroup.java:113)
at org.pentaho.di.dataset.DataSet.getAllRows(DataSet.java:144)
at org.pentaho.di.dataset.spoon.dialog.DataSetDialog.viewData(DataSetDialog.java:566)
at org.pentaho.di.dataset.spoon.dialog.DataSetDialog$7.handleEvent(DataSetDialog.java:337)
at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Display.runDeferredEvents(Unknown Source)
at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
at org.pentaho.di.dataset.spoon.dialog.DataSetDialog.open(DataSetDialog.java:369)
at org.pentaho.di.dataset.spoon.DataSetHelper.editDataSet(DataSetHelper.java:376)
at org.pentaho.di.dataset.spoon.DataSetHelper.editDataSet(DataSetHelper.java:365)
at org.pentaho.di.dataset.spoon.xtpoint.ShowUnitTestMenuExtensionPoint.lambda$callExtensionPoint$5(ShowUnitTestMenuExtensionPoint.java:101)
at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Display.runDeferredEvents(Unknown Source)
at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
at org.pentaho.di.ui.spoon.Spoon.readAndDispatch(Spoon.java:1381)
at org.pentaho.di.ui.spoon.Spoon.waitForDispose(Spoon.java:7817)
at org.pentaho.di.ui.spoon.Spoon.start(Spoon.java:9179)
at org.pentaho.di.ui.spoon.Spoon.main(Spoon.java:707)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.pentaho.commons.launcher.Launcher.main(Launcher.java:92)

Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.commons.csv.CSVRecord.get(CSVRecord.java:79)
at org.pentaho.di.dataset.DataSetCsvGroup.getAllRows(DataSetCsvGroup.java:121)
... 25 more
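
The trace is consistent with column index 2 being read from a record that holds fewer than three values, which is what an unquoted line feed inside a field would produce. As a hedged illustration, the sketch below uses Python's standard `csv` module rather than Commons CSV and reports records that are narrower than the header before anything indexes into them. The function name `find_short_records` and the sample text are hypothetical, not part of the plugin.

```python
import csv
import io


def find_short_records(text, expected_width):
    """Return (record_number, field_count) for records narrower than expected.

    record_number is 1-based and counts parsed records, not physical lines.
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    return [
        (number, len(record))
        for number, record in enumerate(reader, start=1)
        if len(record) < expected_width
    ]


# Hypothetical sample: a header followed by one row whose second field
# contains an unquoted line feed, so it is parsed as two short records.
sample = (
    "cod_unico_old,id_unico_old,id_cliente_fn\n"
    "2525114547,N/A1560\n"
    "N/A1634,1560\n"
)
problems = find_short_records(sample, expected_width=3)
print(problems)
```

A check like this points at the offending records directly, instead of failing later with an index error on the first short row.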

mattcasters commented 5 years ago

Sorry, Brandon, why would you think that simply copying any CSV file into the datasets would work? Please only create datasets with the plugin. Alternatively, apply proper quoting in the file.