uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

Strange maxCharsPerColumn behavior #113

Closed: mumrah closed this issue 7 years ago

mumrah commented 7 years ago

I haven't narrowed this down yet, but as of 2.2.1 I'm seeing strange behavior when using maxCharsPerColumn.

Here is a unit test that isolates the problem (along with one of our test files):

public void testUnivocity() throws Exception {
    String csv = "# Real world CSV taken from http://www.ferc.gov/docs-filing/eqr/soft-tools/sample-csv.asp\n" +
        "contract_id,seller_company_name,customer_company_name,customer_duns_number,contract_affiliate,FERC_tariff_reference,contract_service_agreement_id,contract_execution_date,contract_commencement_date,contract_termination_date,actual_termination_date,extension_provision_description,class_name,term_name,increment_name,increment_peaking_name,product_type_name,product_name,quantity,units_for_contract,rate,rate_minimum,rate_maximum,rate_description,units_for_rate,point_of_receipt_control_area,point_of_receipt_specific_location,point_of_delivery_control_area,point_of_delivery_specific_location,begin_date,end_date,time_zone\n" +
        "C71,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Original Volume No. 10,2,2/15/2001,2/15/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,,,,,,ES\n" +
        "C72,The Electric Company,Utility A,38495837,n,FERC Electric Tariff Original Volume No. 10,15,7/25/2001,8/1/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,,,,,,ES\n" +
        "C73,The Electric Company,Utility B,493758794,N,FERC Electric Tariff Original Volume No. 10,7,6/8/2001,7/6/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,, , ,,,ep\n" +
        "C74,The Electric Company,Utility C,594739573,n,FERC Electric Tariff Original Volume No. 10,25,6/8/2001,7/6/2001,,,Evergreen,N/A,N/A,N/A,N/A,MB,ENERGY,0,, , , ,Market Based,,,, , ,,,ep\n" +
        "C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,ENERGY,2000,KWh,.1475, , ,Max amount of capacity and energy to be transmitted.  Bill based on monthly max delivery to City.,$/KWh,PJM,Point A,PJM,Point B,,,ep\n" +
        "C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,point-to-point agreement,2000,KW,0.01, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep\n" +
        "C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,network,2000,KW,0.2, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep\n" +
        "C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,BLACK START SERVICE,2000,KW,0.22, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep\n" +
        "C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,CAPACITY,2000,KW,0.04, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep\n" +
        "C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,regulation & frequency response,2000,KW,0.1, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep\n" +
        "C75,The Electric Company,The Power Company,456543333,N,FERC Electric Tariff Third Revised Volume No. 7,94,2/13/2001,7/1/2001,12/31/2006,,None,F,LT,M,P,T,real power transmission loss,2000,KW,7, , ,,$/kw-mo,PJM,Point A,PJM,Point B,,,ep\n" +
        "C76,The Electric Company,The Power Company,456534333,N,FERC Electric Tariff Original Volume No. 10,132,12/15/2001,1/1/2002,12/31/2004,12/31/2004,None,F,LT,M,FP,MB,CAPACITY,70,MW,3750, , ,70MW for each and every hour over the term of the agreement (7x24 schedule).,$/MW,,,,,,,ep\n" +
        "C78,The Electric Company,\"The Electric Marketing Co., LLC\",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,35, , ,,$/MWH,,,PJM,Bus 4321,20020101,20030101,EP\n" +
        "C78,The Electric Company,\"The Electric Marketing Co., LLC\",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,37, , ,,$/MWH,,,PJM,Bus 4321,20030101,20040101,EP\n" +
        "C78,The Electric Company,\"The Electric Marketing Co., LLC\",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,39, , ,,$/MWH,,,PJM,Bus 4321,20040101,20050101,EP\n" +
        "C78,The Electric Company,\"The Electric Marketing Co., LLC\",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,41, , ,,$/MWH,,,PJM,Bus 4321,20050101,20060101,EP\n" +
        "C78,The Electric Company,\"The Electric Marketing Co., LLC\",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,43, , ,,$/MWH,,,PJM,Bus 4321,20060101,20070101,EP\n" +
        "C78,The Electric Company,\"The Electric Marketing Co., LLC\",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,45, , ,,$/MWH,,,PJM,Bus 4321,20070101,20080101,EP\n" +
        "C78,The Electric Company,\"The Electric Marketing Co., LLC\",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,47, , ,,$/MWH,,,PJM,Bus 4321,20080101,20090101,EP\n" +
        "C78,The Electric Company,\"The Electric Marketing Co., LLC\",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,49, , ,,$/MWH,,,PJM,Bus 4321,20090101,20100101,EP\n" +
        "C78,The Electric Company,\"The Electric Marketing Co., LLC\",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,51, , ,,$/MWH,,,PJM,Bus 4321,20100101,20110101,EP\n" +
        "C78,The Electric Company,\"The Electric Marketing Co., LLC\",23456789,Y,FERC Electric Tariff Original Volume No. 2,Service Agreement 1,1/2/1992,1/2/1992,1/1/2012,,Renewable annually by mutual agreement after termination date.,UP,LT,Y,FP,CB,ENERGY,0,MWH,53, , ,,$/MWH,,,PJM,Bus 4321,20110101,20120101,EP\n";
    CsvParserSettings settings = new CsvParserSettings();
    settings.setMaxCharsPerColumn(100);
    // Mismatched delimiter (simulates a misconfigured parser): the input
    // contains no tabs, so each line is parsed as a single column.
    settings.getFormat().setDelimiter('\t');
    com.univocity.parsers.csv.CsvParser parser = new com.univocity.parsers.csv.CsvParser(settings);
    parser.beginParsing(new StringReader(csv));
    String[] row;
    while ((row = parser.parseNext()) != null) {
        long lineNumber = parser.getContext().currentLine();
        String line = parser.getContext().currentParsedContent();
        // Print the current line number and the length of the content parsed for this record.
        System.err.println(lineNumber + ": " + line.length());
        //System.err.println(Arrays.asList(row));
    }
    parser.stopParsing();
}

This outputs:

2: 622
3: 191
4: 182
5: 183
6: 184
7: 312
8: 233
9: 215
10: 228
11: 217
12: 239
13: 234
14: 278
15: 301
16: 301
17: 301
18: 301
19: 301
20: 301
21: 301
22: 301
23: 301

com.univocity.parsers.common.TextParsingException: Length of parsed input (101) exceeds the maximum number of characters defined in your parser settings (100). 
Hint: Number of characters processed may have exceeded limit of 100 characters per column. Use settings.setMaxCharsPerColumn(int) to define the maximum number of characters a column can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:
    Auto configuration enabled=true
    Autodetect column delimiter=false
    Autodetect quotes=false
    Column reordering enabled=true
    Empty value=null
    Escape unquoted values=false
    Header extraction enabled=null
    Headers=null
    Ignore leading whitespaces=true
    Ignore trailing whitespaces=true
    Input buffer size=1048576
    Input reading on separate thread=true
    Keep escape sequences=false
    Keep quotes=false
    Length of content displayed on error=-1
    Line separator detection enabled=false
    Maximum number of characters per column=100
    Maximum number of columns=512
    Normalize escaped line separators=true
    Null value=null
    Number of records to read=all
    Processor=none
    Restricting data in exceptions=false
    RowProcessor error handler=null
    Selected fields=none
    Skip empty lines=true
    Unescaped quote handling=nullFormat configuration:
    CsvFormat:
        Comment character=#
        Field delimiter=\t
        Line separator (normalized)=\n
        Line separator sequence=\n
        Quote character="
        Quote escape character="
        Quote escape escape character=null
Internal state when error was thrown: line=23, column=0, record=22, charIndex=6218, content parsed=C78,The Electric Company,"The Electric Marketing Co., LLC",23456789,Y,FERC Electric Tariff Original 

    at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
    at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
    at com.lucidworks.apollo.pipeline.parse.impl.text.CsvParserTest.testUnivocity(CsvParserTest.java:453)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
    at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
    at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
    at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
    at org.testng.TestRunner.privateRun(TestRunner.java:767)
    at org.testng.TestRunner.run(TestRunner.java:617)
    at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
    at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:329)
    at org.testng.SuiteRunner.privateRun(SuiteRunner.java:291)
    at org.testng.SuiteRunner.run(SuiteRunner.java:240)
    at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
    at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
    at org.testng.TestNG.runSuitesSequentially(TestNG.java:1224)
    at org.testng.TestNG.runSuitesLocally(TestNG.java:1149)
    at org.testng.TestNG.run(TestNG.java:1057)
    at org.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:72)
    at org.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:124)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 100
    at com.univocity.parsers.common.input.DefaultCharAppender.appendUntil(DefaultCharAppender.java:216)
    at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:136)
    at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:444)
    ... 29 more

I would expect it to fail on the first line.
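
That expectation can be checked without the parser. With the delimiter set to '\t' and no tab characters in the input, every line is a single column, so the first record longer than 100 characters should already trip the limit (the header on line 2, reported as 622 characters in the output above). A minimal sketch using only the JDK follows; `ColumnLimitCheck` and `firstOffendingLine` are hypothetical names, not part of univocity-parsers, and quoting is ignored as a simplifying assumption:

```java
public class ColumnLimitCheck {

    /**
     * Returns the 1-based number of the first line that contains a column
     * longer than maxChars, or -1 if no line does. Empty lines and lines
     * starting with the comment character are skipped.
     * Assumption: quotes are not interpreted; columns are split on the
     * delimiter only.
     */
    static int firstOffendingLine(String input, char delimiter, char comment, int maxChars) {
        String[] lines = input.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty() || line.charAt(0) == comment) {
                continue;
            }
            int columnLength = 0;
            for (int j = 0; j < line.length(); j++) {
                if (line.charAt(j) == delimiter) {
                    columnLength = 0; // a new column starts after the delimiter
                } else if (++columnLength > maxChars) {
                    return i + 1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Build a 150-character line made of 9-character columns separated by commas.
        StringBuilder wide = new StringBuilder();
        for (int i = 0; i < 150; i++) {
            wide.append(i % 10 == 9 ? ',' : 'x');
        }
        String input = "# comment\n" + wide + "\nshort,row\n";

        System.out.println(firstOffendingLine(input, '\t', '#', 100)); // prints 2
        System.out.println(firstOffendingLine(input, ',', '#', 100));  // prints -1
    }
}
```

With the matching delimiter no column comes near the limit; with the mismatched one the very first record exceeds it, which is why a failure on the first line is the expected behavior.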

I traced through the univocity classes a bit and noticed that AbstractCharInputReader creates a 4096-char ExpandingCharAppender ("tmp"). I don't know enough about how things work in there, but it seems possible that data is being read from a buffer that doesn't have the 100-character limit set in my test.

For my purposes, I'm mostly interested in preventing OOM when a user misconfigures the parser. Since it does seem to eventually use the correct reader and fail properly, I'll just update my test to work around this for now.

Thanks!

jbax commented 7 years ago

Fixed. This affects the CSV parser only when processing unquoted values.

The behavior became inconsistent after the latest optimization in version 2.2.1, but you won't get an OutOfMemoryError: in the worst case, the length of each parsed String is limited to the internal buffer size.

Values that exceed the buffer length, or that are only partially stored in the current buffer, are parsed using the original algorithm, and the maximum length restriction is applied.
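
The inconsistency described above can be illustrated with a toy model. This is not univocity's implementation; `FastPathToyModel`, `parse`, and the `enforceOnFastPath` flag are hypothetical, and the "buffer" is simulated by treating the input as fixed-size chunks. The sketch assumes a fast path that copies a value straight out of the current buffer when its delimiter lies in the same buffer, and a slow path that appends character by character into a bounded array:

```java
import java.util.ArrayList;
import java.util.List;

public class FastPathToyModel {

    /**
     * Splits input on delimiter. Values that end inside the current buffer
     * chunk take the fast path; all others take the bounded slow path.
     * enforceOnFastPath = false models the inconsistent behavior,
     * enforceOnFastPath = true models a consistent length check.
     */
    static List<String> parse(String input, char delimiter, int bufferSize,
                              int maxChars, boolean enforceOnFastPath) {
        List<String> values = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // End (exclusive) of the simulated buffer chunk holding the current position.
            int bufferEnd = Math.min(input.length(), (pos / bufferSize + 1) * bufferSize);
            int delim = input.indexOf(delimiter, pos);
            if (delim != -1 && delim < bufferEnd) {
                // Fast path: the whole value sits in the current buffer.
                int length = delim - pos;
                if (enforceOnFastPath && length > maxChars) {
                    throw new IllegalStateException("Length " + length + " exceeds " + maxChars);
                }
                values.add(input.substring(pos, delim));
                pos = delim + 1;
            } else {
                // Slow path: the value crosses a buffer boundary (or ends the input),
                // so it is appended into an array of at most maxChars characters.
                char[] appender = new char[maxChars];
                int n = 0;
                while (pos < input.length() && input.charAt(pos) != delimiter) {
                    if (n == maxChars) {
                        throw new IllegalStateException("Length exceeds " + maxChars);
                    }
                    appender[n++] = input.charAt(pos++);
                }
                values.add(new String(appender, 0, n));
                pos++; // skip the delimiter, or step past the end of the input
            }
        }
        return values;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 150; i++) {
            sb.append('x');
        }
        String input = sb + ",y";

        // Large buffer, no check on the fast path: the 150-char value slips through.
        System.out.println(parse(input, ',', 4096, 100, false).get(0).length()); // prints 150

        // Small buffer: the same value crosses a chunk boundary and hits the limit.
        try {
            parse(input, ',', 64, 100, false);
        } catch (IllegalStateException e) {
            System.out.println("slow path rejected the value: " + e.getMessage());
        }
    }
}
```

In this model the over-long value is bounded by the buffer size even when the limit is not checked, which matches the statement that an OutOfMemoryError cannot result; checking the length on both paths removes the inconsistency.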

I've just released a 2.2.2-SNAPSHOT version to include the fix for this.