uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

Lookahead not working as expected for last line without newline character #453

Open loganathan87 opened 3 years ago

loganathan87 commented 3 years ago

Use case: I'm trying to use lookahead patterns to separate the different datasets we receive in a fixed-width file. For one specific dataset, the lookahead pattern is exactly as long as the entire row, and I see unexpected behaviour in that scenario.

More details on the code, along with the expected and actual outputs, follow.

Code block (can be used to replicate the issue)

Note: I've simplified our use case to show just the core of the issue.

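import com.univocity.parsers.fixed.FixedWidthFields
import com.univocity.parsers.fixed.FixedWidthParser
import com.univocity.parsers.fixed.FixedWidthParserSettings
import java.io.StringReader
import java.util.Arrays

fun testLookAhead(contents: String) {
    val parserSettings = FixedWidthParserSettings()
    parserSettings.format.padding = ' '
    parserSettings.format.setLineSeparator("\n")

    // Both formats share the first seven fields (20 characters in total);
    // createFields adds a trailing 27-character field.
    val deleteFields = FixedWidthFields(1, 4, 3, 2, 4, 4, 2)
    val createFields = FixedWidthFields(1, 4, 3, 2, 4, 4, 2, 27)

    // 20-character patterns: a leading '2', 17 wildcards, then the "01"/"02" marker.
    parserSettings.addFormatForLookahead("2?????????????????01", deleteFields)
    parserSettings.addFormatForLookahead("2?????????????????02", createFields)

    val parser = FixedWidthParser(parserSettings)

    parser.parseAll(StringReader(contents)).forEach { println(Arrays.toString(it)) }
}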

Scenario-1 (expected output): When the contents passed are

20123003020761012301
20123003020769012301
20123002010394012302Some description comes here

I get the expected output:

[2, 0123, 003, 02, 0761, 0123, 01]
[2, 0123, 003, 02, 0769, 0123, 01]
[2, 0123, 002, 01, 0394, 0123, 02, Some description comes here]

Scenario-2 (throws an exception): But when the contents passed are the following, with no newline character after the last line,

20123003020761012301
20123003020769012301

The expected output is:

[2, 0123, 003, 02, 0761, 0123, 01]
[2, 0123, 003, 02, 0769, 0123, 01]

But I get the exception below:

produced error: com.univocity.parsers.common.TextParsingException - Cannot process input with the given configuration. No default field lengths defined and no lookahead/lookbehind value match '2012300302076901230'
Internal state when error was thrown: line=2, column=0, record=1, charIndex=41
Parser Configuration: FixedWidthParserSettings:
    Auto configuration enabled=true
    Auto-closing enabled=true
    Column reordering enabled=true
    Field lengths=<null>
    Header extraction enabled=null
    Headers=null
    Ignore leading whitespaces=true
    Ignore trailing whitespaces=true
    Input buffer size=1048576
    Input reading on separate thread=true
    Length of content displayed on error=-1
    Line separator detection enabled=false
    Lookahead formats={2?????????????????02=
        1   null, length: 1, align: LEFT, padding: , keepPadding: null
        2   null, length: 4, align: LEFT, padding: , keepPadding: null
        3   null, length: 3, align: LEFT, padding: , keepPadding: null
        4   null, length: 2, align: LEFT, padding: , keepPadding: null
        5   null, length: 4, align: LEFT, padding: , keepPadding: null
        6   null, length: 4, align: LEFT, padding: , keepPadding: null
        7   null, length: 2, align: LEFT, padding: , keepPadding: null
        8   null, length: 27, align: LEFT, padding: , keepPadding: null, 2?????????????????01=
        1   null, length: 1, align: LEFT, padding: , keepPadding: null
        2   null, length: 4, align: LEFT, padding: , keepPadding: null
        3   null, length: 3, align: LEFT, padding: , keepPadding: null
        4   null, length: 2, align: LEFT, padding: , keepPadding: null
        5   null, length: 4, align: LEFT, padding: , keepPadding: null
        6   null, length: 4, align: LEFT, padding: , keepPadding: null
        7   null, length: 2, align: LEFT, padding: , keepPadding: null}
    Lookbehind formats={}
    Maximum number of characters per column=4096
    Maximum number of columns=512
    Null value=null
    Number of records to read=all
    Processor=none
    Record ends on new line=false
    Restricting data in exceptions=false
    RowProcessor error handler=null
    Selected fields=none
    Skip bits as whitespace=true
    Skip empty lines=true
    Skip trailing characters until new line=falseFormat configuration:
    FixedWidthFormat:
        Comment character=#
        Line separator (normalized)=\n
        Line separator sequence=\n
        Padding= 
Internal state when error was thrown: line=2, column=0, record=1, charIndex=41
com.univocity.parsers.common.TextParsingException: com.univocity.parsers.common.TextParsingException - Cannot process input with the given configuration. No default field lengths defined and no lookahead/lookbehind value match '2012300302076901230'
    at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
    at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:623)
    at com.univocity.parsers.common.AbstractParser.internalParseAll(AbstractParser.java:552)
    at com.univocity.parsers.common.AbstractParser.parseAll(AbstractParser.java:545)
    at com.univocity.parsers.common.AbstractParser.parseAll(AbstractParser.java:532)
    at com.target.itemlegacysource.service.ProcessingService.testLookAhead(ProcessingService.kt:81)
    at com.target.itemlegacysource.service.ProcessingService.processingCoordinator(ProcessingService.kt:59)
    at com.target.itemlegacysource.listener.ListenerService$receive$1.invokeSuspend(ListenerService.kt:31)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:274)
    at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:84)
    at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
    at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
    at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
    at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
    at com.target.itemlegacysource.listener.ListenerService.receive(ListenerService.kt:30)
    at com.target.itemlegacysource.listener.$ListenerServiceDefinition$$exec2.invokeInternal(Unknown Source)
    at io.micronaut.context.AbstractExecutableMethod.invoke(AbstractExecutableMethod.java:151)
    at io.micronaut.core.bind.DefaultExecutableBinder$1.invoke(DefaultExecutableBinder.java:109)
    at io.micronaut.configuration.kafka.processor.KafkaConsumerProcessor.lambda$process$8(KafkaConsumerProcessor.java:495)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: com.univocity.parsers.common.TextParsingException: Cannot process input with the given configuration. No default field lengths defined and no lookahead/lookbehind value match '2012300302076901230'
Internal state when error was thrown: line=2, column=0, record=1, charIndex=41
    at com.univocity.parsers.fixed.FixedWidthParser.parseRecord(FixedWidthParser.java:189)
    at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:581)
    ... 24 common frames omitted

Scenario-3 (expected output after adding a newline character to the end of the Scenario-2 input):

When the contents passed are

20123003020761012301
20123003020769012301

I get the expected output:

[2, 0123, 003, 02, 0761, 0123, 01]
[2, 0123, 003, 02, 0769, 0123, 01]

I'm not sure whether I'm missing some configuration or this is a bug. Any help would be appreciated.
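
Scenario-3 suggests one possible stopgap: make sure the input ends with the configured line separator before handing it to the parser. The sketch below assumes the separator is "\n", as in the settings above. The helper name ensureTrailingNewline is hypothetical and not part of univocity-parsers. It only normalizes the input string and does not address the underlying parser behaviour.

```kotlin
// Hypothetical helper (not part of univocity-parsers): appends "\n" when the
// input is non-empty and does not already end with it.
fun ensureTrailingNewline(contents: String): String =
    if (contents.isEmpty() || contents.endsWith("\n")) contents else contents + "\n"

fun main() {
    // Scenario-2 input: the last line has no newline character.
    val scenario2 = "20123003020761012301\n20123003020769012301"
    val normalized = ensureTrailingNewline(scenario2)
    println(normalized.endsWith("\n")) // prints "true"
}
```

The normalized string could then be passed to testLookAhead in place of the raw contents, which corresponds to the Scenario-3 input.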

codewithpriya commented 5 months ago

Use the code below:

settings.setRecordEndsOnNewline(true);

It worked for me after hours of struggle.
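
For reference, here is a sketch of how that setting might be applied to the Kotlin function from the original report. It assumes the settings variable in the comment above is the same FixedWidthParserSettings instance that holds the lookahead formats. The function name parseWithRecordEndsOnNewline is made up for illustration. The only evidence that this resolves Scenario-2 is the commenter's report, so no expected output is shown.

```kotlin
import com.univocity.parsers.fixed.FixedWidthFields
import com.univocity.parsers.fixed.FixedWidthParser
import com.univocity.parsers.fixed.FixedWidthParserSettings
import java.io.StringReader

// Illustrative name; mirrors testLookAhead from the report, plus one extra setting.
fun parseWithRecordEndsOnNewline(contents: String) {
    val settings = FixedWidthParserSettings()
    settings.format.padding = ' '
    settings.format.setLineSeparator("\n")

    // The setting suggested in the comment above; the error dump in the
    // report shows it as "Record ends on new line=false".
    settings.setRecordEndsOnNewline(true)

    settings.addFormatForLookahead("2?????????????????01", FixedWidthFields(1, 4, 3, 2, 4, 4, 2))
    settings.addFormatForLookahead("2?????????????????02", FixedWidthFields(1, 4, 3, 2, 4, 4, 2, 27))

    FixedWidthParser(settings)
        .parseAll(StringReader(contents))
        .forEach { println(it.contentToString()) }
}

fun main() {
    // Scenario-2 input: no newline character after the last line.
    parseWithRecordEndsOnNewline("20123003020761012301\n20123003020769012301")
}
```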