uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

Wrong result for FixedWidthParser #511

Open pajusin opened 2 years ago

pajusin commented 2 years ago

FixedWidthParser returns the wrong result if the parsed row is shorter than the range configured in the annotation (from, to). See the unit test below:

    public static class LINE {
        public LINE() {
        }

        @Parsed
        @FixedWidth(from = 5, to = 10)
        String row;
    }

    @Test
    public void testFixedWidthAnnotation2() throws Exception {
        BeanListProcessor<LINE> rowProcessor = new BeanListProcessor<LINE>(LINE.class);
        FixedWidthParserSettings parserSettings = new FixedWidthParserSettings();
        parserSettings.setProcessor(rowProcessor);
        FixedWidthParser parser = new FixedWidthParser(parserSettings);

        parser.parse(new StringReader("     12123"));
        List<LINE> beans = rowProcessor.getBeans();
        assertEquals(beans.get(0).row, "12123"); // this is OK

        parser.parse(new StringReader(" 1"));
        beans = rowProcessor.getBeans();
        assertEquals(beans.get(0).row, ""); // wrong: returns "1", but should return "" or null, because positions 5 to 10 do not exist in the source row
    }
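
The behaviour the test expects can be written down as a small plain-Java sketch. It does not use univocity-parsers and is not the library's implementation; the class `FixedWidthSlice` and the method `slice` are hypothetical names used only for illustration. It assumes `from` is inclusive and `to` is exclusive, which matches the first assertion above (a 10-character row with `from = 5, to = 10` yields the last five characters).

```java
public class FixedWidthSlice {

    /**
     * Returns the text between positions from (inclusive) and to (exclusive),
     * or null when the row ends before the field starts.
     * Hypothetical helper, not part of univocity-parsers.
     */
    public static String slice(String row, int from, int to) {
        if (row == null || row.length() <= from) {
            return null; // the field lies entirely beyond the end of the row
        }
        int end = Math.min(to, row.length()); // tolerate a truncated field
        return row.substring(from, end);
    }

    public static void main(String[] args) {
        System.out.println(slice("     12123", 5, 10)); // prints 12123
        System.out.println(slice(" 1", 5, 10));         // prints null
    }
}
```

Under this reading, a row that ends before position 5 never contributes a character to the field, which is the result the second assertion asks for.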

mjawadbutt commented 2 years ago

I faced a similar issue. To summarize, two conditions must both be true to reproduce the error:

  1. In fixed-width parsing, there is a gap between the last and the second-to-last field definitions, e.g.:

         // ...
         fixedWidthFields.addField("Serial no", 0, 6);
         fixedWidthFields.addField("Costing Date", 10, 16); //-- second last field .. DDMMYY
         fixedWidthFields.addField("Labor Cost Code", 20, 30); //-- last field .. Alphanumeric,10

     (So there is a gap between the second-to-last and the last field, from position 16 to position 19.)

  2. The last field in the data contains no more characters than the gap is wide.

In this case, the value is assigned to the actual last column (i.e. the Labor Cost Code field) rather than being treated as part of the gap and ignored.

For example, if the row is: SNO___COSTIN_ABCD

then after parsing, the field values will be: SNO_ COSTIN ABCD

whereas they should be: SNO_ COSTIN null

As long as the last field contains no more characters than the gap is wide, this error will manifest. As soon as it contains more characters than the gap, the result becomes correct.

MY WORKAROUND for this was to define the gap field explicitly and ignore it in the code.
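
The idea behind the workaround can be sketched in plain Java, independent of univocity-parsers: declare every gap as a field of its own so the layout has no holes, then discard the gap values when reading the result. Everything below is an assumption made for illustration: the class `ExplicitGapLayout`, the `Field` type, the gap names, the position of the first gap (6 to 10) and the sample row are not from the report, and the code does not reproduce the library's internals.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExplicitGapLayout {

    /** A field covering positions [from, to); gap fields are read but discarded. */
    static final class Field {
        final String name;
        final int from;
        final int to;
        final boolean gap;

        Field(String name, int from, int to, boolean gap) {
            this.name = name;
            this.from = from;
            this.to = to;
            this.gap = gap;
        }
    }

    // Contiguous layout: the gaps from the report are declared explicitly.
    static final List<Field> LAYOUT = List.of(
            new Field("Serial no", 0, 6, false),
            new Field("gap 1", 6, 10, true),
            new Field("Costing Date", 10, 16, false),
            new Field("gap 2", 16, 20, true),
            new Field("Labor Cost Code", 20, 30, false));

    /** Slices the row by position and keeps only the non-gap fields. */
    public static Map<String, String> parse(String row) {
        Map<String, String> values = new LinkedHashMap<>();
        for (Field f : LAYOUT) {
            String value = row.length() <= f.from
                    ? null // the row ends before this field starts
                    : row.substring(f.from, Math.min(f.to, row.length()));
            if (!f.gap) {
                values.put(f.name, value); // gap values are dropped here
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical 20-character row: "ABCD" sits in the second gap.
        System.out.println(parse("SN0001    010224ABCD"));
    }
}
```

With the hypothetical row above, the trailing "ABCD" falls into the second gap and is discarded, so Labor Cost Code comes out as null instead of picking up the gap's characters.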