
Consistent handling of LineTokenizer formatting errors [BATCH-700] #2877

Closed. spring-projects-issues closed this issue 16 years ago.

spring-projects-issues commented 16 years ago

Lucas Ward opened BATCH-700 and commented

The various LineTokenizer implementations don't consistently handle formatting errors, nor is there any way to determine that an error was related to formatting rather than parsing. Below is an example from DelimitedLineTokenizer:

public void testTooFewNames() {
    tokenizer.setNames(new String[] {"A", "B"});
    try {
        tokenizer.tokenize("a,b,c");
        fail("Expected IllegalArgumentException");
    }
    catch (IllegalArgumentException e) {
        // expected
    }
}

public void testTooManyNames() {
    tokenizer.setNames(new String[] {"A", "B", "C", "D"});
    FieldSet line = tokenizer.tokenize("a,b,c");
    assertEquals(4, line.getFieldCount());
    assertEquals("c", line.readString("C"));
    assertEquals(null, line.readString("D"));
}

If there are fewer names than columns in the data, an IllegalArgumentException is thrown. However, if there are more names than columns, the extra fields are passed through as null, likely causing a parsing error in the FieldSetMapper. FixedLengthLineTokenizer is somewhat similar:

public void testLongerLinesRestIgnored() throws Exception {
    tokenizer.setColumns(new Range[] {new Range(1,10),new Range(11,25),new Range(26,30)});
    // test shorter line as defined by record descriptor
    line = "H1";
    FieldSet tokens = tokenizer.tokenize(line);
    // test longer lines => rest will be ignored
    line = "H1        12345678       1234567890";
    tokens = tokenizer.tokenize(line);
    assertEquals(3, tokens.getFieldCount());
    assertEquals(line.substring(0, 10).trim(), tokens.readString(0));
    assertEquals(line.substring(10, 25).trim(), tokens.readString(1));
    assertEquals(line.substring(25, 30).trim(), tokens.readString(2));      
}

If there is more data in the line than the ranges allow for, the extra data is simply ignored. However, if the input string is shorter than the ranges require, empty strings are returned for the fields beyond the end of the input:

public void testTokenizeSmallerStringThanRanges() {
    tokenizer.setColumns(new Range[] {new Range(1,5),new Range(6,10),new Range(11,15)});
    FieldSet tokens = tokenizer.tokenize("12345");
    assertEquals(3, tokens.getFieldCount());
    assertEquals("12345", tokens.readString(0));
    assertEquals("", tokens.readString(1));
    assertEquals("", tokens.readString(2));
}

An empty string, however, will yield a FieldSet with 0 fields rather than 3 empty fields:

public void testTokenizeEmptyString() {
    tokenizer.setColumns(new Range[] {new Range(1,5),new Range(6,10),new Range(11,15)});
    FieldSet tokens = tokenizer.tokenize("");
    assertEquals(0, tokens.getFieldCount());
}

The reason I began investigating how bad formatting is handled came up due to a use case at a client. There's a workflow for a batch job that separates records that are bad because they were formatted incorrectly from records that failed with a parsing error. However, because of the behavior above, I have no reliable way to determine why a line failed.

For example, if I expect there to be 3 columns in a delimited file (by setting the names) and there are only two, an IllegalArgumentException is thrown and wrapped in a FlatFileParseException, and the FieldSetMapper is never called. However, if there are more columns than I expected (indicating an extra comma), then I get back a FieldSet but need to manually check for any extra columns to ensure that the line was formatted correctly; from that check I can throw an exception, which gets wrapped in a FlatFileParseException (which I need in order to get the original line) that I can then log out. The only way I can get this to work is by not setting the names and manually checking the size. Then, in my listener, I can check the cause of the FlatFileParseException to see whether it is a FormatException or a NumberFormatException (caused by trying to parse an alpha value into a BigDecimal); a sketch of this appears after the list below.

None of it really behaved the way I expected it to. I expected that if I configured a tokenizer to have a certain number of columns (whether by setting ranges or names), the tokenizer would throw an exception when it encountered a record with too few or too many columns. Of course, if it did, I suppose I would be in bigger trouble, since there would be no way to determine exactly why a FlatFileParseException was thrown. In my mind there are three separate types of errors:

  1. Format - the line wasn't in the correct format: it was missing a comma, had more data than expected, etc.
  2. Parse - the data within a column couldn't be parsed to the expected type.
  3. Validation - the data was correctly formatted and parsed but doesn't fall within accepted ranges for business reasons.
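
Returning to the workaround above: here is a rough sketch of the cause inspection I described. The LineErrorClassifier class and ErrorType enum are hypothetical illustrations, not Spring Batch API; only FlatFileParseException comes from the framework (package name as in later releases):

import org.springframework.batch.item.file.FlatFileParseException;

// Hypothetical classifier mirroring the three error types listed above.
public class LineErrorClassifier {

    public enum ErrorType { FORMAT, PARSE, UNKNOWN }

    public ErrorType classify(FlatFileParseException e) {
        Throwable cause = e.getCause();
        // NumberFormatException extends IllegalArgumentException, so it must
        // be tested first.
        if (cause instanceof NumberFormatException) {
            // e.g. an alpha value read as a BigDecimal: a parse error.
            return ErrorType.PARSE;
        }
        if (cause instanceof IllegalArgumentException) {
            // Too few tokens for the configured names: a format error.
            return ErrorType.FORMAT;
        }
        return ErrorType.UNKNOWN;
    }
}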

Because of these three distinct error types, I would like LineTokenizers to throw FlatFileFormatException (a new exception class) if any formatting errors are encountered, with FieldSetMapper throwing FlatFileParseException. This would make it possible to correctly determine the cause of an error when reading a file.
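
A minimal sketch of what such an exception might look like; the constructor shape and the getInput() accessor are assumptions on my part, not a committed design:

// Sketch of the proposed exception. Carrying the raw input line means a
// listener no longer needs the wrapping FlatFileParseException just to log it.
public class FlatFileFormatException extends RuntimeException {

    private final String input;

    public FlatFileFormatException(String message, String input) {
        super(message);
        this.input = input;
    }

    /** The raw line that failed to tokenize. */
    public String getInput() {
        return input;
    }
}

A tokenizer configured with names could then do something like:

// Hypothetical check inside a tokenizer once names are set:
if (tokens.size() != names.length) {
    throw new FlatFileFormatException("Expected " + names.length
            + " tokens but found " + tokens.size(), line);
}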

Another approach would be to never throw an exception and leave it to the developer to validate the FieldSet themselves. However, to me this violates the principle of least astonishment: I expect that if I configure a LineTokenizer to tokenize lines with certain properties, it will throw an exception when a line doesn't match them. Although, I suppose there could be a property such as 'enforceFormatErrors' that would determine which approach is taken.
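
To illustrate the toggle idea, a simplified delimited tokenizer with the suggested 'enforceFormatErrors' property might look like the following. This is a sketch only, not the real DelimitedLineTokenizer:

import java.util.Arrays;
import java.util.List;

public class ToggleableDelimitedTokenizer {

    private String[] names = new String[0];
    private boolean enforceFormatErrors = true;

    public void setNames(String[] names) {
        this.names = names;
    }

    public void setEnforceFormatErrors(boolean enforceFormatErrors) {
        this.enforceFormatErrors = enforceFormatErrors;
    }

    public List<String> tokenize(String line) {
        // limit = -1 keeps trailing empty tokens, so "a,b," yields 3 tokens.
        List<String> tokens = Arrays.asList(line.split(",", -1));
        if (enforceFormatErrors && names.length > 0 && tokens.size() != names.length) {
            // Reuses the FlatFileFormatException sketched above.
            throw new FlatFileFormatException("Expected " + names.length
                    + " tokens but found " + tokens.size(), line);
        }
        return tokens;
    }
}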


Affects: 1.0.1

Referenced from: commits https://github.com/spring-projects/spring-batch/commit/402bb81c409b1b0856404ce7fa3c8685ae697fa3, https://github.com/spring-projects/spring-batch/commit/4156e441bad7b593300fbf78f6cd442a3aaafd10, https://github.com/spring-projects/spring-batch/commit/3757b239d007483bb4f480012377b4095ce9e0ec, https://github.com/spring-projects/spring-batch/commit/fa5bcc334441984f33edb8b3a4c69563faf20829

spring-projects-issues commented 16 years ago

Lucas Ward commented

Both Delimited and FixedLength tokenizers should be more consistent:

Delimited:

FixedLength:

I ultimately decided that if someone didn't want the strictness in the delimited case, they could simply not set names; and if they had extra input in the fixed-length case that they didn't care about, they could still provide a range indicating that they expected it to be there (otherwise we have no way of knowing).
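
In configuration terms, that decision reads roughly like this (class and setter names as in later Spring Batch releases; the exact 1.0.x API may differ):

import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.batch.item.file.transform.FixedLengthTokenizer;
import org.springframework.batch.item.file.transform.Range;

public class TokenizerSetupExamples {

    public void configure() {
        // Lenient: no names set, so any token count is accepted and the
        // caller checks FieldSet sizes itself.
        DelimitedLineTokenizer lenient = new DelimitedLineTokenizer();

        // Strict: setting names implies an expected token count, so a
        // mismatched line becomes a format error.
        DelimitedLineTokenizer strict = new DelimitedLineTokenizer();
        strict.setNames(new String[] { "A", "B", "C" });

        // Fixed length: declare a range even for trailing data you intend to
        // ignore, so the tokenizer knows the full expected line length.
        FixedLengthTokenizer fixed = new FixedLengthTokenizer();
        fixed.setColumns(new Range[] { new Range(1, 10), new Range(11, 25), new Range(26, 30) });
    }
}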

spring-projects-issues commented 16 years ago

Dave Syer commented

Assume closed as resolved and released