
Consistent handling of LineTokenizer formatting errors [BATCH-700] #2877

Closed. spring-projects-issues closed this issue 16 years ago.

spring-projects-issues commented 16 years ago

Lucas Ward opened BATCH-700 and commented

The various LineTokenizer implementations don't consistently handle formatting errors, nor is there any way to determine that an error was related to formatting rather than parsing. Below is an example from DelimitedLineTokenizer:

public void testTooFewNames() {
    tokenizer.setNames(new String[] {"A", "B"});
    try {
        tokenizer.tokenize("a,b,c");
        fail("Expected IllegalArgumentException");
    }
    catch (IllegalArgumentException e) {
        // expected
    }
}

public void testTooManyNames() {
    tokenizer.setNames(new String[] {"A", "B", "C", "D"});
    FieldSet line = tokenizer.tokenize("a,b,c");
    assertEquals(4, line.getFieldCount());
    assertEquals("c", line.readString("C"));
    assertEquals(null, line.readString("D"));
}

If there are fewer names than columns in the data, an IllegalArgumentException is thrown. However, if there are more names than columns, the extra fields are passed through as null, likely causing a parsing error in the FieldSetMapper. FixedLengthLineTokenizer is somewhat similar:

public void testLongerLinesRestIgnored() throws Exception {
    tokenizer.setColumns(new Range[] {new Range(1,10),new Range(11,25),new Range(26,30)});
    // test shorter line as defined by record descriptor
    line = "H1";
    FieldSet tokens = tokenizer.tokenize(line);
    // test longer lines => rest will be ignored
    line = "H1        12345678       1234567890";
    tokens = tokenizer.tokenize(line);
    assertEquals(3, tokens.getFieldCount());
    assertEquals(line.substring(0, 10).trim(), tokens.readString(0));
    assertEquals(line.substring(10, 25).trim(), tokens.readString(1));
    assertEquals(line.substring(25, 30).trim(), tokens.readString(2));      
}

If there is more data in the line than the ranges allow for, the extra data is simply ignored. However, if the input string is shorter than the ranges require, empty strings are returned for the fields beyond the end of the input:

public void testTokenizeSmallerStringThanRanges() {
    tokenizer.setColumns(new Range[] {new Range(1,5),new Range(6,10),new Range(11,15)});
    FieldSet tokens = tokenizer.tokenize("12345");
    assertEquals(3, tokens.getFieldCount());
    assertEquals("12345", tokens.readString(0));
    assertEquals("", tokens.readString(1));
    assertEquals("", tokens.readString(2));
}

An empty string, however, will yield a FieldSet with 0 fields rather than 3 empty fields:

public void testTokenizeEmptyString() {
    tokenizer.setColumns(new Range[] {new Range(1,5),new Range(6,10),new Range(11,15)});
    FieldSet tokens = tokenizer.tokenize("");
    assertEquals(0, tokens.getFieldCount());
}

The reason I began investigating how bad formatting is handled came up due to a use case at a client. There's a workflow for a batch job that separates records that are bad because they were formatted incorrectly from records that failed with a parsing error. However, because of the behavior above, I have no reliable way to determine why a line failed.

For example, if I expect there to be 3 columns in a delimited file (by setting the names) and there are only two, an IllegalArgumentException is thrown and wrapped in a FlatFileParseException, and the FieldSetMapper is never called. However, if there are more columns than I expected (indicating an extra comma), then I get back a FieldSet but need to manually check for any extra columns to ensure that the line was formatted correctly; from that check I can throw an exception, which gets wrapped in a FlatFileParseException (which I need in order to get the original line) that I can then log out. The only way I can get this to work is by not setting the names and manually checking the size. Then, in my listener, I can check the cause of the FlatFileParseException to see whether it is a FormatException or a NumberFormatException (caused by trying to parse an alpha value into a BigDecimal); a sketch of this appears after the list below.

None of it really behaved the way I expected it to. I expected that if I configured a tokenizer to have a certain number of columns (whether by setting ranges or names), the tokenizer would throw an exception when it encountered a record with too few or too many columns. Of course, if it did, I suppose I would be in bigger trouble, since there would be no way to determine exactly why a FlatFileParseException was thrown. In my mind there are three separate types of errors:

  1. Format - the line wasn't in the correct format: it was missing a comma, had more data than expected, etc.
  2. Parse - the data within a column couldn't be parsed to the expected type.
  3. Validation - the data was correctly formatted and parsed but doesn't fall within accepted ranges for business reasons.
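
Returning to the workaround above: here is a rough sketch of the cause inspection I described. The LineErrorClassifier class and ErrorType enum are hypothetical illustrations, not Spring Batch API; only FlatFileParseException comes from the framework (package name as in later releases):

import org.springframework.batch.item.file.FlatFileParseException;

// Hypothetical classifier mirroring the three error types listed above.
public class LineErrorClassifier {

    public enum ErrorType { FORMAT, PARSE, UNKNOWN }

    public ErrorType classify(FlatFileParseException e) {
        Throwable cause = e.getCause();
        // NumberFormatException extends IllegalArgumentException, so it must
        // be tested first.
        if (cause instanceof NumberFormatException) {
            // e.g. an alpha value read as a BigDecimal: a parse error.
            return ErrorType.PARSE;
        }
        if (cause instanceof IllegalArgumentException) {
            // Too few tokens for the configured names: a format error.
            return ErrorType.FORMAT;
        }
        return ErrorType.UNKNOWN;
    }
}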

Because of these three distinct error types, I would like LineTokenizers to throw FlatFileFormatException (a new exception class) if any formatting errors are encountered, with FieldSetMapper throwing FlatFileParseException. This would make it possible to correctly determine the cause of an error when reading a file.
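
A minimal sketch of what such an exception might look like; the constructor shape and the getInput() accessor are assumptions on my part, not a committed design:

// Sketch of the proposed exception. Carrying the raw input line means a
// listener no longer needs the wrapping FlatFileParseException just to log it.
public class FlatFileFormatException extends RuntimeException {

    private final String input;

    public FlatFileFormatException(String message, String input) {
        super(message);
        this.input = input;
    }

    /** The raw line that failed to tokenize. */
    public String getInput() {
        return input;
    }
}

A tokenizer configured with names could then do something like:

// Hypothetical check inside a tokenizer once names are set:
if (tokens.size() != names.length) {
    throw new FlatFileFormatException("Expected " + names.length
            + " tokens but found " + tokens.size(), line);
}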

Another approach would be to never throw an exception and leave it to the developer to validate the FieldSet themselves. However, to me this violates the principle of least astonishment: I expect that if I configure a LineTokenizer to tokenize lines with certain properties, it will throw an exception when a line doesn't match them. Although, I suppose there could be a property such as 'enforceFormatErrors' that would determine which approach is taken.
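
To illustrate the toggle idea, a simplified delimited tokenizer with the suggested 'enforceFormatErrors' property might look like the following. This is a sketch only, not the real DelimitedLineTokenizer:

import java.util.Arrays;
import java.util.List;

public class ToggleableDelimitedTokenizer {

    private String[] names = new String[0];
    private boolean enforceFormatErrors = true;

    public void setNames(String[] names) {
        this.names = names;
    }

    public void setEnforceFormatErrors(boolean enforceFormatErrors) {
        this.enforceFormatErrors = enforceFormatErrors;
    }

    public List<String> tokenize(String line) {
        // limit = -1 keeps trailing empty tokens, so "a,b," yields 3 tokens.
        List<String> tokens = Arrays.asList(line.split(",", -1));
        if (enforceFormatErrors && names.length > 0 && tokens.size() != names.length) {
            // Reuses the FlatFileFormatException sketched above.
            throw new FlatFileFormatException("Expected " + names.length
                    + " tokens but found " + tokens.size(), line);
        }
        return tokens;
    }
}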


Affects: 1.0.1

Referenced from: commits https://github.com/spring-projects/spring-batch/commit/402bb81c409b1b0856404ce7fa3c8685ae697fa3, https://github.com/spring-projects/spring-batch/commit/4156e441bad7b593300fbf78f6cd442a3aaafd10, https://github.com/spring-projects/spring-batch/commit/3757b239d007483bb4f480012377b4095ce9e0ec, https://github.com/spring-projects/spring-batch/commit/fa5bcc334441984f33edb8b3a4c69563faf20829

spring-projects-issues commented 16 years ago

Lucas Ward commented

Both Delimited and FixedLength tokenizers should be more consistent:

Delimited:

FixedLength:

I ultimately decided that if someone didn't want the strictness in the delimited case, they could simply not set names; and if they had extra input in the fixed-length case that they didn't care about, they could still provide a range indicating that they expected it to be there (otherwise we have no way of knowing).
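
In configuration terms, that decision reads roughly like this (class and setter names as in later Spring Batch releases; the exact 1.0.x API may differ):

import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.batch.item.file.transform.FixedLengthTokenizer;
import org.springframework.batch.item.file.transform.Range;

public class TokenizerSetupExamples {

    public void configure() {
        // Lenient: no names set, so any token count is accepted and the
        // caller checks FieldSet sizes itself.
        DelimitedLineTokenizer lenient = new DelimitedLineTokenizer();

        // Strict: setting names implies an expected token count, so a
        // mismatched line becomes a format error.
        DelimitedLineTokenizer strict = new DelimitedLineTokenizer();
        strict.setNames(new String[] { "A", "B", "C" });

        // Fixed length: declare a range even for trailing data you intend to
        // ignore, so the tokenizer knows the full expected line length.
        FixedLengthTokenizer fixed = new FixedLengthTokenizer();
        fixed.setColumns(new Range[] { new Range(1, 10), new Range(11, 25), new Range(26, 30) });
    }
}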

spring-projects-issues commented 16 years ago

Dave Syer commented

Assume closed as resolved and released