osiegmar / FastCSV

CSV library for Java that is fast, RFC-compliant and dependency-free.
https://fastcsv.org/
MIT License
542 stars 93 forks source link

Support for coping with invalid quote chars #74

Closed dankito closed 1 year ago

dankito commented 1 year ago

First of all thanks a lot for your great work and this amazing library!

In a project i have to cope with some real world data from an external provider that contains quote chars within a cell like:

;"de:14612:735:1";"landwärts (Straße "Altkaitz")";"51,014313

"de:14627:4000:4";"am "Ross" (RV)";"51,163477"

"de:14628:1148:1";"Ri. "Am Windberg"";"51,002455"

To cope with these some little changes to RowReader would be required:

    boolean consume(final RowHandler rh, final char[] lBuf, final int lLen) {
        // ...

                        if (c == qChar) {
                            if (isInvalidQuoteChar(lPos, lLen, lBuf) == false) { // ignores invalid quote chars
                                lStatus &= ~STATUS_QUOTED_MODE;
                            }
                            continue mode_check;
                        }

        // ...
  }

    private boolean isInvalidQuoteChar(int lPos, int lLen, char[] lBuf) {
        if (lPos < lLen) {
            char nextChar = lBuf[lPos];
            if (nextChar != fsep && nextChar != CR && nextChar != LF) {
                return true;
            }
        }
        return false;
    }

So i would kindly ask you if it would be possible to integrate it? If so i could add an option to CsvReader(Builder) and provide a Pull Request for this.

osiegmar commented 1 year ago

Thanks for your feedback!

I do have a few concerns about adding this functionality as adding support for "even more broken" CSV data often comes with:

Although, I'm not completely against it. It highly depends on the implementation and test cases – including performance checks. I'd be curious to see a PR but I can't promise to merge.

Your example would probably not work when the quote character sits exactly on the last buffer position (when you have no nextChar without re-fetching data). Your third example ("de:14628:1148:1";"Ri. "Am Windberg"";"51,002455") raises an interesting question – what to do if the fields "seem" to contain valid and invalid quote characters? The result would only be Ri. "Am Windberg" if ignoring a proper escaped quote character!

BTW: I did a quick check within the https://github.com/osiegmar/JavaCsvComparison project and found that none of the tested libraries seem to support this broken format (in the way you're expecting it).

dankito commented 1 year ago

You're right in all three points. New features (almost always) lead to higher complexity. I introduced a boolean flag so the performance degree should be negligible. For tests see the PR.

"when the quote character sits exactly on the last buffer position": That's a good point and here i need your help. I made it that if lPos == lLen more data gets loaded:

if (c == qChar) {
    if (ignoreInvalidQuoteChars && lPos == lLen) { // end of buffer reached we first need to fetch more data
        // TODO: but on next iteration lPos is then lPos + 1 so quote char won't be checked again
        continue mode_check;
    }
    if (ignoreInvalidQuoteChars == false || isInvalidQuoteChar(lPos, lLen, lBuf) == false) {
        lStatus &= ~STATUS_QUOTED_MODE;
    }
    continue mode_check;
}

But then on next iteration lPos is incremented by one so that the quote char won't be checked again. So invalid quote chars at the last position of the buffer work, see test case ignoreInvalidQuoteChars_InvalidQuoteCharAtEndOfBuffer(). But valid quote chars at the end of the buffer don't work this way, see test case ignoreInvalidQuoteChars_ValidQuoteCharAtEndOfBuffer().

Can you give me any hint how to solve this? (I could introduce a property recheckLastCharacter if that's in your sense.)

osiegmar commented 1 year ago

This thing reminds me a bit of https://github.com/osiegmar/FastCSV/issues/67.

Probably you need to set the status to something you could check on in the next loop iteration – something like lStatus |= STATUS_LAST_CHAR_WAS_QUOTE before the continue mode_check.

dankito commented 1 year ago

That was the perfect hint, thank you! Now it works exactly as expected.

osiegmar commented 1 year ago

Thanks for your work on this. I studied the PR and thought quite some time about this. I also re-read the RFC and its upcoming revision.

Unfortunately I came to the conclusion, that I won't merge this into FastCSV.

I decided so, mainly because of the increased complexity, the fact that this ill-formed CSV format seems to be an extreme edge case and the main objective(s) of this project. Not even the corner cases in the RFC revision mentions unescaped quotes within quoted fields – only different quote escape characters.

Also I'm thinking about another CSV library for quite some time. A CSV library that main objectives could be clean code, maintainability and expandability. I'll keep this PR in mind for that.

dankito commented 1 year ago

I know it's licensed under MIT license, but would like to ask anyway: Would you mind me then forking your project and releasing it as Kotlin (Multiplatform) version including this fix?

osiegmar commented 1 year ago

Nope, feel free to do so.