nietras / Sep

World's Fastest .NET CSV Parser. Modern, minimal, fast, zero allocation, reading and writing of separated values (`csv`, `tsv` etc.). Cross-platform, trimmable and AOT/NativeAOT compatible.
http://nietras.com
MIT License
914 stars, 34 forks

Sep preserves space after quoted column headers. #154

Closed bryanboettcher closed 1 month ago

bryanboettcher commented 2 months ago

Hi @nietras, me again 👋

We have a CSV with the following headers: "FIRSTNAME","LASTNAME","ADDRESS","CITY","STATE","ZIP"

Look very carefully -- the row has a single space after the closing quote after "ZIP". When parsing, the header for the last column is named `ZIP ` (with the trailing space). IMO, this behavior is incorrect, since the column value was correctly enclosed in quotes.

Minimal repro with output:

    using System;
    using System.Globalization;
    using System.IO;
    using nietras.SeparatedValues;

    static class Program
    {
        static void Main(string[] args)
        {
            // Note the single trailing space after the closing quote on ""ZIP"".
            const string FileData = @"""FIRSTNAME"",""LASTNAME"",""ADDRESS"",""CITY"",""STATE"",""ZIP"" 
""JOE"",""SMITH"",""123 MAIN ST"",""BEVERLY HILLS"",""CA"",""90210-1234""";

            using var stringReader = new StringReader(FileData);
            using var sepReader = Sep.Reader(o => o with
            {
                ColNameComparer = StringComparer.OrdinalIgnoreCase,
                CultureInfo = CultureInfo.InvariantCulture,
                Unescape = true,
                Sep = Sep.New(','),
                DisableQuotesParsing = false,
                DisableColCountCheck = true
            }).From(stringReader);

            // Print each header name in brackets so stray whitespace is visible.
            foreach (var columnName in sepReader.Header.ColNames)
            {
                Console.WriteLine($"[{columnName}]");
            }
        }
    }

Output:

    [FIRSTNAME]
    [LASTNAME]
    [ADDRESS]
    [CITY]
    [STATE]
    [ZIP ]

If you try to run this repro, make sure the FileData const has a single space after the closing quote on "ZIP". I'm not sure if GitHub markdown will strip that out or not.

nietras commented 1 month ago

@bryanboettcher thanks again for the nice repro and for filing this issue. This behavior is by design. If you look in the README at https://github.com/nietras/Sep?tab=readme-ov-file#unescaping, this is covered by the `"a"·` case, and it is identical to CsvHelper's behavior when errors are turned off. Note also that Sylvan simply throws in that case. The column is invalidly defined with that space after the quote, so there is no true answer here.

| Input | Valid | CsvHelper | CsvHelper¹ | Sylvan | Sep² |
|---|---|---|---|---|---|
| `"a"·` | False | EXCEPTION | `a·` | EXCEPTION | `a·` |

Trimming before unescaping could help. Trimming is tracked in #74, but it is not straightforward: some want trimming before unescaping, some after unescaping, some both, etc. And yes, trimming will likely impact perf.
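To make the ordering question concrete, here is a minimal sketch of the three variants. It assumes a simplified, hypothetical `Unescape` helper (not Sep's actual implementation or API) that only touches columns starting with a quote, drops the enclosing quotes, and collapses doubled quotes. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helpers for illustration only; not part of Sep's API.
static class TrimOrderSketch
{
    // Simplified unescape (an assumption, not Sep's algorithm):
    // only columns that start with a quote are changed. Opening and
    // closing quotes are dropped, and a doubled quote ("") inside
    // quotes becomes a single quote.
    public static string Unescape(string col)
    {
        if (col.Length == 0 || col[0] != '"') { return col; }
        var sb = new StringBuilder(col.Length);
        var inQuotes = false;
        for (var i = 0; i < col.Length; i++)
        {
            var c = col[i];
            if (c != '"') { sb.Append(c); continue; }
            if (inQuotes && i + 1 < col.Length && col[i + 1] == '"')
            {
                sb.Append('"'); // escaped quote, keep one
                i++;            // skip the second quote of the pair
            }
            else
            {
                inQuotes = !inQuotes; // opening or closing quote, dropped
            }
        }
        return sb.ToString();
    }

    // Trim whitespace outside the quotes first, then unescape.
    public static string TrimThenUnescape(string col) => Unescape(col.Trim());

    // Unescape first, then trim whatever whitespace remains.
    public static string UnescapeThenTrim(string col) => Unescape(col).Trim();

    // Trim both before and after unescaping.
    public static string TrimBoth(string col) => Unescape(col.Trim()).Trim();
}
```

With this sketch, the raw column `"ZIP" ` (trailing space) becomes `ZIP` under all three variants, but the variants disagree elsewhere. For `" a "`, `TrimThenUnescape` keeps the quoted whitespace and returns ` a `, while `UnescapeThenTrim` returns `a`. For ` "a"` (leading space), `TrimThenUnescape` returns `a`, while `UnescapeThenTrim` returns `"a"` because the column no longer starts with a quote.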

Would trimming solve your issue here? And what kind of trimming would you prefer? How would you want the option for it to look?

bryanboettcher commented 1 month ago

Trimming would solve my issue here. My expectation is that with quote parsing and escaping on, everything between the quotes would be considered the column name, regardless of any whitespace outside the quotes.

What a mess. I don't envy you having to solve that. (but I'd like for you to! 😍)
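Until trimming is available as an option, one possible caller-side workaround is to trim the header names after reading and look columns up by index. The sketch below is an assumption about how that could be done, not something proposed in this thread: `HeaderTrimSketch` and `TrimmedNameToIndex` are hypothetical names, and the helper only relies on the header names being an `IReadOnlyList<string>` such as `sepReader.Header.ColNames` from the repro.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of Sep's API.
static class HeaderTrimSketch
{
    // Maps each whitespace-trimmed column name to its column index.
    public static Dictionary<string, int> TrimmedNameToIndex(
        IReadOnlyList<string> colNames, IEqualityComparer<string> comparer)
    {
        var map = new Dictionary<string, int>(comparer);
        for (var i = 0; i < colNames.Count; i++)
        {
            // TryAdd keeps the first index if two names collide after trimming.
            map.TryAdd(colNames[i].Trim(), i);
        }
        return map;
    }
}
```

For the repro's header, `HeaderTrimSketch.TrimmedNameToIndex(sepReader.Header.ColNames, StringComparer.OrdinalIgnoreCase)["ZIP"]` would yield index 5, which could then be used to read the column by index instead of by name. This costs a dictionary allocation per file, so it trades away some of the zero-allocation benefit.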

nietras commented 1 month ago

Closing as tracked by #74 :)