redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/
7.99k stars 789 forks source link

CSV parsing skips first column #2117

Open webfrank opened 9 months ago

webfrank commented 9 months ago

Hi I have a simple CSV:

Date;Country;City;Start HRFull;Total Users;Conference Id
2023/08/05;Czech Republic;;00:08:59;1;7f6de524-1edd-4a2f-a96b-36b4afc68705
2023/08/05;Czech Republic;;00:33:01;1;7f6de524-1edd-4a2f-a96b-36b4afc68705
2023/08/05;Czech Republic;;06:58:24;1;e07f5389-6f2b-46fc-a3a6-f1627169d2fc
2023/08/05;Czech Republic;;06:58:47;1;e07f5389-6f2b-46fc-a3a6-f1627169d2fc
2023/08/05;Czech Republic;;06:59:51;1;e07f5389-6f2b-46fc-a3a6-f1627169d2fc
2023/08/05;Czech Republic;;08:01:52;1;7f6de524-1edd-4a2f-a96b-36b4afc68705
2023/08/05;Czech Republic;;08:20:00;1;7f6de524-1edd-4a2f-a96b-36b4afc68705
2023/08/05;Czech Republic;;08:32:53;1;2ea6df2c-5378-4669-85f7-0c46beb42bbc
2023/08/05;Czech Republic;;08:39:08;1;7f6de524-1edd-4a2f-a96b-36b4afc68705

and a simple pipeline:

input:
  file:
    paths:
    - test.csv
    codec: csv:;

pipeline:
  processors:
  - mapping: |
      root = [
        this.Date,
        this.Country,
        this.City
      ]

in the output the first element (this.Date) is null:

[null,"Czech Republic",""]
[null,"Czech Republic",""]
[null,"Czech Republic",""]
[null,"Czech Republic",""]
[null,"Czech Republic",""]
[null,"Czech Republic",""]
[null,"Czech Republic",""]
[null,"Czech Republic",""]
[null,"Czech Republic",""]

but if I remove the pipeline the output is:

{"City":"","Conference Id":"7f6de524-1edd-4a2f-a96b-36b4afc68705","Country":"Czech Republic","Start HRFull":"00:08:59","Total Users":"1","Date":"2023/08/05"}
{"City":"","Conference Id":"7f6de524-1edd-4a2f-a96b-36b4afc68705","Country":"Czech Republic","Start HRFull":"00:33:01","Total Users":"1","Date":"2023/08/05"}
{"City":"","Conference Id":"e07f5389-6f2b-46fc-a3a6-f1627169d2fc","Country":"Czech Republic","Start HRFull":"06:58:24","Total Users":"1","Date":"2023/08/05"}
{"City":"","Conference Id":"e07f5389-6f2b-46fc-a3a6-f1627169d2fc","Country":"Czech Republic","Start HRFull":"06:58:47","Total Users":"1","Date":"2023/08/05"}
{"City":"","Conference Id":"e07f5389-6f2b-46fc-a3a6-f1627169d2fc","Country":"Czech Republic","Start HRFull":"06:59:51","Total Users":"1","Date":"2023/08/05"}
{"City":"","Conference Id":"7f6de524-1edd-4a2f-a96b-36b4afc68705","Country":"Czech Republic","Start HRFull":"08:01:52","Total Users":"1","Date":"2023/08/05"}
{"City":"","Conference Id":"7f6de524-1edd-4a2f-a96b-36b4afc68705","Country":"Czech Republic","Start HRFull":"08:20:00","Total Users":"1","Date":"2023/08/05"}
{"City":"","Conference Id":"2ea6df2c-5378-4669-85f7-0c46beb42bbc","Country":"Czech Republic","Start HRFull":"08:32:53","Total Users":"1","Date":"2023/08/05"}
{"City":"","Conference Id":"7f6de524-1edd-4a2f-a96b-36b4afc68705","Country":"Czech Republic","Start HRFull":"08:39:08","Total Users":"1","Date":"2023/08/05"}

so "Date" is parsed correctly from CSV

If I add a dummy first column this.Date get the right value.

It does not work on macOS, I tested on Linux and it works as expected.

Issue is related to line endings. The CSV was exported from Excel, rewriting it worked. Probably CSV parser should handle every line ending combination.

Jeffail commented 9 months ago

Hey @webfrank, I suspect this might be an issue in your environment on macOS, I just tried it on my macbook and the example works fine, gives me:

["2023/08/05","Czech Republic",""]
["2023/08/05","Czech Republic",""]
["2023/08/05","Czech Republic",""]
["2023/08/05","Czech Republic",""]
["2023/08/05","Czech Republic",""]
["2023/08/05","Czech Republic",""]
["2023/08/05","Czech Republic",""]
["2023/08/05","Czech Republic",""]
["2023/08/05","Czech Republic",""]
webfrank commented 9 months ago

Hi, it works because you are copying directly what I have pasted. This works to me also.

Making this I understood it was a line ending issue.

I'll attach the original converted file. test.csv

Edited the file with an HEX editor

Schermata 2023-09-16 alle 12 09 20

Excel adds three bytes which are not visible and when you try to access the field it doesn't work although it seems correct.

The three bytes are the BOM (Byte Order Mark) which Excel adds and expect to correctly parse a CSV. Is it possible to handle BOM and Non-BOM headers directly in Benthos?

peczenyj commented 9 months ago

I had a lot of issues with BOM in the past, so it makes some sense. It maybe create a field as “FE BB BF Date”

can you try to list the keys inside the pipeline?

webfrank commented 9 months ago

Hi, the issue is there, if BOM is present (and not printable), will be included in first field key but you will not be able to access it.

I've created a pull request (https://github.com/benthosdev/benthos/pull/2118) to fix this using this library: https://github.com/dimchansky/utfbom

webfrank commented 9 months ago

Ok, it seems skipbom is already present as additional codec "skipbom/csv". Probably a reference in the CSV codec would help finding it. One last thing, the CSV input comonent should have an option to enable skip BOM at this point to be fully compatible with the codec.