vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

CSV support for `file` source #6767

Open jszwedko opened 3 years ago

jszwedko commented 3 years ago

A user in Discord was trying to figure out how to use remap to handle CSV files: https://discord.com/channels/742820443487993987/746070591097798688/821038938424344626

They were able to make it work by specifying the headers statically in their remap script, like:

fields = split(.message, ",")
.SourceFile = fields[0]

But it seems like it'd be nice for the `file` source to support reading CSV files natively, generating events that use the CSV header to name the fields of each line.

fdamstra commented 6 months ago

A coworker came up with a possible way to handle CSVs (credit goes to him, not me, for this idea and these words). We haven't tried this yet, but it seems like it could work for small files:

The unnest will create new messages downstream, one per line, each with a .metadata.header and a .message, which can be further parsed/split and matched with the header values (how?).

The main limitation here is file size: to do this, it needs to store the whole CSV in memory, so that's probably a showstopper for many use cases. If anyone has ideas for improvements that would get around that, then it may be a feasible idea.

brandonburchett commented 2 months ago

Could a `line_number` property be introduced in `read_from`? That way you could start at the second line, hardcode your headers as a local variable, and then reference them against the array returned from `parse_csv`.

Edit: a "beginning-skip-first-line" option would also work.
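
Without such an option, a remap transform can approximate the same idea. The following is a minimal sketch, not a tested configuration: the component ids `csv_file` and `parse_csv_rows` and the column names `Size` and `Owner` are assumptions made up for illustration (only `SourceFile` appears earlier in this thread). It parses each line with `parse_csv`, drops the line that matches the hardcoded header, and names the remaining fields by position.

```toml
[transforms.parse_csv_rows]
type = "remap"
inputs = ["csv_file"]  # assumed id of a `file` source reading the CSV
drop_on_abort = true   # events that hit `abort` below are dropped
source = '''
# Hardcoded header; these column names are assumptions for illustration.
header = ["SourceFile", "Size", "Owner"]

# parse_csv handles quoted fields, unlike a plain split on ",".
row = parse_csv!(.message)

# The file's header line parses to the same array as `header`, so drop it.
if row == header {
  abort
}

.SourceFile = row[0]
.Size = row[1]
.Owner = row[2]
'''
```

Every field comes out as a string here, and a data row that happens to equal the header would be dropped as well, so this only works when the header is known ahead of time and is distinguishable from the data.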