mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License

[Idea/Suggestion] Improved support for "header" lines in input files #612

Open Eric24 opened 5 years ago

Eric24 commented 5 years ago

A rather common CSV format found in things like web server logs is to have one or more "header" lines at the beginning of the file, like this:

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields
2019-01-04  11:37:39    ORD50-C1    1030    0.0.0.0 GET xxxxxxxxxx.domain.net   /   200 -   Mozilla/5.0%2520(Windows%2520NT%252010.0;%2520Win64;%2520x64)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/71.0.3578.98%2520Safari/537.36  -   -   Test    qiW8xjO4CoRnsrW843xPY4fE_sKPZp7YYgpeskJhd3N9ZkArsmUvrw==    host.domain.com https   283 0.188   -   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss    HTTP/2.0    -   -

Obviously this library can't be expected to deal with every such header format, but I think this could be solved fairly simply by adding two props to options: a prefix that identifies such header lines, and a function (maybe called something like parseHeaders). The idea is to let a custom function look at the first n lines of the file until it figures out the headers and returns them.

When the library starts processing a file, it's in a "seeking header information" state and not parsing data. Every line that starts with the specified prefix is passed, in its entirety, to the parseHeaders function. That function can either return a falsy value or a hash or array that contains the headers. Once the headers are returned, the library moves to a "seeking data" state (but it continues passing any line identified by the prefix to this function, in case other such header lines appear after the one that contains the column headers).
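A minimal sketch of the state machine described above, written as a standalone function rather than as PapaParse internals (headerPrefix and parseHeaders are the proposed, hypothetical names; neither exists in the current API):

```javascript
// Sketch of the proposed "seeking header information" behaviour.
// parseHeaders receives each prefixed line and may return the headers.
function extractHeaders(lines, headerPrefix, parseHeaders) {
  let headers = null;
  const dataLines = [];
  for (const line of lines) {
    if (line.startsWith(headerPrefix)) {
      // Every prefixed line is handed to parseHeaders, even after
      // headers are found, in case later header lines matter too.
      const result = parseHeaders(line);
      if (result && !headers) headers = result;
    } else if (headers) {
      dataLines.push(line); // "seeking data" state: keep real rows
    }
  }
  return { headers, dataLines };
}

// Example: pull column names out of a W3C-style "#Fields:" line.
const { headers } = extractHeaders(
  ['#Version: 1.0', '#Fields: date time sc-bytes', '2019-01-04 11:37:39 1030'],
  '#',
  line => line.startsWith('#Fields:')
    ? line.slice('#Fields:'.length).trim().split(/\s+/)
    : null
);
// headers → ['date', 'time', 'sc-bytes']
```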

In any case, the general idea of getting the column headers from the file is already there, but it's currently limited to looking for headers on the first non-comment/non-blank row, and they must follow the same format as the data with regard to delimiters, etc. This concept allows extensible support for more complex header formats, including cases where the column headers are not itemized in the file at all but are instead identified by some sort of constant, etc.

I realize this is somewhat similar to #186, but in the case of processing a stream (i.e. in Node), you have no control over the chunk size and can therefore not rely on the entire header being in one chunk. This concept works at the line level, after line-level parsing.

dboskovic commented 5 years ago

I think what might work is some sort of findHeaders callback that provides the reader instance and allows a user to write some custom logic that would extract and return the header along with an index for where "data starts". This way we're not trying to add support for specific edge cases, but allowing end developers to add some logic of their own where needed.

Eric24 commented 5 years ago

Interesting. I definitely agree that this should be as "generic" as possible, so it can support anything at the start of a file, even if it's only to ignore it. With what you describe, at what point would the custom logic get access to the file data, and in what format? Currently, I'm handling this case with a separate transform function between the readstream and the parser. It's a little gross, since it must "accumulate" chunk data until it has read enough of the file to reach the column headers.

By the way, a headers property on the parser could be an alternative solution. For headerless files, it would allow the user to specify headers statically. Potentially, both features could be combined with the existing header property (false = no headers; true = get headers from the first line of the file; array = static list of headers; function = get headers with the findHeaders function). Everything else works as-is, including transformHeader, which, if supplied, would still receive the individual headers regardless of where they came from.
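The combined header option described above is a simple type dispatch. A sketch (the array and function forms are hypothetical extensions, not part of current PapaParse):

```javascript
// Sketch of the proposed combined `header` option:
//   false    → headerless file
//   true     → headers are on the first line (current behaviour)
//   array    → static list of headers
//   function → custom logic produces the headers
function resolveHeaders(option, firstLine) {
  if (option === false) return null;
  if (option === true) return firstLine.split(',');
  if (Array.isArray(option)) return option;
  if (typeof option === 'function') return option(firstLine);
  throw new TypeError('unsupported header option');
}
```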

dboskovic commented 5 years ago

Yeah, that's a legitimate point. It'd have to be a function that gets the raw FileReader, or I think it could get corrupted pretty easily by other parsing options.

#609 suggests the headers option for the unparse functionality. Might make sense to consider it for parse as well.

Eric24 commented 5 years ago

Another thought on this topic: perhaps a more generic way to handle this, which opens up some other possibilities as well, is an option for a function that simply gets a "look" at each line as it's received, but before any other parsing is done. Something like transformHeader, but transformLine. The function can then return the line as-is as a string (essentially the default behavior), return a manipulated version of it as a string, return null (to ignore it), or explicitly set the headers by returning an array. This would solve the "headers in strange formats" issue, as well as allowing "whole line" manipulation for other purposes. Support for headers as an array is still needed (both for headerless files and for the transformLine function), but it would eliminate the need for anything like the findHeaders function and make the whole thing more generic.
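Applied line by line, the proposed contract (transformLine is a hypothetical option name, not in the PapaParse API) could behave like this:

```javascript
// Sketch of the proposed transformLine semantics: return a string to
// keep or replace the line, null to drop it, or an array to set the
// headers explicitly.
function applyTransformLine(lines, transformLine) {
  let headers = null;
  const kept = [];
  for (const line of lines) {
    const result = transformLine(line);
    if (result === null) continue;   // ignore this line
    if (Array.isArray(result)) {     // explicit header assignment
      headers = result;
      continue;
    }
    kept.push(result);               // original or manipulated line
  }
  return { headers, lines: kept };
}

// Example: turn a "#Fields:" line into headers, drop other comments.
const out = applyTransformLine(
  ['#Version: 1.0', '#Fields: date time', '2019-01-04 11:37:39'],
  line => line.startsWith('#Fields:')
    ? line.slice('#Fields:'.length).trim().split(/\s+/)
    : line.startsWith('#') ? null
    : line
);
// out.headers → ['date', 'time']; out.lines → ['2019-01-04 11:37:39']
```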

theLAZYmd commented 2 years ago

3 years on, this issue is still salient

pokoli commented 2 years ago

@theLAZYmd feel free to propose a solution for it

theLAZYmd commented 2 years ago

I'll draft a PR, but I think it should be possible to provide a config.headerLines option and modify the config.transformHeader function to take two more params: acc, the existing header for that column, and j, the row number. The function then uses those to generate the new header as it iterates over the rows in a reducer fashion.
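The multi-line reduction described above might look like the following sketch (the four-argument callback signature is the proposal from this thread, not the released PapaParse API):

```javascript
// Sketch of the proposed reducer-style transformHeader: for each header
// row j, the callback receives the cell value, its column index, the
// accumulated header for that column so far (acc), and the row number.
function reduceHeaderRows(headerRows, transformHeader) {
  let acc = new Array(headerRows[0].length).fill('');
  headerRows.forEach((row, j) => {
    acc = row.map((header, index) =>
      transformHeader(header, index, acc[index], j));
  });
  return acc;
}

// Example: join two header rows ("Price" over "USD" → "Price USD").
const joined = reduceHeaderRows(
  [['Price', 'Volume'], ['USD', 'Shares']],
  (header, index, acc, j) => (j === 0 ? header : `${acc} ${header}`)
);
// joined → ['Price USD', 'Volume Shares']
```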


theLAZYmd commented 2 years ago

I'm happy to write the code and make the PR, but can someone make further commits to the PR to update the documentation? Because of the parameter order, this should also be non-breaking; but if we wanted to change the parameter order to make it more conventional ([header, index, acc, rowNumber] is a bit unconventional), it would be a breaking change.

pokoli commented 2 years ago

@theLAZYmd I do not mind breaking the API; we just publish a new major version and it's fixed. But I really care about the documentation, so please include it in the PR if you are going to work on this. We also require tests to ensure the behaviour is right.

I would also like to see a valid use case for it (I do not want to add more complexity without a reason). The example mentioned above should be handled by activating the comments flag.

theLAZYmd commented 2 years ago

OK. My judgement is that the non-breaking formulation of the function is so inoffensive to use that it is not worth breaking the API just to implement it the "ideal" way. Understood about documentation; editing that now for the PR. I wrote one test, which is passing (the same number as the existing transformHeader tests). No other tests are broken.

theLAZYmd commented 2 years ago

Use case: it's fairly common for publishers of data (especially financial data) to have multiple lines as their headers, usually because the CSV data is exported from an Excel file. Please see the attached sheet as an example, which is data from ICAP. Adding support for transforming multiple lines into a consistent header using a custom parser is basically:

theLAZYmd commented 2 years ago

Proposed in https://github.com/mholt/PapaParse/pull/898 :) Hope to see it implemented soon! If so, we also need to update https://github.com/DefinitelyTyped/DefinitelyTyped/tree/master/types/papaparse (how are these generated, by the way?)

theLAZYmd commented 2 years ago

Just to mention, I've started using this patch in my production code with quite pleasing results!