mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License

[Idea/Suggestion] Improved support for "header" lines in input files #612

Open Eric24 opened 5 years ago

Eric24 commented 5 years ago

A rather common CSV format found in things like web server logs is to have one or more "header" lines at the beginning of the file, like this:

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields
2019-01-04  11:37:39    ORD50-C1    1030    0.0.0.0 GET xxxxxxxxxx.domain.net   /   200 -   Mozilla/5.0%2520(Windows%2520NT%252010.0;%2520Win64;%2520x64)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/71.0.3578.98%2520Safari/537.36  -   -   Test    qiW8xjO4CoRnsrW843xPY4fE_sKPZp7YYgpeskJhd3N9ZkArsmUvrw==    host.domain.com https   283 0.188   -   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss    HTTP/2.0    -   -

Obviously this library can't be expected to deal with every such header format, but I think this could be solved fairly simply by adding two props to options: a prefix that identifies such header lines, and a function (maybe called something like parseHeaders). The idea is to let a custom function look at the first n lines of the file until it figures out the headers and returns them.

When the library starts processing a file, it's in a "seeking header information" state and not parsing data. Every line that starts with the specified prefix is passed, in its entirety, to the parseHeaders function. That function can either return a falsy value or a hash or array that contains the headers. Once the headers are returned, the library moves to a "seeking data" state (but it continues passing any line identified by the prefix to this function, in case other such header lines appear after the one that contains the column headers).
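A minimal sketch of the state machine described above, written as a standalone function rather than as PapaParse internals (headerPrefix and parseHeaders are the proposed, hypothetical names; neither exists in the current API):

```javascript
// Sketch of the proposed "seeking header information" behaviour.
// parseHeaders receives each prefixed line and may return the headers.
function extractHeaders(lines, headerPrefix, parseHeaders) {
  let headers = null;
  const dataLines = [];
  for (const line of lines) {
    if (line.startsWith(headerPrefix)) {
      // Every prefixed line is handed to parseHeaders, even after
      // headers are found, in case later header lines matter too.
      const result = parseHeaders(line);
      if (result && !headers) headers = result;
    } else if (headers) {
      dataLines.push(line); // "seeking data" state: keep real rows
    }
  }
  return { headers, dataLines };
}

// Example: pull column names out of a W3C-style "#Fields:" line.
const { headers } = extractHeaders(
  ['#Version: 1.0', '#Fields: date time sc-bytes', '2019-01-04 11:37:39 1030'],
  '#',
  line => line.startsWith('#Fields:')
    ? line.slice('#Fields:'.length).trim().split(/\s+/)
    : null
);
// headers → ['date', 'time', 'sc-bytes']
```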

In any case, the general idea of getting the column headers from the file is already there, but it's currently limited to looking for headers on the first non-comment/non-blank row, and they must follow the same format as the data with regard to delimiters, etc. This concept allows extensible support for more complex header formats, including cases where the column headers are not itemized in the file at all but are instead identified by some sort of constant, etc.

I realize this is somewhat similar to #186, but in the case of processing a stream (i.e. in Node), you have no control over the chunk size and can therefore not rely on the entire header being in one chunk. This concept works at the line level, after line-level parsing.

dboskovic commented 5 years ago

I think what might work is some sort of findHeaders callback that provides the reader instance and allows a user to write some custom logic that would extract and return the header along with an index for where "data starts". This way we're not trying to add support for specific edge cases, but allowing end developers to add some logic of their own where needed.

Eric24 commented 5 years ago

Interesting. I definitely agree that this should be as "generic" as possible, so it can support anything at the start of a file, even if it's only to ignore it. With what you describe, at what point would the custom logic get access to the file data, and in what format? Currently, I'm handling this case with a separate transform function between the readstream and the parser. It's a little gross, since it must "accumulate" chunk data until it has read enough of the file to reach the column headers.

By the way, a headers property on the parser could be an alternative solution. For headerless files, it would allow the user to specify headers statically. Potentially, both features could be combined with the existing header property (false = no headers; true = get headers from the first line of the file; array = static list of headers; function = get headers with the findHeaders function). Everything else works as-is, including transformHeader, which, if supplied, would still receive the individual headers regardless of where they came from.
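The combined header option described above is a simple type dispatch. A sketch (the array and function forms are hypothetical extensions, not part of current PapaParse):

```javascript
// Sketch of the proposed combined `header` option:
//   false    → headerless file
//   true     → headers are on the first line (current behaviour)
//   array    → static list of headers
//   function → custom logic produces the headers
function resolveHeaders(option, firstLine) {
  if (option === false) return null;
  if (option === true) return firstLine.split(',');
  if (Array.isArray(option)) return option;
  if (typeof option === 'function') return option(firstLine);
  throw new TypeError('unsupported header option');
}
```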

dboskovic commented 5 years ago

Yeah, that's a legitimate point. It'd have to be a function that gets the raw FileReader, or I think it could get corrupted pretty easily by other parsing options.

#609 suggests the headers option for the unparse functionality. Might make sense to consider it for parse as well.

Eric24 commented 5 years ago

Another thought on this topic: perhaps a more generic way to handle this, which opens up some other possibilities as well, is an option for a function that simply gets a "look" at each line as it's received, but before any other parsing is done. Something like transformHeader, but transformLine. The function can then return the line as-is as a string (essentially the default behavior), return a manipulated version of it as a string, return null (to ignore it), or explicitly set the headers by returning an array. This would solve the "headers in strange formats" issue, as well as allowing "whole line" manipulation for other purposes. Support for headers as an array is still needed (both for headerless files and for the transformLine function), but it would eliminate the need for anything like the findHeaders function and make the whole thing more generic.
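Applied line by line, the proposed contract (transformLine is a hypothetical option name, not in the PapaParse API) could behave like this:

```javascript
// Sketch of the proposed transformLine semantics: return a string to
// keep or replace the line, null to drop it, or an array to set the
// headers explicitly.
function applyTransformLine(lines, transformLine) {
  let headers = null;
  const kept = [];
  for (const line of lines) {
    const result = transformLine(line);
    if (result === null) continue;   // ignore this line
    if (Array.isArray(result)) {     // explicit header assignment
      headers = result;
      continue;
    }
    kept.push(result);               // original or manipulated line
  }
  return { headers, lines: kept };
}

// Example: turn a "#Fields:" line into headers, drop other comments.
const out = applyTransformLine(
  ['#Version: 1.0', '#Fields: date time', '2019-01-04 11:37:39'],
  line => line.startsWith('#Fields:')
    ? line.slice('#Fields:'.length).trim().split(/\s+/)
    : line.startsWith('#') ? null
    : line
);
// out.headers → ['date', 'time']; out.lines → ['2019-01-04 11:37:39']
```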

theLAZYmd commented 2 years ago

3 years on, this issue is still salient

pokoli commented 2 years ago

@theLAZYmd feel free to propose a solution for it

theLAZYmd commented 2 years ago

I'll draft a PR, but I think it should be possible to provide a config.headerLines option and modify the config.transformHeader function to take two more params: acc, the existing header for that column, and j, the row number. The function then uses those to generate the new header as it iterates over the rows in a reducer fashion.
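The multi-line reduction described above might look like the following sketch (the four-argument callback signature is the proposal from this thread, not the released PapaParse API):

```javascript
// Sketch of the proposed reducer-style transformHeader: for each header
// row j, the callback receives the cell value, its column index, the
// accumulated header for that column so far (acc), and the row number.
function reduceHeaderRows(headerRows, transformHeader) {
  let acc = new Array(headerRows[0].length).fill('');
  headerRows.forEach((row, j) => {
    acc = row.map((header, index) =>
      transformHeader(header, index, acc[index], j));
  });
  return acc;
}

// Example: join two header rows ("Price" over "USD" → "Price USD").
const joined = reduceHeaderRows(
  [['Price', 'Volume'], ['USD', 'Shares']],
  (header, index, acc, j) => (j === 0 ? header : `${acc} ${header}`)
);
// joined → ['Price USD', 'Volume Shares']
```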


theLAZYmd commented 2 years ago

I'm happy to write the code and make the PR, but can someone make further commits to the PR to update the documentation? Because of the parameter order, this should also be non-breaking; but if we wanted to change the parameter order to make it more conventional ([header, index, acc, rowNumber] is a bit unconventional), it would be a breaking change.

pokoli commented 2 years ago

@theLAZYmd I do not mind breaking the API; we just publish a new major version and it's fixed. But I really care about the documentation, so please include it in the PR if you are going to work on this. We also require tests to ensure the behaviour is right.

I would also like to see a valid use case for it (I do not want to add more complexity without a reason). The example mentioned above should be handled by activating the comments flag.

theLAZYmd commented 2 years ago

OK. My judgement is that the non-breaking formulation of the function is so inoffensive to use that it is not worth breaking the API just to implement it the "ideal" way. Understood about documentation; editing that now for the PR. I wrote one test, which is passing (the same number as the existing transformHeader tests). No other tests are broken.

theLAZYmd commented 2 years ago

Use case: it's fairly common for publishers of data (especially financial data) to have multiple lines as their headers, usually because the CSV data is exported from an Excel file. Please see the attached sheet as an example, which is data from ICAP. Adding support for transforming multiple lines into a consistent header using a custom parser is basically:

theLAZYmd commented 2 years ago

Proposed in https://github.com/mholt/PapaParse/pull/898 :) Hope to see it implemented soon! If so, we also need to update https://github.com/DefinitelyTyped/DefinitelyTyped/tree/master/types/papaparse (how are these generated, by the way?)

theLAZYmd commented 2 years ago

Just to mention, I've started using this patch in my production code with quite pleasing results!