mafintosh / csv-parser

Streaming csv parser inspired by binary-csv that aims to be faster than everyone else

Option 'reduceValues' to give users control over the object constructed for each row #174


AaronHarris commented 4 years ago

Feature Proposal

Provide a new option in the options object that lets users control how the output object for each row is constructed from each cell. Proposed documentation below:

reduceValues

Type: Function

A function that can be used to modify the object that is emitted by the stream, similar to a reducer function. The return value replaces the object used to accumulate the columns in the row. If null is returned, the rest of the row is skipped (similar to mapHeaders).

csv({
  // Only fill in a column's value when the accumulated value is still empty
  reduceValues: ({ memo, header, index, value }) =>
    memo[header] === '' ? ((memo[header] = value), memo) : memo
})

Parameters

memo (Object or any): The current object representing the accumulated values in the row.
header (String): The current column header.
index (Number): The current column index.
value (String or any): The current column value (or content).

If both mapValues and reduceValues functions are provided, mapValues is run first, and the output is provided in the value parameter of the reduceValues function.
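To make that interaction concrete, here is a minimal sketch of both options in a pipeline, assuming the proposed semantics above. reduceValues does not exist in csv-parser today, and the file name and the trimming transform are just placeholders:

const csv = require('csv-parser')
const fs = require('fs')

fs.createReadStream('data.csv') // hypothetical input file
  .pipe(csv({
    // Existing option: runs first; here it trims whitespace from every cell
    mapValues: ({ header, index, value }) => value.trim(),
    // Proposed option: receives the already-trimmed value and decides how it
    // is written onto the accumulator object for the current row
    reduceValues: ({ memo, header, index, value }) => {
      memo[header] = value === '' ? null : value // store empty cells as null
      return memo
    }
  }))
  .on('data', (row) => console.log(row))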

Feature Use Case

This feature gives the developer more control over constructing the object that is passed through the stream, or over skipping a row based on the values encountered in it. Specific example use cases are discussed below.

The goal of this proposal is to allow developers more freedom over the deserialization without increasing maintenance burden for the maintainers.

Why not just make the user transform the object downstream of the CSV parser? After all, this is only a CSV parser!

Moving the object-reducing logic upstream helps maintain the fast speed this library is known for, and lets users build their own workarounds for issues like #150. If a user has more control over skipping a row, or can stop deserializing the rest of a row, there are performance advantages, especially with very wide data sets or ones where many rows are skipped. In addition, some behaviors, such as handling duplicate header names, cannot be addressed downstream.
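As a sketch of those two cases under the proposed option (the 'status' column and the 'deleted' sentinel are hypothetical), the reducer below collects duplicate header names into an array instead of overwriting them, and returns null to stop deserializing the rest of a flagged row:

csv({
  reduceValues: ({ memo, header, value }) => {
    // Skip the remainder of any row flagged as deleted (hypothetical sentinel)
    if (header === 'status' && value === 'deleted') return null

    // Duplicate header names: accumulate values into an array instead of
    // letting later columns silently overwrite earlier ones
    if (header in memo) {
      memo[header] = Array.isArray(memo[header])
        ? memo[header].concat(value)
        : [memo[header], value]
    } else {
      memo[header] = value
    }
    return memo
  }
})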