haskell-hvr / cassava

A CSV parsing and encoding library optimized for ease of use and high performance
http://hackage.haskell.org/package/cassava
BSD 3-Clause "New" or "Revised" License

Per-line parser reporting per-line errors #210

Open MaxGabriel opened 2 years ago

MaxGabriel commented 2 years ago

Right now, the functions in Data.Csv generally return Either String (Vector a) or a similar variant, which is great in most cases.

To get errors on a per-line basis, however, one needs to use the functions in Data.Csv.Incremental. This is unfortunate because that module is significantly more complex, since it also supports interleaving IO and incrementally feeding data to the parser.

I think adding convenience functions for taking a ByteString and returning Vector (Either String a), without going through the Incremental functions, could be a good addition. The main use case I have in mind for those is providing better error messages for user-provided CSVs.
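To make the request concrete, here is a minimal sketch of what such a convenience function could look like when built on top of Data.Csv.Incremental. The name `decodePerRecord` is hypothetical and not part of cassava. As far as I understand the incremental API, the per-record `Left` values only cover conversion failures (`FromRecord`); a syntax error in the CSV itself still aborts the whole parse, which is why the sketch keeps an outer `Either`.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as B
import Data.Csv (FromRecord, HasHeader (..))
import qualified Data.Csv.Incremental as I
import Data.Vector (Vector)
import qualified Data.Vector as V

-- Hypothetical helper (not part of cassava): feed the whole strict
-- ByteString to the incremental parser, then signal end of input.
-- Outer Left: the CSV could not be parsed at all.
-- Inner Left: one record failed type conversion.
decodePerRecord
  :: FromRecord a
  => HasHeader
  -> B.ByteString
  -> Either String (Vector (Either String a))
decodePerRecord hasHeader input =
  V.fromList <$> go (I.decode hasHeader) [input]
  where
    go (I.Fail _ err) _ = Left err
    go (I.Done rs) _ = Right rs
    go (I.Many rs k) chunks = case chunks of
      (c : cs) -> (rs ++) <$> go (k c) cs
      -- An empty chunk tells the parser that the input is finished.
      [] -> (rs ++) <$> go (k B.empty) []

main :: IO ()
main = do
  let input = "name,1\nother,oops\nthird,3\n"
      result =
        decodePerRecord NoHeader input
          :: Either String (Vector (Either String (String, Int)))
  either (putStrLn . ("parse failure: " ++)) (mapM_ print) result
```

With a wrapper like this, the caller never sees the `Many`/`Done`/`Fail` states, which is most of the complexity mentioned above.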

andreasabel commented 2 years ago

> I think adding convenience functions for taking a ByteString and returning Vector (Either String a),

Yeah, that is conceivable. The current parser (e.g. the one for input without a header), https://github.com/haskell-hvr/cassava/blob/c821c8366ac4ce4ee3929e315ba37e694ad56f04/src/Data/Csv/Parser.hs#L69-L75 , uses the parser modifier sepByEndOfLine': https://github.com/haskell-hvr/cassava/blob/c821c8366ac4ce4ee3929e315ba37e694ad56f04/src/Data/Csv/Parser.hs#L92-L105 . So it is just a single parse with a single error returned. If you want per-line parsing, you will first have to split the input into lines using a variant of sepByEndOfLine' and then map a line parser over the result.
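The split-then-map idea could be sketched roughly as follows. This is not a variant of sepByEndOfLine' itself: `splitRecords`, `decodeLine` and `decodeLines` are hypothetical helpers written outside cassava, the splitter is a hand-rolled scan that only treats `\n` (optionally preceded by `\r`) outside double quotes as a record boundary, and each line is then handed to the ordinary `Data.Csv.decode`.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL
import Data.Csv (FromRecord, HasHeader (NoHeader), decode)
import Data.Vector (Vector)
import qualified Data.Vector as V

-- Split the input into records at newlines that are not inside quotes.
splitRecords :: BC.ByteString -> [BC.ByteString]
splitRecords bs
  | BC.null bs = []
  | otherwise =
      let (line, rest) = breakRecord bs
       in stripCR line : splitRecords rest

-- Scan for the first unquoted '\n'. An escaped quote ("") toggles the
-- flag twice, so the scan is still inside the quoted field afterwards.
breakRecord :: BC.ByteString -> (BC.ByteString, BC.ByteString)
breakRecord bs = go False 0
  where
    n = BC.length bs
    go inQuotes i
      | i >= n = (bs, BC.empty)
      | otherwise = case BC.index bs i of
          '"' -> go (not inQuotes) (i + 1)
          '\n' | not inQuotes -> (BC.take i bs, BC.drop (i + 1) bs)
          _ -> go inQuotes (i + 1)

-- Drop a trailing '\r' left over from a "\r\n" line ending.
stripCR :: BC.ByteString -> BC.ByteString
stripCR l
  | not (BC.null l) && BC.last l == '\r' = BC.init l
  | otherwise = l

-- Parse and convert a single line with the regular cassava decoder.
decodeLine :: FromRecord a => BC.ByteString -> Either String a
decodeLine line = do
  rows <- decode NoHeader (BL.fromStrict line)
  case V.toList rows of
    [row] -> Right row
    _ -> Left "expected exactly one record"

-- Blank lines are skipped; every other line yields its own result.
decodeLines :: FromRecord a => BC.ByteString -> Vector (Either String a)
decodeLines =
  V.fromList . map decodeLine . filter (not . BC.null) . splitRecords

main :: IO ()
main =
  mapM_ print
    (decodeLines "a,1\r\nb,x\n\n\"c\nd\",3"
       :: Vector (Either String (String, Int)))
```

One limitation of this sketch: an unterminated quote swallows the rest of the input into a single "line", so everything after it is reported as one failure rather than line by line.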

PR welcome, if it comes with benchmarks comparing the performance of a single parse versus a per-line parse.
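A starting point for such a comparison might look like the sketch below, assuming the criterion package is used as the benchmark harness. The per-line side is deliberately naive: `perLineParse` is a hypothetical stand-in that splits with `lines` (so it assumes no quoted newlines) and runs a full `decode` on every line. The synthetic two-column input is also an assumption; real benchmarks would want representative data.

```haskell
import Criterion.Main (bench, bgroup, defaultMain, nf)
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Csv (HasHeader (NoHeader), decode)
import Data.Vector (Vector)

type Row = (Int, Int)

-- Baseline: one parse over the whole input, one error for everything.
singleParse :: BLC.ByteString -> Either String (Vector Row)
singleParse = decode NoHeader

-- Naive per-line variant: split on '\n', then decode each line on its own.
perLineParse :: BLC.ByteString -> [Either String (Vector Row)]
perLineParse = map (decode NoHeader) . BLC.lines

-- Synthetic input: 10000 rows of two integer columns.
sample :: BLC.ByteString
sample =
  BLC.unlines
    [ BLC.pack (show i ++ "," ++ show (i * 2))
    | i <- [1 .. 10000 :: Int]
    ]

main :: IO ()
main =
  defaultMain
    [ bgroup
        "cassava"
        [ bench "single parse" (nf singleParse sample)
        , bench "per-line parse" (nf perLineParse sample)
        ]
    ]
```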

MaxGabriel commented 2 years ago

OK, I'm not interested in the performance implications of this, so I'm just going to close this.

andreasabel commented 2 years ago

Let's leave it open, maybe others are interested.