gnieh / fs2-data

streaming data parsing and transformation library
https://fs2-data.gnieh.org
Apache License 2.0

re-engineering of fs2-data-csv #611

Open mberndt123 opened 4 months ago

mberndt123 commented 4 months ago

This PR is meant as a basis for discussion of the kinds of changes that I would like to see in fs2-data-csv 2.0.

My goals are:

  • a simpler API that is easier to understand. By this I primarily mean removing many of the type parameters
  • a more compositional API. Users shouldn't need to implement traits like CsvRowEncoder themselves. Instead, there should be a few small, builtin primitive instances and a suitable set of combinators to combine these into larger ones. Ideally CsvRowDecoder & friends would be sealed
  • I think we can achieve better performance. It isn't necessary to build a CsvRow with field names every time we parse a row, the fields can be accessed by index. NonEmptyList is also an inefficient representation and we can use Array instead. Given the new sealed, compositional API, we probably don't even need to expose the CsvRow representation to the user at all, giving us much more flexibility for future evolution of the library
  • derivation should be usable with Scala 3's derives keyword
  • fail-fast for header-based CSV decoding. If a necessary column is missing, the decoder should fail immediately when parsing the header, not later when parsing the individual rows
  • for the CsvRowEncoder, it should be possible to generate the header without having any actual data available. The current model is problematic when serializing algebraic data types (sealed traits)

Here's what I've done so far:

All of this is nowhere near complete. For example, I think the Applicative instance for RowDecoder can be made more compositional, and I'll be happy to discuss that in detail later. There is also no Monad instance for CsvRowDecoder yet, which is due to the fact that I'd like to completely process the CSV header up-front and never look at it again while parsing the CSV rows. It's possible to do this using Selective Applicative Functors.
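To make the "process the header up-front" idea concrete (without going into Selective functors), here is a minimal, dependency-free sketch; `HeaderDecoder` and `column` are hypothetical names for illustration, not fs2-data-csv's API:

```scala
// Resolve column names to indices once, at header time; afterwards every
// row is parsed purely by index, and a missing column fails immediately.
final case class HeaderDecoder[A](
    run: List[String] => Either[String, List[String] => Either[String, A]]
)

def column(name: String): HeaderDecoder[String] =
  HeaderDecoder { headers =>
    headers.indexOf(name) match {
      case -1 => Left(s"missing column: $name") // fail fast at header time
      case i  => Right(row => row.lift(i).toRight(s"row too short for column $name"))
    }
  }
```

Running `column("name").run(headers)` either fails right away or yields a plain row-parsing function that never consults the header again.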

Alright, that should be enough to get the discussion started.

ybasket commented 4 months ago

@mberndt123 Thanks for the PR! I've now had a closer look at your proposal, see the inline comments below. Overall, there are some interesting ideas, but also some points where I disagree or am unsure about the reasoning. Let's discuss :)

  • a simpler API that is easier to understand. By this I primarily mean removing many of the type parameters

The type params were there for reasons. Your changes fix the header type to String, which is less flexible than the current design that allows headers to be any type that can be handled by ParseableHeader. Whether that's a capability we need can be discussed, but I'd argue that the simplicity win is not that big, because users who wanted a simple "read this CSV to a stream of case class instances" or the like didn't have to care about the header type param so far; derivation just did its thing for String headers.

  • a more compositional API. Users shouldn't need to implement traits like CsvRowEncoder themselves. Instead, there should be a few small, builtin primitive instances and a suitable set of combinators to combine these into larger ones. Ideally CsvRowDecoder & friends would be sealed

What benefit do you see by sealing them? I can agree to the part of providing nice combinators to create instances, but fail to see the gain of forbidding users to write their own instances if they have special needs.
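For concreteness, the combinator part of the proposal might look like the following dependency-free sketch; `cell`, `int` and `map2` are made-up names, not fs2-data-csv's API:

```scala
// A tiny decoder algebra: a few primitives plus a combinator suffice to
// build row decoders for arbitrary case classes without implementing the
// trait by hand.
trait RowDecoder[A] { self =>
  def decode(row: List[String]): Either[String, A]
  def map[B](f: A => B): RowDecoder[B] =
    row => self.decode(row).map(f)
}

object RowDecoder {
  // primitive: read the cell at a fixed index
  def cell(i: Int): RowDecoder[String] =
    row => row.lift(i).toRight(s"missing cell $i")

  def int(i: Int): RowDecoder[Int] =
    cell(i).map(_.toInt) // a real library would catch NumberFormatException

  // combinator: run two decoders on the same row and combine their results
  def map2[A, B, C](da: RowDecoder[A], db: RowDecoder[B])(f: (A, B) => C): RowDecoder[C] =
    row => for { a <- da.decode(row); b <- db.decode(row) } yield f(a, b)
}

final case class Person(name: String, age: Int)

// users compose primitives instead of implementing the trait themselves
val person: RowDecoder[Person] =
  RowDecoder.map2(RowDecoder.cell(0), RowDecoder.int(1))(Person.apply)
```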

  • I think we can achieve better performance. It isn't necessary to build a CsvRow with field names every time we parse a row, the fields can be accessed by index. NonEmptyList is also an inefficient representation and we can use Array instead. Given the new sealed, compositional API, we probably don't even need to expose the CsvRow representation to the user at all, giving us much more flexibility for future evolution of the library

NonEmptyList is as efficient as List, which the PR uses, and has the significant advantage of being more correct in terms of semantics (an empty row makes no sense in CSV). fs2-data focuses on correctness and maintainable code; performance is optimised second, even if that makes for somewhat slower results. If Array can be a pure implementation detail, we can consider it, but usually the required ClassTag makes this impossible or costly.

  • derivation should be usable with Scala 3's derives keyword

👍 Agreed. It's on my wishlist for 2.0, too.

  • fail-fast for header-based CSV decoding. If a necessary column is missing, the decoder should fail immediately when parsing the header, not later when parsing the individual rows

I have to think more about this one. Many use cases won't notice a difference, as failing on the headers instead of on the first row changes little. It can also be limiting; imagine the following dummy example:

enum Data {
  case A(foo: String) // we could also have some discriminator field in both cases, not just distinct fields
  case B(bar: Int)
}

If a CSV only contained a column foo, the current approach would parse all rows just fine as Data.A. If the CSV decoder instance had to state upfront which columns are required, what would you imagine this statement to be? And could the file still be parsed? Also keep in mind that columns can be purely optional as of now; as long as the decoder knows how to handle this (case class default values, for example), this is just fine.

  • for the CsvRowEncoder, it should be possible to generate the header without having any actual data available. The current model is problematic when serializing algebraic data types (sealed traits)

Yeah, sounds reasonable. Would you mind elaborating on your statement on ADTs though?

mberndt123 commented 3 months ago

Hi @ybasket, thank you for your review, and apologies for how long it has taken me to respond.

Your changes fix the header type to String which is less flexible than the current design

The choice of String isn't some arbitrary restriction, it is inherent in the domain we're working in. CSV headers are strings, and so are the names of case class fields, which is why the typeclass derivation stuff can only generate the typeclass instances for CSV files with String headers. That makes String special enough that generalizing over it needs some justification, and I just can't see that – I can't for the life of me think of a scenario where I would need CSV headers of any type other than String. OTOH it does make the signatures quite a bit harder to read, so I just don't think it's worth it. The only other library I know that does something similar is Circe, but I would argue that's a different case because JSON objects are also used as a representation of Map, and it is common to have Maps with non-String keys. But CSV can't really represent a Map because the rows are a fixed size.

What benefit do you see by sealing them? I can agree to the part of providing nice combinators to create instances, but fail to see the gain of forbidding users to write their own instances if they have special needs.

It exposes what could be completely opaque implementation details to users, like the representation of a Row. For instance, if you change the type from NonEmptyList to Array (because List is a terrible data structure in terms of performance), that is an incompatible change right now. If you use a sealed trait (and make the methods private), you're free to change this because the user will only be able to create instances using restricted factory methods that are easy to keep compatible. It also makes it easier to discover the best way to use the library, which is building larger parsers from smaller ones using combinators. I have seen many instances where people were manually implementing interfaces (e.g. io.circe.Decoder) when a much better way would have been to use suitable combinators. A good library interface is one that steers users towards the best way of using it.
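As a sketch of this argument (all names hypothetical, not the actual fs2-data-csv API):

```scala
// Because the trait is sealed, instances can only be obtained through the
// companion's factory methods, so the internal row representation (here
// List[String]) could later change to Array without breaking user code.
sealed trait RowEncoder[A] {
  def encode(a: A): List[String] // could become Array[String] in a later version
}

object RowEncoder {
  // restricted factory method: the only way users obtain an instance
  def fromFields[A](fields: (A => String)*): RowEncoder[A] =
    new RowEncoder[A] {
      def encode(a: A): List[String] = fields.iterator.map(f => f(a)).toList
    }
}

final case class User(name: String, age: Int)

val userEncoder: RowEncoder[User] =
  RowEncoder.fromFields[User](_.name, _.age.toString)
```

Users never see how a row is stored; they only ever describe which fields to write, which is the compatibility surface that stays stable.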

NonEmptyList is as efficient as List which the PR uses and has the significant advantage of being more correct in terms of semantics (an empty row makes no sense in CSV).

An empty CSV row makes as much sense as the number 0, the type Nothing, the identity function etc.: it serves as a neutral element of composition, which is why it is required to make RowEncoder an instance of ContravariantMonoidal. The only benefit of a NonEmptyList is that you can call methods like head or reduce on it, and this isn't done anywhere in the fs2-data-csv codebase, which leads me to believe it's not very useful. I'd rather have the compositionality that a ContravariantMonoidal instance gives me.
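A minimal sketch of this point, mirroring the shape of cats' ContravariantMonoidal without the dependency (`Enc`, `unit` and `product` are illustrative names):

```scala
// An encoder writes a value to a row fragment. The Unit encoder writes
// zero cells and is the neutral element of product composition.
trait Enc[A] { def encode(a: A): List[String] }

object Enc {
  // the neutral element: encodes Unit to an empty row fragment
  val unit: Enc[Unit] = _ => Nil
  // composition: write both parts of a pair side by side
  def product[A, B](ea: Enc[A], eb: Enc[B]): Enc[(A, B)] =
    p => ea.encode(p._1) ++ eb.encode(p._2)
  val string: Enc[String] = s => List(s)
}
```

Combining any encoder with `unit` adds no cells, which is exactly the identity law that requires the empty row to be representable.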

If Array can be a pure implementation detail, we can consider it, but usually the required ClassTag makes this impossible/come at a cost.

ClassTag is only required when creating new Arrays of a generic type, therefore I don't think it's going to be an issue.
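A small illustration of that point (the helper names are made up): if rows are always `Array[String]` internally, no ClassTag ever surfaces in the API.

```scala
import scala.reflect.ClassTag

// Consuming an existing array is fully generic and needs no ClassTag:
def firstCell[A](row: Array[A]): A = row(0)

// Only *allocating* an array of a generic element type needs one:
def singleton[A: ClassTag](a: A): Array[A] = Array(a)

// And allocating an array of a concrete type never does:
val emptyRow: Array[String] = new Array[String](2)
```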

If a CSV only contained a column foo, the current approach would parse all rows just fine as Data.A. If the CSV decoder instance had to state upfront which columns are required, what would you imagine this statement to be? And could the file still be parsed?

If that's a requirement, then it wouldn't be hard to extend the CsvRowDecoder interface a bit:

sealed trait CsvRowDecoder[T] {
  def apply(headers: List[String], delayErrors: Boolean): Either[DecoderError, RowDecoder[T]]
}

If delayErrors is true, then apply would always return a Right, even if the required headers can't be found, but the RowDecoder inside the Right would always fail with an error. Note that this is another change that can only be made in a backwards compatible way if the implementation details of CsvRowDecoder are hidden and only available through a more restrictive API.
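A sketch of the proposed semantics, using a minimal stand-in `RowDecoder` (all names are illustrative, not the library's API):

```scala
// Minimal stand-ins for illustration
final case class DecoderError(msg: String)
trait RowDecoder[T] { def decode(row: List[String]): Either[DecoderError, T] }

// Resolve the header with a header-inspecting constructor `make`.
// With delayErrors = true, a header failure is postponed: we still get a
// Right, but the decoder inside it fails on every row it is given.
def resolveHeader[T](headers: List[String], delayErrors: Boolean)(
    make: List[String] => Either[DecoderError, RowDecoder[T]]
): Either[DecoderError, RowDecoder[T]] =
  make(headers) match {
    case Left(err) if delayErrors => Right(_ => Left(err))
    case other                    => other
  }
```

With `delayErrors = false` the same `make` surfaces the missing-header error immediately, giving both behaviours from one interface.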

Yeah, sounds reasonable. Would you mind elaborating on your statement on ADTs though?

It should probably be possible to compose the CsvRowEncoder of a sealed trait from the CsvRowEncoder instances of the individual case classes. But if you encode a case class using a CsvRowEncoder in the current model, then I think it's not going to have all the column names of the other case classes' fields, right? I haven't used fs2-data-csv for any encoding or sealed traits, so I'm not 100% sure on this one and need to think it through better.
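A hypothetical illustration of the problem being described here (plain Scala, no library API):

```scala
// Each case on its own would produce different headers, so an encoder for
// the whole sealed trait has to emit the union of all columns and leave
// the ones a given case doesn't have blank.
sealed trait Data
final case class A(foo: String) extends Data
final case class B(bar: Int) extends Data

// the header is known statically, without any Data value at hand
val headers: List[String] = List("foo", "bar")

def encodeRow(d: Data): List[String] = d match {
  case A(foo) => List(foo, "")          // bar column left blank
  case B(bar) => List("", bar.toString) // foo column left blank
}
```

This is exactly what an encoder derived per case class cannot know on its own: A's encoder has no way to learn that a bar column exists elsewhere in the hierarchy.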

Unfortunately I'm somewhat short on time right now, so I'm not sure how much time I'm going to be able to invest here… But maybe I was able to provide some food for thought.