mberndt123 opened 4 months ago
@mberndt123 Thanks for the PR! I've now had a closer look at your proposal, see inline comments below. Overall, there are some interesting ideas, but also some points where I disagree or am unsure about the reasoning. Let's discuss :)
> a simpler API that is easier to understand. By this I primarily mean removing many of the type parameters

The type params were there for reasons. Your changes fix the header type to `String`, which is less flexible than the current design that allows headers to be any type that can be handled by a `ParseableHeader`. Whether that's a capability we need can be discussed, but I'd argue the simplicity win is not that big: users who wanted a simple "read this CSV into a stream of case class instances" or alike didn't have to care about the header type param so far, derivation just did its thing for `String` headers.
> a more compositional API. Users shouldn't need to implement traits like `CsvRowEncoder` themselves. Instead, there should be a few small, builtin primitive instances and a suitable set of combinators to combine these into larger ones. Ideally `CsvRowDecoder` & friends would be `sealed`

What benefit do you see in sealing them? I can agree with providing nice combinators to create instances, but I fail to see the gain of forbidding users to write their own instances if they have special needs.
> I think we can achieve better performance. It isn't necessary to build a `CsvRow` with field names every time we parse a row, the fields can be accessed by index. `NonEmptyList` is also an inefficient representation and we can use `Array` instead. Given the new sealed, compositional API, we probably don't even need to expose the `CsvRow` representation to the user at all, giving us much more flexibility for future evolution of the library

`NonEmptyList` is as efficient as `List` (which the PR uses) and has the significant advantage of being more correct in terms of semantics: an empty row makes no sense in CSV. fs2-data focuses on correctness and maintainable code first; performance is optimised second, even if that makes for slightly slower results. If `Array` can be a pure implementation detail, we can consider it, but usually the required `ClassTag` makes this impossible or costly.
> derivation should be usable with Scala 3's `derives` keyword

👍 Agreed. It's on my wishlist for 2.0, too.
> fail-fast for header-based CSV decoding. If a necessary column is missing, the decoder should fail immediately when parsing the header, not later when parsing the individual rows

I have to think more about this one. Many use cases won't experience a difference, as failing on the headers instead of on the first row is a small change. It is also limiting, though – imagine the following dummy example:
```scala
enum Data {
  case A(foo: String) // we could also have some discriminator field in both cases, not just distinct fields
  case B(bar: Int)
}
```
If a CSV only contained a column `foo`, the current approach would parse all rows just fine as `Data.A`. If the CSV decoder instance had to state upfront which columns are required, what would you imagine this statement to be? And could the file still be parsed?

Also keep in mind that columns can currently be purely optional: as long as the decoder knows how to handle this (case class default values, for example), this is just fine.
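For concreteness, here's a rough, dependency-free sketch (hypothetical names, not the actual fs2-data-csv API) of the per-row fallback behaviour described above: each row is tried as `Data.A` first and falls back to `Data.B`, so a file with only a `foo` column still decodes every row.

```scala
enum Data:
  case A(foo: String)
  case B(bar: Int)

// resolve a cell by header name; try decoding as A first, then fall back to B
def decodeRow(headers: List[String], row: List[String]): Either[String, Data] =
  def col(name: String): Option[String] =
    headers.indexOf(name) match
      case -1 => None
      case i  => row.lift(i)
  col("foo").map(Data.A.apply)
    .orElse(col("bar").flatMap(_.toIntOption).map(Data.B.apply))
    .toRight("neither foo nor bar present")

@main def fallbackDemo(): Unit =
  // a CSV with only a foo column still decodes every row, as Data.A
  assert(decodeRow(List("foo"), List("hello")) == Right(Data.A("hello")))
```

Note that this only works because which columns are "required" is decided per row, not once against the header – which is exactly what a fail-fast design would have to account for.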
> for the `CsvRowEncoder`, it should be possible to generate the header without having any actual data available. The current model is problematic when serializing algebraic data types (sealed traits)

Yeah, sounds reasonable. Would you mind elaborating on your statement about ADTs though?
Hi @ybasket, thank you for your review, and apologies for how long it has taken me to respond.
> Your changes fix the header type to `String` which is less flexible than the current design
The choice of `String` isn't some arbitrary restriction, it is inherent in the domain we're working in. CSV headers are strings, and so are the names of case class fields, which is why typeclass derivation can only generate instances for CSV files with `String` headers. That makes `String` special enough that generalizing over it needs some justification, and I can't for the life of me think of a scenario where I would need CSV headers of any type other than `String`. On the other hand, it does make the signatures quite a bit harder to read, so I just don't think it's worth it.
The only other library I know of that does something similar is Circe, but I would argue that's a different case: JSON objects are also used as a representation of `Map`, and it is common to have `Map`s with non-`String` keys. But CSV can't really represent a `Map` because its rows have a fixed size.
> What benefit do you see in sealing them? I can agree with providing nice combinators to create instances, but I fail to see the gain of forbidding users to write their own instances if they have special needs.
It exposes what could be completely opaque implementation details to users, like the representation of a row. For instance, changing the type from `NonEmptyList` to `Array` (because `List` is a terrible data structure in terms of performance) is an incompatible change right now. If you use a sealed trait (and make the methods private), you're free to change this, because users can only create instances through restricted factory methods that are easy to keep compatible.
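A minimal sketch of how that could look (illustrative names only, not the actual fs2-data-csv API): the trait is sealed, its `encode` method is private to the library, and users build instances solely from a primitive plus combinators, leaving the backing representation free to change later.

```scala
object sketch:
  sealed trait RowEncoder[A]:
    self =>
    // internal representation; free to change later (e.g. List -> Array)
    private[sketch] def encode(a: A): List[String]
    // combinator: glue two encoders into one over a pair
    def product[B](eb: RowEncoder[B]): RowEncoder[(A, B)] = new RowEncoder[(A, B)]:
      private[sketch] def encode(ab: (A, B)) = self.encode(ab._1) ++ eb.encode(ab._2)

  object RowEncoder:
    // primitive: a single-cell encoder
    def field[A](f: A => String): RowEncoder[A] = new RowEncoder[A]:
      private[sketch] def encode(a: A) = List(f(a))

  // the only way encoded rows leave the library
  def render[A](a: A)(using e: RowEncoder[A]): String = e.encode(a).mkString(",")

@main def sealedDemo(): Unit =
  // users compose rather than implement
  given sketch.RowEncoder[(String, Int)] =
    sketch.RowEncoder.field[String](identity).product(sketch.RowEncoder.field[Int](_.toString))
  assert(sketch.render(("a", 1)) == "a,1")
```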
Sealing also makes it easier to discover the best way to use the library, which is building larger parsers from smaller ones using combinators. I have seen many instances where people manually implemented interfaces (e.g. `io.circe.Decoder`) when a much better way would have been to use suitable combinators. A good library interface is one that steers users towards the best way of using it.
> `NonEmptyList` is as efficient as `List` (which the PR uses) and has the significant advantage of being more correct in terms of semantics: an empty row makes no sense in CSV.
An empty CSV row makes as much sense as the number 0, the type `Nothing`, the `identity` function etc.: it serves as a neutral element of composition, which is why it is required to make `RowEncoder` an instance of `ContravariantMonoidal`.
The only benefit of a `NonEmptyList` is that you can call methods like `head` or `reduce` on it, and this isn't done anywhere in the fs2-data-csv codebase, which leads me to believe it's not very useful. I'd rather have the compositionality that a `ContravariantMonoidal` instance gives me.
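A dependency-free sketch of what is meant here, mirroring the shape of cats' `ContravariantMonoidal` (`trivial`/unit, `contramap`, `product`); the names are illustrative, not library code. The empty encoder is the identity for `product`, which is exactly why empty rows must be representable:

```scala
// dependency-free sketch of the ContravariantMonoidal shape for encoders
final case class Enc[A](run: A => List[String]):
  def contramap[B](f: B => A): Enc[B] = Enc(b => run(f(b)))
  def product[B](that: Enc[B]): Enc[(A, B)] =
    Enc { case (a, b) => run(a) ++ that.run(b) }

// the "empty row" encoder: the neutral element of product
val unitEnc: Enc[Unit] = Enc(_ => Nil)
val intEnc: Enc[Int]   = Enc(i => List(i.toString))

@main def monoidalDemo(): Unit =
  // composing with the empty encoder adds nothing (identity up to tupling)
  assert(unitEnc.product(intEnc).run(((), 7)) == intEnc.run(7))
```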
> If `Array` can be a pure implementation detail, we can consider it, but usually the required `ClassTag` makes this impossible or costly.
`ClassTag` is only required when creating new `Array`s of a generic type, therefore I don't think it's going to be an issue.
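To illustrate (a standalone sketch, not library code): a `ClassTag` is needed only where an `Array[T]` of a generic `T` is allocated; a row backed by the concrete `Array[String]` needs none.

```scala
import scala.reflect.ClassTag

// allocating an Array of a *generic* element type needs a ClassTag...
def fresh[T: ClassTag](n: Int): Array[T] = new Array[T](n)

// ...but a row backed by the concrete Array[String] does not
final case class Row(cells: Array[String]):
  def apply(i: Int): String = cells(i)

@main def classTagDemo(): Unit =
  assert(Row(Array("a", "b", "c"))(1) == "b")
  assert(fresh[String](2).length == 2)
```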
> If a CSV only contained a column `foo`, the current approach would parse all rows just fine as `Data.A`. If the CSV decoder instance had to state upfront which columns are required, what would you imagine this statement to be? And could the file still be parsed?

If that's a requirement, then it wouldn't be hard to extend the `CsvRowDecoder` interface a bit:

```scala
sealed trait CsvRowDecoder[T] {
  def apply(headers: List[String], delayErrors: Boolean): Either[DecoderError, RowDecoder[T]]
}
```

If `delayErrors` is true, then `apply` would always return a `Right`, even if the required headers can't be found, but the `RowDecoder` inside the `Right` would always fail with an error. Note that this is another change that can only be made in a backwards-compatible way if the implementation details of `CsvRowDecoder` are hidden and only available through a more restrictive API.

> Yeah, sounds reasonable. Would you mind elaborating on your statement about ADTs though?

It should probably be possible to compose the `CsvRowEncoder` of a `sealed trait` from the `CsvRowEncoder` instances of the individual case classes. But if you encode a `case class` using a `CsvRowEncoder` in the current model, then I think it's not going to have all the column names of the other case classes' fields, right? I haven't used fs2-data-csv for any encoding or sealed traits, so I'm not 100% sure on this one and need to think it through better.
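To make the ADT concern concrete, here is a standalone sketch (hypothetical, not the current fs2-data-csv behaviour): if the header were derived from the first encoded value, a mixed file would get a wrong header, whereas an encoder that can state its columns without data can declare the union of all cases' columns and pad the ones a case doesn't have.

```scala
sealed trait Data
final case class A(foo: String) extends Data
final case class B(bar: Int) extends Data

// a header-first encoder declares the union of all cases' columns up front...
val headers: List[String] = List("foo", "bar")

// ...and pads the columns a particular case doesn't have
def encode(d: Data): List[String] = d match
  case A(foo) => List(foo, "")
  case B(bar) => List("", bar.toString)

@main def adtDemo(): Unit =
  assert(encode(A("x")) == List("x", ""))
  assert(encode(B(42)) == List("", "42"))
```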
Unfortunately I'm somewhat short on time right now, so I'm not sure how much time I'm going to be able to invest here… But maybe I was able to provide some food for thought.
This PR is meant as a basis for discussion of the kinds of changes that I would like to see in fs2-data-csv 2.0.

My goals are:

- a simpler API that is easier to understand. By this I primarily mean removing many of the type parameters
- a more compositional API. Users shouldn't need to implement traits like `CsvRowEncoder` themselves. Instead, there should be a few small, builtin primitive instances and a suitable set of combinators to combine these into larger ones. Ideally `CsvRowDecoder` & friends would be `sealed`
- I think we can achieve better performance. It isn't necessary to build a `CsvRow` with field names every time we parse a row, the fields can be accessed by index. `NonEmptyList` is also an inefficient representation and we can use `Array` instead. Given the new sealed, compositional API, we probably don't even need to expose the `CsvRow` representation to the user at all, giving us much more flexibility for future evolution of the library
- derivation should be usable with Scala 3's `derives` keyword
- fail-fast for header-based CSV decoding. If a necessary column is missing, the decoder should fail immediately when parsing the header, not later when parsing the individual rows
- for the `CsvRowEncoder`, it should be possible to generate the header without having any actual data available. The current model is problematic when serializing algebraic data types (sealed traits)

Here's what I've done so far:

- `NonEmptyList` usage is gone – it doesn't bring any real benefits, but prevents us from writing a correct `ContravariantMonoidal` instance for `CsvRowEncoder`
- `CsvRowDecoder`/`CsvRowEncoder`/`RowDecoder`/`RowEncoder` are now sealed, and suitable typeclass instances are provided to combine smaller encoders/decoders into larger ones. Specifically, there is now a `ContravariantMonoidal` instance for the `*Encoder` type classes

All of this is nowhere near complete. For example, I think the `Applicative` instance for `RowDecoder` can be made more compositional, and I'll be happy to discuss that in detail later. There is also no `Monad` instance for `CsvRowDecoder` yet, which is due to the fact that I'd like to completely process the CSV header up-front and never look at it again while parsing the CSV rows. It's possible to do this using Selective Applicative Functors.

Alright, that should be enough to get the discussion started.
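As a plain-Scala sketch of the "process the header up-front" idea (illustrative names only, no Selective machinery): column lookup happens once against the header and can fail fast there; afterwards each row decode is a simple index access.

```scala
// resolve a named column once, against the header only
def column(name: String, headers: List[String]): Either[String, List[String] => Either[String, String]] =
  headers.indexOf(name) match
    case -1 => Left(s"missing column: $name") // fail fast, before any row is read
    case i  => Right(row => row.lift(i).toRight(s"row too short for column: $name"))

@main def headerDemo(): Unit =
  val getFoo = column("foo", List("id", "foo"))
  assert(getFoo.flatMap(_.apply(List("1", "hello"))) == Right("hello"))
  assert(column("bar", List("id", "foo")).isLeft)
```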