haskell-hvr / cassava

A CSV parsing and encoding library optimized for ease of use and high performance
http://hackage.haskell.org/package/cassava
BSD 3-Clause "New" or "Revised" License

[ #206 ] Deriving FromField/ToField instances #207

Open stevladimir opened 2 years ago

stevladimir commented 2 years ago

Works only for the following representations:

stevladimir commented 2 years ago

Hello! Any thoughts on this?

stevladimir commented 2 years ago

Friendly ping

andreasabel commented 2 years ago

Sorry for the slow reaction.

I rebased this onto master to get CI to run. The testsuite has (trivial) problems with the legacy GHCs (< 7.10).

tests/UnitTests.hs:485:30: Not in scope: ‘pure’

stevladimir commented 2 years ago

Adding 'pure' to imports fixed the problem

andreasabel commented 2 years ago

Adding 'pure' to imports fixed the problem

I moved this to the conditional import of Control.Applicative to avoid the unused import warning on new GHCs.

andreasabel commented 2 years ago

I see your parser supports untagged unions via <|>. Backtracking can be costly, so I do not know if this is in the spirit of this package: https://github.com/haskell-hvr/cassava/blob/fd168df89805fb0f66d9a8aab92abeb157a300c3/README.md?plain=1#L78-L86

I added a performance test for parsing unions that should take O(2^n) wrong branches until it succeeds. I only pushed it to n=10, and at that size no problem is visible to the naked eye. However, maybe we should have more serious benchmarking, or drop this feature. Opinions?
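To make the backtracking cost concrete, here is a toy base-only sketch (not cassava's actual `Parser`; all names are mine) of how `(<|>)` over an untagged union retries every failed branch from scratch, so matching the last of n constructors costs n attempts:

```haskell
module Main where

import Control.Applicative (Alternative (..))

-- Hypothetical toy parser that also counts how many branches were attempted.
newtype P a = P { runP :: String -> (Int, Maybe a) }

instance Functor P where
  fmap f (P p) = P $ \s -> let (n, r) = p s in (n, fmap f r)

instance Applicative P where
  pure x = P $ \_ -> (0, Just x)
  P pf <*> P px = P $ \s ->
    let (n, mf) = pf s
    in case mf of
         Nothing -> (n, Nothing)
         Just f  -> let (m, mx) = px s in (n + m, fmap f mx)

instance Alternative P where
  empty = P $ \_ -> (0, Nothing)
  P p <|> P q = P $ \s ->
    case p s of
      (n, Just r)  -> (n, Just r)       -- first branch succeeded
      (n, Nothing) -> let (m, r) = q s  -- backtrack: retry the input on q
                      in (n + m, r)

-- One branch per constructor tag: succeeds only on an exact match.
tag :: String -> P String
tag t = P $ \s -> (1, if s == t then Just t else Nothing)

-- An "untagged union" of nullary constructors, tried left to right.
union :: [String] -> P String
union = foldr ((<|>) . tag) empty

main :: IO ()
main = do
  let tags = ["C" ++ show i | i <- [1 .. 10 :: Int]]
  -- Matching the last constructor attempts all 10 branches.
  print (fst (runP (union tags) "C10"))
```

The exponential blow-up appears once such unions nest (each failing outer branch re-runs its whole inner union), which is what the n=10 performance test probes.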

Also, could you draft a CHANGELOG entry describing the new features?

stevladimir commented 2 years ago

I totally agree that performance is important, so will add benchmarks.

Though I don't think extreme cases like the one you provided are a real problem, unless there is a significant performance loss that is "unexpected" and can be avoided. As an example of "unexpected" performance loss: parsing of an enum happens to be O(n^2) instead of O(n), where n is the number of constructors. But otherwise, parsing of (Either Foo (Either Bar (Either Baz (...)))) can't be super fast by nature. I would say there is an issue only if there are cases where complexity is worse than O(n) with respect to the number of "branches".

Encoding for sums with unary constructors is somewhat fishy for another reason, similar to aeson's UntaggedValue:

Constructor names won't be encoded. Instead only the contents of the constructor will be encoded as if the type had a single constructor. JSON encodings have to be disjoint for decoding to work properly.

When decoding, constructors are tried in the order of definition. If some encodings overlap, the first one defined will succeed.

This may be confusing, but IMHO parsing sums with unary constructors is useful, e.g. data Limit = NoLimit | Limit Int | Unlimited. And IMHO it is acceptable to put the obligation on the user to ensure that the encodings of the constructors are disjoint. It's still less tedious and less error-prone than writing instances manually :) But it is up to you as the maintainer to make the decision.
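For illustration, a base-only sketch (plain `Maybe` standing in for cassava's `Parser`; the helper name is mine) of the hand-written decoding such a derived instance would replace, with constructors tried in definition order and relying on the encodings being disjoint:

```haskell
module Main where

import Control.Applicative ((<|>))
import Text.Read (readMaybe)

data Limit = NoLimit | Limit Int | Unlimited
  deriving (Eq, Show)

-- Hypothetical hand-written decoder: try "NoLimit", then a bare number,
-- then "Unlimited". The three encodings never overlap, so order is safe.
parseLimit :: String -> Maybe Limit
parseLimit s =
      (if s == "NoLimit" then Just NoLimit else Nothing)
  <|> (Limit <$> readMaybe s)
  <|> (if s == "Unlimited" then Just Unlimited else Nothing)

main :: IO ()
main = print (map parseLimit ["NoLimit", "42", "Unlimited", "oops"])
```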

Let me add some benchmarks and then we can come back to the discussion. Any ideas on what else to benchmark besides "branched" unions?

Also, could you draft a CHANGELOG entry describing the new features?

Sure thing, but I think it makes sense to do that after we resolve the open questions.

andreasabel commented 2 years ago

I agree with your argumentation. Banning untagged unions altogether just because there are problematic cases would throw out the baby with the bathwater. So, let's include them.

stevladimir commented 1 year ago

TODO

stevladimir commented 1 year ago

I cleaned up the TODO list above, so I'm duplicating myself here.

Sorry for the long pause - I dug into the rabbit hole of performance :) I was also somewhat limited in time.

The original implementation turned out to be 10x slower than a manually written instance for enums. I managed to make the generic instances almost as performant. I also added a Nix shell, as it makes switching between GHCs easier. Let it stay here for a while - if you think it is valuable, it is better to make it a separate PR IMO.

Below is the most recent set of benchmarks. I hope the filenames are self-descriptive (let me know if something is unclear). '-dumb-errors-' is the result of applying the following patch:

@@ -775,8 +775,9 @@ genericParseField
 genericParseField opts field = Parser $ \onFailure onSuccess ->
   unParser (gParseField opts field) (\_ -> onFailure err) (onSuccess . to . M1)
   where
-    err = "Can't parseField of type " <> datatypeName (Proxy :: Proxy meta d f)
-      <> " from " <> show field
+    err = "fail"
+    -- err = "Can't parseField of type " <> datatypeName (Proxy :: Proxy meta d f)
+    --   <> " from " <> show field
 {-# INLINE genericParseField #-}

 -- | A type that can be converted to a single CSV field.
@@ -1405,7 +1406,7 @@ instance (Constructor c) => GFromField (C1 c U1) where
   gParseField opts field = Parser $ \onFailure onSuccess ->
     if field == expected
     then onSuccess val
-    else onFailure $ "Can't parse " <> show expected <> " from " <> show field
+    else onFailure "Fail" -- $ "Can't parse " <> show expected <> " from " <> show field
     where
       expected = encodeConstructor opts val
       val :: C1 c U1 p

List of open questions I need your opinion on (much of it duplicates the TODO):

  1. Anything else to be benchmarked?
  2. Error reporting. My original implementation tries to be informative about how parsing failed, i.e. fail $ "Can't parse " <> show expected <> " from " <> show field. Replacing that with just empty makes the failure case ~10x faster (see the benchmark-iut-dumb-errors benchmarks below) and even slightly more performant than the manual instance (I suspect this is because the generic instance is implemented using the Parser constructor, which is otherwise hidden, so hand-written instances have to use the pure/fail interface instead of the CPSed Parser \onFailure onSuccess). Which is preferable here: performance or informativeness? We could also make this customizable via a Cabal flag or Options, so we could have both and let the end user choose.
  3. Would it make sense to introduce a separate FieldOptions type or an Options.constructorTagModifier field? It doesn't feel intuitive that fieldLabelModifier is used for constructors. I initially used the existing Options for simplicity, but IMO it should not be merged as is. I can see a few options:
    1. A new FieldOptions structure (I would go with this option). In that case I would also rename Options to RecordOptions. To be backward compatible we can:
      • type Options = RecordOptions and {-# DEPRECATED Options "Use 'RecordOptions'" #-}
      • pattern Options and deprecate it too (not sure about type/constructor name clashes in DEPRECATED pragmas)
    2. An Options.constructorTagModifier field
    3. Just leave it as is (I'm strongly opposed)
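A runnable sketch of option 1's compatibility shim (names are illustrative, with a single field standing in for the real Options; whether one DEPRECATED pragma can target the type synonym and the pattern synonym separately is exactly the open sub-question):

```haskell
{-# LANGUAGE PatternSynonyms #-}
module Main where

-- Hypothetical renamed record.
newtype RecordOptions = MkRecordOptions
  { fieldLabelModifier :: String -> String }

-- Backward-compatible alias for the old type name.
type Options = RecordOptions
{-# DEPRECATED Options "Use 'RecordOptions'" #-}

-- Backward-compatible "constructor" via an implicitly bidirectional
-- pattern synonym: old code using Options in expressions and patterns
-- keeps compiling.
pattern Options :: (String -> String) -> RecordOptions
pattern Options f = MkRecordOptions f

main :: IO ()
main = putStrLn (fieldLabelModifier (Options (++ "_field")) "limit")
```

Pattern synonyms live in a different namespace from type synonyms, so both names can coexist; deprecation warnings only fire for importers, not for uses in the defining module.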

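On question 2, a hypothetical base-only sketch (not cassava's actual API) of the Options-style knob that would let the end user trade error detail for speed:

```haskell
module Main where

-- Hypothetical option controlling error verbosity.
data ErrorDetail = Terse | Verbose

-- Terse returns a shared constant string; Verbose allocates and formats
-- the message, which is what made the failure path ~10x slower in the
-- dumb-errors benchmarks.
parseError :: ErrorDetail -> String -> String -> String
parseError Terse   _        _     = "fail"
parseError Verbose expected field =
  "Can't parse " ++ show expected ++ " from " ++ show field

main :: IO ()
main = do
  putStrLn (parseError Terse "NoLimit" "nolimit")
  putStrLn (parseError Verbose "NoLimit" "nolimit")
```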
Few side notes (posting here not to forget to discuss separately whether it makes sense to investigate further):

  1. I would consider splitting the tests into modules, as that generally makes compilation faster. But that's a separate PR.
  2. Explore, and possibly expose, the Parser constructor (or add a respective combinator) to get the most out of the performance.
  3. Also, I've noticed some performance difference between benchmark-ref and benchmark-iut (the latter often being slightly or moderately more performant, e.g. in the genericParseFail bench). IIUC benchmark-ref models what the end user sees, while benchmark-iut uses cassava as an internal library. I assume that's because in the latter case -O2 is applied to both the cassava-iut library and the benchmark executable, so some internal things are better optimized. But maybe you have a better understanding of what is going on.

With best regards!

stevladimir commented 1 year ago

Debian 10, Intel(R) Core(TM) i7-8750H (6 physical cores, 12 threads), GHC 9.0.2

benchmark-iut.csv benchmark-iut.pdf benchmark-iut-dumb-errors.csv benchmark-iut-dumb-errors.pdf benchmark-ref.csv benchmark-ref.pdf benchmark-ref-dumb-errors.csv benchmark-ref-dumb-errors.pdf

stevladimir commented 1 year ago

Also, I've noticed some performance difference between benchmark-ref and benchmark-iut (the latter often being slightly or moderately more performant, e.g. in the genericParseFail bench). IIUC benchmark-ref models what the end user sees, while benchmark-iut uses cassava as an internal library. I assume that's because in the latter case -O2 is applied to both the cassava-iut library and the benchmark executable, so some internal things are better optimized. But maybe you have a better understanding of what is going on.

I've carefully revisited things and am now almost sure that my original understanding was wrong. The purpose of benchmark-ref seems to be a reference implementation (hence the ref in its name), which always uses cassava from Hackage (or whatever is installed in the environment). And benchmark-iut is for testing the actual changes made locally, so one can easily compare performance changes.

Besides that, I've added generic instances for uninhabited types (Data.Void.Void is an example of such a type - I added an instance for it too).

Just to summarize - supported representations:

-- 1. Uninhabited types: the implementation is absurd, but it is expected that such terms never exist.
-- Still, they may be useful when one needs to statically guarantee the non-existence of a term:
-- e.g. "Maybe Void" can exist only as "Nothing" (ignoring bottom).
-- An example from practice:
data Foo (p :: Bool) = Foo
  { foo :: Maybe (Restricted p)
  , ... }
type family Restricted (p :: Bool) where
  Restricted True = Int
  Restricted False = Void
-- It's of course a simplified example and this could be modeled by passing the type explicitly,
-- but that is not always feasible for other reasons.

-- 2. Nullary constructor types
data Foo0 = Foo0
-- toField Foo0 == "Foo0"

-- 3. Unary constructor types
-- This was absent in the original implementation, but unlike with JSON there is no ambiguity,
-- so I see no reason not to provide an instance for them.
-- In JSON `data Foo a = Foo{foo :: a}` can be reasonably encoded
-- both as `toJSON a` and as `object ["foo" .= toJSON a]`.
newtype Foo1 a = Foo1 a
-- toField (Foo1 a) == toField a

-- 4. Sums whose branches fit into 2. or 3.
data FooSum a = FooA | FooB a
-- toField FooA == "FooA"
-- toField (FooB a) == toField a

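The type-family trick behind representation 1 can be checked without cassava; a self-contained base-only sketch (the single record field stands in for the practical example above):

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE TypeFamilies #-}
module Main where

import Data.Void (Void)

-- Closed type family: the field type depends on the phase parameter.
type family Restricted (p :: Bool) where
  Restricted 'True  = Int
  Restricted 'False = Void

newtype Foo (p :: Bool) = Foo { foo :: Maybe (Restricted p) }

-- With p ~ 'False the field type is Maybe Void, whose only inhabitant
-- (ignoring bottom) is Nothing, so a FromField/ToField instance for
-- Void exists but is never actually reached.
closed :: Foo 'False
closed = Foo Nothing

open :: Foo 'True
open = Foo (Just 42)

main :: IO ()
main = print (foo open)
```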
stevladimir commented 1 year ago

Just in case: I'm waiting for some feedback

stevladimir commented 8 months ago

Friendly ping