Open stevladimir opened 2 years ago
Hello! Any thoughts on this?
Friendly ping
Sorry for the slow reaction.
I rebased this onto master
to get CI to run.
The testsuite has (trivial) problems with the legacy GHCs (< 7.10).
tests/UnitTests.hs:485:30: Not in scope: ‘pure’
Adding 'pure' to imports fixed the problem
Adding 'pure' to imports fixed the problem
I moved this to the conditional import of Control.Applicative
to avoid the unused import warning on new GHCs.
I see your parser supports untagged unions via <|>
. Backtracking can be costly, so I do not know if this is in the spirit of this package: https://github.com/haskell-hvr/cassava/blob/fd168df89805fb0f66d9a8aab92abeb157a300c3/README.md?plain=1#L78-L86
I added a performance test for parsing unions that should take O(2ˆn) wrong branches until it succeeds. I only played it to n=10
, and there no problem is visible to the naked eye. However, maybe we should have more serious benchmarking, or drop this feature. Opinions?
Also, could you draft a CHANGELOG entry describing the new features?
I totally agree that performance is important, so will add benchmarks.
Though I don't think extreme cases like you provided are a real problem unless there is a significant performance loss that is "unexpected" and can be avoided. As example of "unexpected" performance loss: parsing of enum happens to be O(n^2) instead of O(n), where n
is the number of constructors. But otherwise parsing of (Either Foo (Either Bar (Either Baz (...))))
can't be super fast in its nature. I would say there is an issue if there are cases where we have complexity worse than O(n) with respect to the number of "branches".
Encoding for sums with unary constructors is somewhat fishy for another reason similar to UntaggedValue:
Constructor names won't be encoded. Instead only the contents of the constructor will be encoded as if the type had a single constructor. JSON encodings have to be disjoint for decoding to work properly.
When decoding, constructors are tried in the order of definition. If some encodings overlap, the first one defined will succeed.
This may be confusing but IMHO parsing sums with unary constructors is useful, e.g. data Limit = NoLimit | Limit Int | Unlimited
. And IMHO it is acceptable to put obligation on a user to ensure that encodings of constructors are disjoint. It's still less tedious and less error prone than writing instances manually :) But it is up to you as a maintainer to make a decision.
Let me add some benchmarks and than we can come back to the discussion. Any ideas of what else to benchmark beside for "branched" unions?
Also, could you draft a CHANGELOG entry describing the new features?
Sure thing, but I think it makes sense to do after we resolve opened question.
I agree to your argumentation. Banning untagged unions altogether just because there are problematic cases would throw out the baby with the bathwater. So, let's include them.
Either
.fail $ "Can't parse " <> show expected <> " from " <> show field
.
Replacing that with just empty
makes fail case ~10x faster (see *-dumb-errors-*
benchmark below) and even slightly more performant than manual instance (I do suspect that this is because generic instance is implemented using Parser constructor, which is otherwise hidden, so hand written instances have to use pure/fail interface instead of CPSed Parser \onFailure onSuccess
).
What is more preferable in this case: performance or informativeness? We can also make this customizable via Cabal flag or Options
, so can have both and let end-user to choose.FieldOptions
type or Options.constructorTagModifier
field? It feels not intuitive that fieldLabelModifier
is used for constructors. I initially used existing Options
for simplicity, but IMO it should not be merged as is. I can see few options:
FieldOptions
(I would go with this option).
In that case I would also rename Options
to RecordOptions
. To be backward compatible we can:
type Options = RecordOptions
and {-# DEPRECATED Options "Use 'RecordOptions'" #-}
pattern Options
and deprecate it too (not sure about type/constructor name clashes in DEPRECATED pragmas)Options.constructorTagModifier
fieldI cleaned up TODO list above, so duplicating myself here
Sorry for long pause - I digged into the rabbit hole of performance :) Also was somewhat limited in time.
The original implementation turned to be 10x slower than manually written instance for enums. I managed to make generic instances almost as performant. I also added Nix shell as it is easier to switch GHCs with it. Let it be here for a while - if you think it is valuable it is better to make a separate PR IMO.
Below is the most recent set of benchmarks. I hope filenames are self descriptive (let me know if something is unclear). '-dumb-errors-' is the result of the following patch applied:
@@ -775,8 +775,9 @@ genericParseField
genericParseField opts field = Parser $ \onFailure onSuccess ->
unParser (gParseField opts field) (\_ -> onFailure err) (onSuccess . to . M1)
where
- err = "Can't parseField of type " <> datatypeName (Proxy :: Proxy meta d f)
- <> " from " <> show field
+ err = "fail"
+ -- err = "Can't parseField of type " <> datatypeName (Proxy :: Proxy meta d f)
+ -- <> " from " <> show field
{-# INLINE genericParseField #-}
-- | A type that can be converted to a single CSV field.
@@ -1405,7 +1406,7 @@ instance (Constructor c) => GFromField (C1 c U1) where
gParseField opts field = Parser $ \onFailure onSuccess ->
if field == expected
then onSuccess val
- else onFailure $ "Can't parse " <> show expected <> " from " <> show field
+ else onFailure "Fail" -- $ "Can't parse " <> show expected <> " from " <> show field
where
expected = encodeConstructor opts val
val :: C1 c U1 p
Llist of open questions I need your opinion about (much duplicates TODO):
fail $ "Can't parse " <> show expected <> " from " <> show field
.
Replacing that with just empty makes fail case ~10x faster (see benchmark-iut-dumb-errors
benchmarks below) and even slightly more performant than manual instance (I do suspect that this is because generic instance is implemented using Parser constructor, which is otherwise hidden, so hand written instances have to use pure/fail interface instead of CPSed Parser \onFailure onSuccess).
What is more preferable in this case: performance or informativeness? We can also make this customizable via Cabal flag or Options, so can have both and let end-user to choose.FieldOptions
type or Options.constructorTagModifier
field? It feels not intuitive that fieldLabelModifier
is used for constructors. I initially used existing Options
for simplicity, but IMO it should not be merged as is. I can see few options:
FieldOptions
(I would go with this option).
In that case I would also rename Options
to RecordOptions
. To be backward compatible we can:
type Options = RecordOptions
and {-# DEPRECATED Options "Use 'RecordOptions'" #-}
pattern Options
and deprecate it too (not sure about type/constructor name clashes in DEPRECATED pragmas)Options.constructorTagModifier
fieldFew side notes (posting here not to forget to discuss separately whether it makes sense to investigate further):
benchmark-ref
and benchmark-iut
(the latter being often slightly or moderately, like in genericParseFail bench, more performant). IIUC benchmark-ref
models what end-user sees, while benchmark-iut
uses cassava as internal library.
I can assume that's because in the latter case -O2 is applied to both cassava-iut
library and benchmark executable, so some internal things are better optimized. But maybe you have better understanding of what is going on.With best regards!
Debian 10, Intel(R) Core(TM) i7-8750H (12 core, 6 physical), GHC 9.0.2
benchmark-iut.csv benchmark-iut.pdf benchmark-iut-dumb-errors.csv benchmark-iut-dumb-errors.pdf benchmark-ref.csv benchmark-ref.pdf benchmark-ref-dumb-errors.csv benchmark-ref-dumb-errors.pdf
Also I've noticed some performance difference between benchmark-ref and benchmark-iut (the latter being often slightly or moderately, like in genericParseFail bench, more performant). IIUC benchmark-ref models what end-user sees, while benchmark-iut uses cassava as internal library. I can assume that's because in the latter case -O2 is applied to both cassava-iut library and benchmark executable, so some internal things are better optimized. But maybe you have better understanding of what is going on.
I've carefully revisited things and now almost sure that my original understanding was wrong. The purpose of benchmark-ref
seems to be a reference implementation (yeah, ref
in its name), which always uses cassava
from Hackage (or whatever is installed in the environment). And benchmark-iut
is to test actual changes done locally, so one can easily compare performance changes.
Beside for that I've added generic instances for uninhabited types (Data.Void.Void
is an example of such types - added instance for it too).
Just to summarize - supported representations:
-- 1. Uninhabited types: implementation is absurd, but it is expected that such terms never exist.
-- Still they may be useful when one needs to statically guarantee non-existence of term:
-- e.g., "Maybe Void" can exist only as "Nothing" (ignoring bottom)
-- An example from practice
data Foo (p :: Bool) = Foo
{ foo :: Maybe (Restricted p)
, ... }
type family Restricted (p :: Bool) where
Restricted True = Int
Restricted False = Void
-- It's ofc a simplified example and this could be modeled with passing type explicitly,
-- but that is not always feasible for other reasons.
-- 2. Nullary constructor types
data Foo0 = Foo0
-- toField Foo0 == "Foo0"
-- 3. Unary constructor types
-- This was absent in original implementation, but unlike with JSON there is no ambiguity,
-- so see no sense to not provide instance for them.
-- In JSON `data Foo a = Foo{foo :: a}` can be reasonably encoded
-- both as `toJSON a` and `object ["foo" .= toJSON a]`.
newtype Foo1 a = Foo1 a
-- toField (Foo1 a) == toField a
-- 4. Sum with branches fitting into 2. or 3.
data FooSum a = FooA | FooB a
-- toField FooA == "FooA"
-- toField (FooB a) == toField a
Just in case: I'm waiting for some feedback
Friendly ping
Works only for the following representations: