Improve .janno parsing

nevrome commented 2 months ago

When working through all these PRs in the community archive I realized that the .janno parsing certainly needs more informative error messages. In this PR I would like to propose a better and more transparent solution.

Here's a summary of what I did:

Introduced individual Janno... types for every janno column (except Poseidon_ID) in a new module ColumnTypes.
Each type is an instance of Show, ToField, FromField (cassava), ToJSON, FromJSON (Aeson) and two custom typeclasses HasColName and Makeable.
Makeable has a function make :: MonadFail m => T.Text -> m a that can be used for very precise input validation.
These typeclasses are defined in a new module ColumnTypesUtils together with a bit of Template Haskell code (makeInstances) to reduce the amount of boilerplate for simple text columns.
I switched to Text for the string types and as an intermediate format, so that we can reliably check for non-UTF-8 characters with T.decodeUtf8'.
The general .csv field parsing sequence is now as follows: Transform the ByteString to UTF-8 encoded Text and fail upon exceptions. Then transform the Text to the desired type with a make... constructor function. This function does additional validation and fails if the checks cannot be satisfied.
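
The two-step sequence described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the PR: the Result type is a hypothetical stand-in for cassava's Parser (so the sketch needs only the text and bytestring packages), and parseField, RelationNote and Coverage are names invented for this example. Only Makeable, make and T.decodeUtf8' are taken from the description.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString    as B
import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Read     as TR

-- Hypothetical stand-in for cassava's Parser: Either with a MonadFail instance.
newtype Result a = Result { runResult :: Either String a }

instance Functor Result where
    fmap f (Result e) = Result (fmap f e)

instance Applicative Result where
    pure = Result . Right
    Result f <*> Result x = Result (f <*> x)

instance Monad Result where
    Result (Left e)  >>= _ = Result (Left e)
    Result (Right x) >>= f = f x

instance MonadFail Result where
    fail = Result . Left

-- The class as described in the PR text: validation lives in any MonadFail.
class Makeable a where
    make :: MonadFail m => T.Text -> m a

-- A simple text column: every valid Text is accepted.
newtype RelationNote = RelationNote T.Text deriving (Show, Eq)

instance Makeable RelationNote where
    make = pure . RelationNote

-- A numeric column: the whole field must parse as a Double.
newtype Coverage = Coverage Double deriving (Show, Eq)

instance Makeable Coverage where
    make txt = case TR.double txt of
        Left err -> fail ("Coverage_on_Target_SNPs can not be converted to Double because " ++ err)
        Right (d, rest)
            | T.null rest -> pure (Coverage d)
            | otherwise   -> fail ("Coverage_on_Target_SNPs can not be converted to Double, because of a trailing " ++ show rest)

-- Step 1: decode the raw bytes as UTF-8 and fail on invalid input.
-- Step 2: hand the Text to the column-specific make function.
parseField :: Makeable a => String -> B.ByteString -> Result a
parseField col bytes = case TE.decodeUtf8' bytes of
    Left err  -> fail (show err ++ " in column " ++ col)
    Right txt -> make txt

main :: IO ()
main = do
    print (runResult (parseField "Coverage_on_Target_SNPs" "32.12"  :: Result Coverage))
    print (runResult (parseField "Coverage_on_Target_SNPs" "32.12;" :: Result Coverage))
    print (runResult (parseField "Relation_Note" (B.pack [0x80])    :: Result RelationNote))
```

Because make is polymorphic in the monad, the same instance could in principle run inside cassava's Parser, in Maybe, or in IO without changes.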

I did not introduce additional validation here, but this new setup makes it very easy to do so by modifying the make function of a given type. make only lives in MonadFail, so additional checks that should only yield a warning must be done in checkJannoRowConsistency (as in the past).
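
As a rough illustration of what adding such a check might look like, the sketch below extends a make function with one extra hard constraint. The Latitude newtype and the range check are assumptions made for this example only; the PR itself does not add this validation. The sketch runs in Maybe to stay self-contained, where a failed check simply yields Nothing.

```haskell
module Main where

import qualified Data.Text      as T
import qualified Data.Text.Read as TR

-- The class as described in the PR text.
class Makeable a where
    make :: MonadFail m => T.Text -> m a

-- Hypothetical column type for this example.
newtype Latitude = Latitude Double deriving (Show, Eq)

instance Makeable Latitude where
    make txt = case TR.double txt of
        Left err -> fail ("Latitude can not be converted to Double because " ++ err)
        Right (d, rest)
            | not (T.null rest) -> fail ("Latitude has trailing characters: " ++ show rest)
            -- Hypothetical additional check: a hard failure, not a warning.
            | abs d > 90        -> fail ("Latitude " ++ show d ++ " is outside [-90, 90]")
            | otherwise         -> pure (Latitude d)

main :: IO ()
main = do
    print (make (T.pack "18.93726")  :: Maybe Latitude)
    print (make (T.pack "118.93726") :: Maybe Latitude)
    print (make (T.pack "x18.93726") :: Maybe Latitude)
```

Since MonadFail offers no channel for non-fatal messages, anything that should only warn would still have to live outside make, as noted above.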

I did not adjust the type for the .ssf file (SeqSourceRow) yet. It's not so urgent, imho. But my idea was that its columns could also be added in ColumnTypes eventually.

Now what do we get out of this change?

To demonstrate this I took the 2012_MeyerScience.janno file and broke some of its columns:

I added some non-UTF8 encoded characters in the Relation_Note column of line 2.
I added a trailing ; in Coverage_on_Target_SNPs of line 3.
I added a leading x to the Latitude column of line 7.

Here's what trident returns without this patch:

[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 2:
parse error (Failed reading: conversion error: Cannot decode byte '\x80': Data.Text.Encoding: Invalid UTF-8 stream)
[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 3:
parse error in one column (expected data type: Double, broken value: "32.12;", problematic characters: ";")
[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 7:
parse error (Failed reading: conversion error: expected Double, got "x18.93726" (Failed reading: takeWhile1))

And now with the changes here:

[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 2:
parse error (Failed reading: conversion error: Cannot decode byte '\x80': Data.Text.Encoding: Invalid UTF-8 stream in column Relation_Note)
[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 3:
parse error (Failed reading: conversion error: Coverage_on_Target_SNPs can not be converted to Double, because of a trailing ";")
[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 7:
parse error (Failed reading: conversion error: Latitude can not be converted to Double because input does not start with a digit)

Most importantly, the error messages now include the relevant column name. They are also more concrete and easier to understand.

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 53.89610% with 213 lines in your changes missing coverage. Please review.

Project coverage is 59.56%. Comparing base (5306ba0) to head (5406a3b). Report is 35 commits behind head on master.

Files with missing lines	Patch %	Lines
src/Poseidon/ColumnTypes.hs	42.98%	70 Missing and 121 partials :warning:
src/Poseidon/CLI/Summarise.hs	36.36%	0 Missing and 7 partials :warning:
src/Poseidon/SequencingSource.hs	79.41%	3 Missing and 4 partials :warning:
src/Poseidon/ColumnTypesUtils.hs	28.57%	5 Missing :warning:
src/Poseidon/Janno.hs	96.92%	1 Missing and 1 partial :warning:
src/Poseidon/Package.hs	85.71%	0 Missing and 1 partial :warning:

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #307      +/-   ##
==========================================
- Coverage   60.40%   59.56%   -0.85%
==========================================
  Files          27       29       +2
  Lines        4031     4078      +47
  Branches      409      484      +75
==========================================
- Hits         2435     2429       -6
+ Misses       1187     1165      -22
- Partials      409      484      +75
```


nevrome commented 1 month ago

This must feel like a nightmare to review, but I think it looks worse than it is. I only worked in the Janno and the new ColumnTypes(Utils) modules. The other adjustments are just fallout from that.

Open questions on my end are:

Is this entire infrastructure too verbose? One type for each column allows for really good control, but it also introduces a lot of repetitive code.
Is the Template Haskell a good idea? It noticeably increases compilation time for everything upstream of Janno, but it also clearly reduces the amount of code. And we already used Template Haskell, so this is not a new dependency.
I did not introduce a custom newtype wrapper for Poseidon_ID, because it would be a pain to realize this change across the code base. The one for Group_Name is already questionable, but I wanted the new, reliable UTF-8 validation for the group names.
I wonder if there are unforeseen consequences of this patch, e.g. for xerxes. I think it shouldn't affect things too much, and always in a very predictable manner. Having such a wealth of precise types sharpens the eyes of the compiler.

stschiff commented 1 month ago

OK, thanks. I think I'm not too sceptical from the outset. Let me have a look.

stschiff commented 2 weeks ago

I think I would like to put reviewing this on a short hold until we've made a decision about #309. Whether we want to remove the AESON instances is also worth a short discussion. We can do so, and at that point the whole Template Haskell bit might become a bit obsolete, because there would no longer be that many type class instances to fill in automatically. But I'm happy to discuss; we can also just do it the way you've done it now.

I will be on vacation next week, and will pick this up again on Oct 21st.

stschiff commented 2 weeks ago

And I just wanted to say that from a user-perspective this PR is a huge improvement. I have already made use of it when I prepared a recent Janno file and used this version to point me to errors!

nevrome commented 1 week ago

Yes - I found myself also immediately switching to this branch for all practical work with the data. It makes spotting errors easier.

> Whether we want to remove the AESON instances is also worth a short discussion.

What do you mean? We need the FromJSON and ToJSON instances for the server-client interaction here:

https://github.com/poseidon-framework/poseidon-hs/blob/5306ba0a07bbb1bfae41225a834219a2687b65a3/src/Poseidon/ServerClient.hs#L77-L112

This code does not compile if we remove a single instance definition for any of the janno types. Or is there an easy way to refactor this into an Aeson-free solution?

stschiff commented 5 days ago

Hmm, interesting. I think the constructor ApiReturnJanno must have been added pre-emptively. It is not actually used anywhere. I think we dropped the idea of adding a dedicated Janno-Return API once I added the option to transfer any additional Janno columns as untyped Strings together with the Individual-Info. But good catch!

I have just checked: If I comment out the lines with these constructors in them, the code compiles just fine. So these were really added with no client actually using them.

I don't know whether we should go that step and remove all these Aeson-instances. They are definitely not used right now, and our JSON transfer just uses the Cassava Csv.ToField instances to serialise fields.

I guess we can remove them and then also remove this unused feature from the ServerClient.

poseidon-framework / poseidon-hs

Improve .janno parsing #307
