poseidon-framework / poseidon-hs

A toolset to work with modular genotype databases in the Poseidon format
https://poseidon-framework.github.io/#/trident
MIT License

Improve .janno parsing #307

Closed nevrome closed 4 days ago

nevrome commented 2 months ago

When working through all these PRs in the community archive I realized that the .janno parsing certainly needs more informative error messages. In this PR I would like to propose a better and more transparent solution.

Here's a summary of what I did:

I did not introduce additional validation here, but this new setup makes it very easy to do so by modifying the make function of a given type. make only lives in MonadFail, so additional checks that should only yield a warning must be done in checkJannoRowConsistency (as in the past).
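To illustrate the point about `make`, here is a minimal sketch of how an extra check could be added to one column type. The class name `Makeable`, the `String` input, and the range check are assumptions made for this example and are not copied from the PR; only the idea that `make` runs in any `MonadFail` comes from the description above.

```haskell
import Text.Read (readMaybe)

-- Hypothetical stand-in for one typed .janno column.
newtype Latitude = Latitude Double deriving (Show, Eq)

-- Hypothetical class: every column type knows how to build itself
-- from the raw cell content, failing in any MonadFail.
class Makeable a where
    make :: MonadFail m => String -> m a

instance Makeable Latitude where
    make raw = case readMaybe raw of
        Nothing -> fail ("Latitude can not be converted to Double: " ++ show raw)
        Just x
            -- Additional validation (an assumption for this sketch):
            -- a hard failure for values outside the valid range.
            | x < (-90) || x > 90 ->
                fail ("Latitude " ++ show x ++ " is outside the range -90 to 90")
            | otherwise -> pure (Latitude x)
```

Because `make` is polymorphic in the monad, the caller decides how a failure surfaces: `make "48.2" :: Maybe Latitude` gives `Just (Latitude 48.2)`, while `make "123" :: Maybe Latitude` gives `Nothing`. A check that should only warn cannot be expressed this way, which is why it would stay in `checkJannoRowConsistency`.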

I did not adjust the type for the .ssf file (SeqSourceRow) yet. It's not so urgent, imho. But my idea was that its columns could also be added in ColumnTypes eventually.


Now what do we get out of this change?

To demonstrate this I took the 2012_MeyerScience.janno file and broke some of its columns:

Here's what trident returns without this patch:

[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 2:
parse error (Failed reading: conversion error: Cannot decode byte '\x80': Data.Text.Encoding: Invalid UTF-8 stream)
[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 3:
parse error in one column (expected data type: Double, broken value: "32.12;", problematic characters: ";")
[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 7:
parse error (Failed reading: conversion error: expected Double, got "x18.93726" (Failed reading: takeWhile1))

And now with the changes here:

[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 2:
parse error (Failed reading: conversion error: Cannot decode byte '\x80': Data.Text.Encoding: Invalid UTF-8 stream in column Relation_Note)
[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 3:
parse error (Failed reading: conversion error: Coverage_on_Target_SNPs can not be converted to Double, because of a trailing ";")
[Error]   Can't read sample in 2012_MeyerScience/2012_MeyerScience2.csv in line 7:
parse error (Failed reading: conversion error: Latitude can not be converted to Double because input does not start with a digit)

Most importantly, the error messages now include the relevant column name. They are also more concrete and easier to understand.
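As a rough illustration of how column-aware messages like the ones above could be produced, here is a small sketch. The function `parseDoubleColumn` and its heuristics are hypothetical and written only for this example; they are not the implementation in this PR.

```haskell
import Data.Char (isDigit)
import Text.Read (readMaybe)

-- Hypothetical helper: parse a cell as Double and, on failure,
-- name the column and give a likely reason.
parseDoubleColumn :: String -> String -> Either String Double
parseDoubleColumn colName raw =
    case readMaybe raw of
        Just x  -> Right x
        Nothing -> Left (colName ++ " can not be converted to Double" ++ reason)
  where
    reason
        | null raw = " because the input is empty"
        | not (startsNumeric raw) = " because input does not start with a digit"
        | otherwise = case dropWhile isNumberChar raw of
            ""   -> ""  -- no obvious culprit, keep the message generic
            rest -> ", because of a trailing " ++ show rest
    -- A leading minus sign is accepted as the start of a number.
    startsNumeric (c:_) = isDigit c || c == '-'
    startsNumeric []    = False
    -- Characters that may appear inside a plain decimal number.
    isNumberChar c = isDigit c || c `elem` ".-eE"
```

With this sketch, `parseDoubleColumn "Coverage_on_Target_SNPs" "32.12;"` and `parseDoubleColumn "Latitude" "x18.93726"` return `Left` values whose text mirrors the conversion errors for lines 3 and 7 in the new output above.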

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 53.89610% with 213 lines in your changes missing coverage. Please review.

Project coverage is 59.56%. Comparing base (5306ba0) to head (5406a3b). Report is 35 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/Poseidon/ColumnTypes.hs | 42.98% | 70 Missing and 121 partials :warning: |
| src/Poseidon/CLI/Summarise.hs | 36.36% | 0 Missing and 7 partials :warning: |
| src/Poseidon/SequencingSource.hs | 79.41% | 3 Missing and 4 partials :warning: |
| src/Poseidon/ColumnTypesUtils.hs | 28.57% | 5 Missing :warning: |
| src/Poseidon/Janno.hs | 96.92% | 1 Missing and 1 partial :warning: |
| src/Poseidon/Package.hs | 85.71% | 0 Missing and 1 partial :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #307      +/-   ##
==========================================
- Coverage   60.40%   59.56%   -0.85%
==========================================
  Files          27       29       +2
  Lines        4031     4078      +47
  Branches      409      484      +75
==========================================
- Hits         2435     2429       -6
+ Misses       1187     1165      -22
- Partials      409      484      +75
```


nevrome commented 1 month ago

This must feel like a nightmare to review, but I think it looks worse than it is. I only worked in the Janno and the new ColumnTypes(Utils) modules. The other adjustments are just fallout from that.

Open questions on my end are:

stschiff commented 1 month ago

OK, thanks. I think I'm not too sceptical from the outset. Let me have a look.

stschiff commented 2 weeks ago

I think I would like to put reviewing this on short hold until we've made a decision about #309. Whether we want to remove the AESON instances is also worth a short discussion. We can do so, and at this point the whole template-haskell bit might become a bit obsolete, because there aren't that many type classes then anymore to automatically fill. But happy to discuss, we can also just do it the way you've done now.

I will be on vacation next week, and will pick this up again on Oct 21st.

stschiff commented 2 weeks ago

And I just wanted to say that from a user-perspective this PR is a huge improvement. I have already made use of it when I prepared a recent Janno file and used this version to point me to errors!

nevrome commented 1 week ago

Yes - I found myself also immediately switching to this branch for all practical work with the data. It makes spotting errors easier.

> Whether we want to remove the AESON instances is also worth a short discussion.

What do you mean? We need the FromJSON and ToJSON instances for the server-client interaction here: https://github.com/poseidon-framework/poseidon-hs/blob/5306ba0a07bbb1bfae41225a834219a2687b65a3/src/Poseidon/ServerClient.hs#L77-L112 This code does not compile if we remove a single instance definition for any of the janno types. Or is there an easy way to refactor this into an Aeson-free solution?

stschiff commented 5 days ago

Hmm, interesting. I think the constructor ApiReturnJanno must have been added pre-emptively. It is not actually used anywhere. Ever since I added the option to transfer any additional Janno columns as untyped Strings together with the Individual-Info, we dropped the idea of adding a dedicated Janno-Return API, I think. But good catch!

I have just checked: if I comment out the lines with these constructors in them, the code compiles just fine. So these were really added with no client actually using them.

I don't know whether we should go that step and remove all these Aeson-instances. They are definitely not used right now, and our JSON transfer just uses the Cassava Csv.ToField instances to serialise fields.

I guess we can remove them and then also remove this unused feature from the ServerClient.