poseidon-framework / poseidon-hs

A toolset to work with modular genotype databases in the Poseidon format
https://poseidon-framework.github.io/#/trident
MIT License
7 stars 2 forks source link

More informative consistency checks #283

Closed nevrome closed 9 months ago

nevrome commented 11 months ago

... and other ideas to make modifying .janno files less painful with better validation.

Work in progress. Not sure if all of these ideas are useful.

codecov[bot] commented 11 months ago

Codecov Report

Attention: 42 lines in your changes are missing coverage. Please review.

Comparison is base (46c7114) 68.63% compared to head (b7acba7) 68.32%. Report is 4 commits behind head on master.

Files Patch % Lines
src/Poseidon/Janno.hs 53.24% 30 Missing and 6 partials :warning:
src/Poseidon/SequencingSource.hs 58.33% 4 Missing and 1 partial :warning:
src/Poseidon/Utils.hs 0.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #283 +/- ## ========================================== - Coverage 68.63% 68.32% -0.31% ========================================== Files 25 25 Lines 3341 3375 +34 Branches 382 376 -6 ========================================== + Hits 2293 2306 +13 - Misses 666 693 +27 + Partials 382 376 -6 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

nevrome commented 11 months ago

As indicated above this PR introduces a number of changes to improve the error reporting for broken .janno files. @AyGhal recently contacted me about various error messages she encountered when preparing packages for the PCA. Some of them were not terribly useful. So here's what I did to mitigate this issue:

  1. When cassava encounters a parsing issue in an individual .janno file field it returns an error. For many more specific fields with custom types we created smart constructors with explicit error reporting (called by the respective parseField functions). For them the error messages are good and useful. But for types without this additional specification, e.g. simple Int or Double fields, the messages are very simple, e.g.:

    parse error (Failed reading: conversion error: expected Int, got "430;" (incomplete field parse, leftover: [59]))

    Especially the list of ASCII codes for the unparsed leftover is disappointing. cassava unfortunately does not expose the components of this message in a useful error type. It only returns a string. I thus went ahead and parsed this string to compile something slightly more readable:

    parse error in one column (expected data type: Int, broken value: "430;", problematic characters: ";")

    A pretty disgusting solution and incomplete at best (I don't see how to get the column name into this message), but a bit better than what we had so far.

  2. A problem that was raised multiple times by @AyGhal was the exact behaviour of the consistency check for radiocarbon dates in checkC14ColsConsistent. I reworked that now into a tree of allowed and forbidden cases, whose leaves feature clear error messages (beyond the pretty useless The Date_* columns are not consistent). For example if the the Date_Type is not set to C14, but we have radiocarbon date information in the data columns, we now get this:

    Date_Type is not "C14", but either Date_C14_Uncal_BP or Date_C14_Uncal_BP_Err are not empty.

    The check's logic was changed in this process and I would like to ask you to take a close look into it in the review. I implemented the same system for Contamination_* and the Relation_* columns, but the validation logic is much simpler there.

  3. The most interesting/controversial change, finally, is the way these consistency checks are called. So far they have been part of an Either based system that could only fail or forward the valid .janno row. checkJannoRowConsistency and the functions below it now live in a little monad stack of Except and Writer, which allows to record both failure and (!) warnings.

    type JannoRowLog = E.ExceptT String (W.Writer JannoRowWarnings)

    I created this, because I encountered a critical omission in checkC14ColsConsistent, which allowed the Date_Type to be set to C14 even if Date_C14_Uncal_BP and Date_C14_Uncal_BP_Err are empty. Two packages in the PCA (2022_Gretzinger_AngloSaxons and 2020_Nagele_Caribbean) feature this invalid case and this brings us exactly in the kind of trouble discussed in #276. Through the new monad stack I now report only a warning for these packages instead of refusing to load them, thus propagating solution 1. mentioned there. What do you think?

I also quickly checked SequencingSource.hs and removed some unused code there to keep the two modules for .janno and .ssf files somewhat consistent.

stschiff commented 11 months ago

Wow, OK, I will need some time to review this. Happy to discuss priorities, but I can see how this will be a useful addition!

nevrome commented 9 months ago

@stschiff Just a quick reminder that I'm waiting for a review here :+1:

stschiff commented 9 months ago

On it! Likely not finish today though.