Closed nevrome closed 9 months ago
Attention: 42 lines
in your changes are missing coverage. Please review.
Comparison is base (
46c7114
) 68.63% compared to head (b7acba7
) 68.32%. Report is 4 commits behind head on master.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
As indicated above this PR introduces a number of changes to improve the error reporting for broken .janno files. @AyGhal recently contacted me about various error messages she encountered when preparing packages for the PCA. Some of them were not terribly useful. So here's what I did to mitigate this issue:
When cassava encounters a parsing issue in an individual .janno file field it returns an error. For many more specific fields with custom types we created smart constructors with explicit error reporting (called by the respective parseField
functions). For them the error messages are good and useful. But for types without this additional specification, e.g. simple Int
or Double
fields, the messages are very simple, e.g.:
parse error (Failed reading: conversion error: expected Int, got "430;" (incomplete field parse, leftover: [59]))
Especially the list of ASCII codes for the unparsed leftover is disappointing. cassava unfortunately does not expose the components of this message in a useful error type. It only returns a string. I thus went ahead and parsed this string to compile something slightly more readable:
parse error in one column (expected data type: Int, broken value: "430;", problematic characters: ";")
A pretty disgusting solution and incomplete at best (I don't see how to get the column name into this message), but a bit better than what we had so far.
A problem that was raised multiple times by @AyGhal was the exact behaviour of the consistency check for radiocarbon dates in checkC14ColsConsistent
. I reworked that now into a tree of allowed and forbidden cases, whose leaves feature clear error messages (beyond the pretty useless The Date_* columns are not consistent
). For example if the the Date_Type
is not set to C14
, but we have radiocarbon date information in the data columns, we now get this:
Date_Type is not "C14", but either Date_C14_Uncal_BP or Date_C14_Uncal_BP_Err are not empty.
The check's logic was changed in this process and I would like to ask you to take a close look into it in the review. I implemented the same system for Contamination_*
and the Relation_*
columns, but the validation logic is much simpler there.
The most interesting/controversial change, finally, is the way these consistency checks are called. So far they have been part of an Either
based system that could only fail or forward the valid .janno row. checkJannoRowConsistency
and the functions below it now live in a little monad stack of Except
and Writer
, which allows to record both failure and (!) warnings.
type JannoRowLog = E.ExceptT String (W.Writer JannoRowWarnings)
I created this, because I encountered a critical omission in checkC14ColsConsistent
, which allowed the Date_Type
to be set to C14
even if Date_C14_Uncal_BP
and Date_C14_Uncal_BP_Err
are empty. Two packages in the PCA (2022_Gretzinger_AngloSaxons
and 2020_Nagele_Caribbean
) feature this invalid case and this brings us exactly in the kind of trouble discussed in #276. Through the new monad stack I now report only a warning for these packages instead of refusing to load them, thus propagating solution 1. mentioned there. What do you think?
I also quickly checked SequencingSource.hs
and removed some unused code there to keep the two modules for .janno and .ssf files somewhat consistent.
Wow, OK, I will need some time to review this. Happy to discuss priorities, but I can see how this will be a useful addition!
@stschiff Just a quick reminder that I'm waiting for a review here :+1:
On it! Likely not finish today though.
... and other ideas to make modifying .janno files less painful with better validation.
Work in progress. Not sure if all of these ideas are useful.