Closed stschiff closed 8 months ago
Attention: Patch coverage is 64.51613%
with 33 lines
in your changes are missing coverage. Please review.
Project coverage is 68.22%. Comparing base (
0cdea0d
) to head (04c557e
).:exclamation: Current head 04c557e differs from pull request most recent head d81a951. Consider uploading reports for the commit d81a951 to get more accurate results
Files | Patch % | Lines |
---|---|---|
src/Poseidon/CLI/Jannocoalesce.hs | 64.51% | 19 Missing and 14 partials :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
OK, @nevrome, this is ready for review.
Ok - so I tried to run it for the open minotaur-archive PR here: https://github.com/poseidon-framework/minotaur-archive/pull/2
Some minor observations:
[Info] Writing to file (directory will be created if missing): 2021_CarlhoffNature/2021_CarlhoffNature.janno
. Maybe a message for each major step, like loading, merging and finally writing.Beyond that and most importantly :exclamation::
Merging did nothing in this case, because the Poseidon_IDs differ. Generally because of the _MNT
suffixes we asked @TCLamnidis to add for each sample, but also more specifically because of other suffixes/insets like _SG
and _ss
or _TF
Thiseas added for different versions of the data. This differences render jannocoalesce
in its current form functionally useless for its intended purpose.
I see two ways to fix this. Either we go with your initial idea and allow coalescing (?) by additional columns, e.g. the Alternative_IDs
column, where we would then have to add the community-archive IDs for the specific minotaur-archive usecase. Or we add a feature to apply some regex to the Poseidon_IDs before merging.
Funny how this issue goes back to the question what exactly a Poseidon_ID is. Note that I wrote about this for the paper in Supp. Text 5.2.
I was thinking that the suffixes would cause issue. I wonder if a Main_ID
/ Master_ID
column could be used for the matching to the Poseidon_ID
?
Potentially yes. If it was an Individual_ID
we could coalesce all columns that are relevant on the individual-level. A Poseidon_ID
refers to something pretty specific (see my write-up for the paper) and a Main_ID
would have to refer to an individual to be helpful and not muddle the waters even more.
Edit: I still think the suffix is a good thing, btw.
OK, I think we could easily put a flag to ignore suffixes in the target, as well as custom key-lookup-columns. I can implement both of these changes easily. Thanks for the feedback.
OK, so I've added quite a number of features:
--sourceKey ARG the janno column to use as the source key
(default: "Poseidon_ID")
--targetKey ARG the janno column to use as the target key
(default: "Poseidon_ID")
--stripIdSuffix ARG an optional regular expression to identify parts of
the IDs to strip before matching between source and
target
this should now help with the cases we discussed above. Note that I had to put in one safety check: If Regexes are given, it is possible that Janno rows differ in the Poseidon_ID
, which would then be copied over with --force
. So I now added a check that columns titled Poseidon_ID
, or the given key-name in --targetKey
, respectively, are never overwritten.
It would be nice if you could try things out. I've added a number of automatic tests now to JannocoalesceSpec.hs
, which should cover the main functionality. I have not expanded the golden tests yet.
Note that I think we need a bit more output on the statistics how many rows have been edited. I will work on that now.
OK, I've added some log-output. Here is an example run on the test-data in the package:
stephan_schiffels@BIONB100-2 poseidon-hs % trident jannocoalesce -s test/testDat/testJannoFiles/normal_full.janno -t test/testDat/testJannoFiles/normal_subset.janno -o ~/Desktop/test.txt
trident v1.4.1.0 for poseidon v2.5.0, v2.6.0, v2.7.0, v2.7.1
https://poseidon-framework.github.io
[Info] matched target individual with ID XXX011 with source individual with ID XXX011
[Info] -- copied "DE" from column Country_ISO
[Info] -- copied "Paul;Peter" from column Alternative_IDs
[Info] -- copied "xxx" from column Country
[Info] -- copied "yyy" from column Relation_Note
[Info] -- copied "father_of;grandfather_of" from column Relation_Type
[Info] -- copied "first;second" from column Relation_Degree
[Info] -- copied "XXX012;I1234" from column Relation_To
[Info] matched target individual with ID XXX012 with source individual with ID XXX012
[Info] -- copied "FR" from column Country_ISO
[Info] -- copied "xxx" from column Country
[Info] -- copied "daughter_of" from column Relation_Type
[Info] -- copied "first" from column Relation_Degree
[Info] -- copied "XXX011" from column Relation_To
[Info] matched target individual with ID XXX013 with source individual with ID XXX013
[Info] -- copied "EG" from column Country_ISO
[Info] -- copied "Skeleton Joe" from column Alternative_IDs
[Info] -- copied "xxx" from column Country
[Info] -- copied "xxx" from column Relation_Note
[Info] -- copied "sixthToTenth" from column Relation_Degree
[Info] -- copied "XXX011" from column Relation_To
[Info] Writing to file (directory will be created if missing): /Users/stephan_schiffels/Desktop/test.txt
And I also checked whether the in-place writing works, and it does (if you omit the -o
output file).
So I think this is ready for another try, @nevrome
I looked at the code and found it quite brilliant. Only two comments:
-- copied
messages are a bit too verbose for logInfo
, especially for large packages. Maybe it's better to report them via logDebug
.--stripIdSuffix
requires a special kind of regex syntax. Maybe it would be better to mention this in the command line documentation.I see that the tests are failing because of some issue with the base monad. I'll look into that now to see if I can easily fix it. Afterwards I'll go back to my use-case in and for the minotaur archive.
So the first thing I tried was this
trident jannocoalesce -s ../../community-archive/2021_CarlhoffNature/2021_CarlhoffNature.janno -t 2021_CarlhoffNature.janno --stripIdSuffix "_*$"
This is obviously wrong (missing qualifier before the *
quantifier in --stripIdSuffix
) and nothing was matched. The target file was reordered a bit, but no information was added from the source. trident jannocoalesce
returned nothing but
[Info] Writing to file (directory will be created if missing): 2021_CarlhoffNature.janno
in this case, which left me wondering if it actually worked or not. I think we should report this (probably common) failure more clearly, maybe with a message like this:
[Warning] No matches across source (key: Poseidon_ID) and target (key: Poseidon_ID) with --stripIdSuffix "_*$"
When I used a correct regex ("_.*$"
) things worked like a charm :heart:
I wonder, though, if we should rename --stripIdSuffix
to (for example) --stripIdRegex
. Suffixes are certainly the most common mismatch, but with this implementation also prefixes or infixes can be removed.
I'll now prepare a PR with the changes I suggested here and in the comment above.
See my PR #284.
When running jannocoalesce
for 2020_Margaryan_Viking (https://github.com/poseidon-framework/minotaur-archive/pull/4#issuecomment-1864569689)) I realized that the info messages for each target row (in this case 441) are still pretty verbose... But I guess people can always turn them off with --logMode NoLog
, so maybe it's OK.
I want to point out how well the regex solution works here. Still a big fan of that.
Great, thanks. I'll take a look at your PR at once. Sorry for letting this hang stale for so long!
OK, just for the record, I think we discussed that you, @nevrome, would perhaps add a few small things to this PR based on your experience with https://github.com/poseidon-framework/minotaur-archive/pull/5. So I will stand back for now unless told otherwise
OK, I've finished the CLI API and functionality, and it compiles. I haven't tested it yet.