Jannocoalesce - Githubissues

stschiff commented 11 months ago

OK, I've finished the CLI API and functionality, and it compiles. I haven't tested it yet.

codecov[bot] commented 11 months ago

Codecov Report

Attention: Patch coverage is 64.51613% with 33 lines in your changes are missing coverage. Please review.

Project coverage is 68.22%. Comparing base (0cdea0d) to head (04c557e).

:exclamation: Current head 04c557e differs from pull request most recent head d81a951. Consider uploading reports for the commit d81a951 to get more accurate results

Files	Patch %	Lines
src/Poseidon/CLI/Jannocoalesce.hs	64.51%	19 Missing and 14 partials :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #282 +/- ## ========================================== - Coverage 68.32% 68.22% -0.11% ========================================== Files 25 26 +1 Lines 3375 3468 +93 Branches 376 390 +14 ========================================== + Hits 2306 2366 +60 - Misses 693 712 +19 - Partials 376 390 +14 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

stschiff commented 10 months ago

OK, @nevrome, this is ready for review.

nevrome commented 10 months ago

Ok - so I tried to run it for the open minotaur-archive PR here: https://github.com/poseidon-framework/minotaur-archive/pull/2

Some minor observations:

Overwriting in place worked there, so maybe my comment above was unnecessary.
It would be nice to have some more CLI output, not just [Info] Writing to file (directory will be created if missing): 2021_CarlhoffNature/2021_CarlhoffNature.janno. Maybe a message for each major step, like loading, merging and finally writing.
The columns of the output .janno file are reordered according to the column order specified in poseidon-hs. This is sensible, but maybe a bit surprising. Maybe we should announce this somewhere in the process or the documentation.

Beyond that and most importantly :exclamation:: Merging did nothing in this case, because the Poseidon_IDs differ. Generally because of the _MNT suffixes we asked @TCLamnidis to add for each sample, but also more specifically because of other suffixes/insets like _SG and _ss or _TF Thiseas added for different versions of the data. This differences render jannocoalesce in its current form functionally useless for its intended purpose.

I see two ways to fix this. Either we go with your initial idea and allow coalescing (?) by additional columns, e.g. the Alternative_IDs column, where we would then have to add the community-archive IDs for the specific minotaur-archive usecase. Or we add a feature to apply some regex to the Poseidon_IDs before merging.

Funny how this issue goes back to the question what exactly a Poseidon_ID is. Note that I wrote about this for the paper in Supp. Text 5.2.

TCLamnidis commented 10 months ago

I was thinking that the suffixes would cause issue. I wonder if a Main_ID / Master_ID column could be used for the matching to the Poseidon_ID?

nevrome commented 10 months ago

Potentially yes. If it was an Individual_ID we could coalesce all columns that are relevant on the individual-level. A Poseidon_ID refers to something pretty specific (see my write-up for the paper) and a Main_ID would have to refer to an individual to be helpful and not muddle the waters even more.

Edit: I still think the suffix is a good thing, btw.

stschiff commented 10 months ago

OK, I think we could easily put a flag to ignore suffixes in the target, as well as custom key-lookup-columns. I can implement both of these changes easily. Thanks for the feedback.

stschiff commented 10 months ago

OK, so I've added quite a number of features:

--sourceKey ARG          the janno column to use as the source key
                           (default: "Poseidon_ID")
  --targetKey ARG          the janno column to use as the target key
                           (default: "Poseidon_ID")
  --stripIdSuffix ARG      an optional regular expression to identify parts of
                           the IDs to strip before matching between source and
                           target

this should now help with the cases we discussed above. Note that I had to put in one safety check: If Regexes are given, it is possible that Janno rows differ in the Poseidon_ID, which would then be copied over with --force. So I now added a check that columns titled Poseidon_ID, or the given key-name in --targetKey, respectively, are never overwritten.

It would be nice if you could try things out. I've added a number of automatic tests now to JannocoalesceSpec.hs, which should cover the main functionality. I have not expanded the golden tests yet.

stschiff commented 10 months ago

Note that I think we need a bit more output on the statistics how many rows have been edited. I will work on that now.

stschiff commented 10 months ago

OK, I've added some log-output. Here is an example run on the test-data in the package:

stephan_schiffels@BIONB100-2 poseidon-hs % trident jannocoalesce -s test/testDat/testJannoFiles/normal_full.janno -t test/testDat/testJannoFiles/normal_subset.janno -o ~/Desktop/test.txt
trident v1.4.1.0 for poseidon v2.5.0, v2.6.0, v2.7.0, v2.7.1
https://poseidon-framework.github.io

[Info]    matched target individual with ID XXX011 with source individual with ID XXX011
[Info]    -- copied "DE" from column Country_ISO
[Info]    -- copied "Paul;Peter" from column Alternative_IDs
[Info]    -- copied "xxx" from column Country
[Info]    -- copied "yyy" from column Relation_Note
[Info]    -- copied "father_of;grandfather_of" from column Relation_Type
[Info]    -- copied "first;second" from column Relation_Degree
[Info]    -- copied "XXX012;I1234" from column Relation_To
[Info]    matched target individual with ID XXX012 with source individual with ID XXX012
[Info]    -- copied "FR" from column Country_ISO
[Info]    -- copied "xxx" from column Country
[Info]    -- copied "daughter_of" from column Relation_Type
[Info]    -- copied "first" from column Relation_Degree
[Info]    -- copied "XXX011" from column Relation_To
[Info]    matched target individual with ID XXX013 with source individual with ID XXX013
[Info]    -- copied "EG" from column Country_ISO
[Info]    -- copied "Skeleton Joe" from column Alternative_IDs
[Info]    -- copied "xxx" from column Country
[Info]    -- copied "xxx" from column Relation_Note
[Info]    -- copied "sixthToTenth" from column Relation_Degree
[Info]    -- copied "XXX011" from column Relation_To
[Info]    Writing to file (directory will be created if missing): /Users/stephan_schiffels/Desktop/test.txt

And I also checked whether the in-place writing works, and it does (if you omit the -o output file).

stschiff commented 10 months ago

So I think this is ready for another try, @nevrome

nevrome commented 10 months ago

I looked at the code and found it quite brilliant. Only two comments:

I think the -- copied messages are a bit too verbose for logInfo, especially for large packages. Maybe it's better to report them via logDebug.
You pointed out to me that --stripIdSuffix requires a special kind of regex syntax. Maybe it would be better to mention this in the command line documentation.

I see that the tests are failing because of some issue with the base monad. I'll look into that now to see if I can easily fix it. Afterwards I'll go back to my use-case in and for the minotaur archive.

nevrome commented 10 months ago

So the first thing I tried was this

trident jannocoalesce -s ../../community-archive/2021_CarlhoffNature/2021_CarlhoffNature.janno -t 2021_CarlhoffNature.janno --stripIdSuffix "_*$"

This is obviously wrong (missing qualifier before the * quantifier in --stripIdSuffix) and nothing was matched. The target file was reordered a bit, but no information was added from the source. trident jannocoalesce returned nothing but

[Info]    Writing to file (directory will be created if missing): 2021_CarlhoffNature.janno

in this case, which left me wondering if it actually worked or not. I think we should report this (probably common) failure more clearly, maybe with a message like this:

[Warning] No matches across source (key: Poseidon_ID) and target (key: Poseidon_ID) with --stripIdSuffix "_*$"

When I used a correct regex ("_.*$") things worked like a charm :heart:

I wonder, though, if we should rename --stripIdSuffix to (for example) --stripIdRegex. Suffixes are certainly the most common mismatch, but with this implementation also prefixes or infixes can be removed.

I'll now prepare a PR with the changes I suggested here and in the comment above.

nevrome commented 10 months ago

See my PR #284.

When running jannocoalesce for 2020_Margaryan_Viking (https://github.com/poseidon-framework/minotaur-archive/pull/4#issuecomment-1864569689)) I realized that the info messages for each target row (in this case 441) are still pretty verbose... But I guess people can always turn them off with --logMode NoLog, so maybe it's OK.

I want to point out how well the regex solution works here. Still a big fan of that.

stschiff commented 9 months ago

Great, thanks. I'll take a look at your PR at once. Sorry for letting this hang stale for so long!

stschiff commented 8 months ago

OK, just for the record, I think we discussed that you, @nevrome, would perhaps add a few small things to this PR based on your experience with https://github.com/poseidon-framework/minotaur-archive/pull/5. So I will stand back for now unless told otherwise

poseidon-framework / poseidon-hs

Jannocoalesce #282

Codecov Report