[REVIEW]: relped: Build Relatedness Pedigrees

whedon commented 4 years ago

Submitting author: @RHagenson (Ryan Hagenson) Repository: https://github.com/rhagenson/relped Version: v1.0.0 Editor: @marcosvital Reviewer: @djmitche, @rabdill Archive: Pending

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/b5c476c1ce08df5db16410accd7e09de"><img src="https://joss.theoj.org/papers/b5c476c1ce08df5db16410accd7e09de/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/b5c476c1ce08df5db16410accd7e09de/status.svg)](https://joss.theoj.org/papers/b5c476c1ce08df5db16410accd7e09de)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@djmitche & @rabdill, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @marcosvital know.

✨ Please try and complete your review in the next two weeks ✨

Review checklist for @djmitche

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@RHagenson) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @rabdill

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@RHagenson) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[ ] Installation: Does installation proceed as outlined in the documentation?
[ ] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[ ] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[ ] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

whedon commented 4 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @djmitche, @rabdill it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 4 years ago

Reference check summary:

OK DOIs

- 10.1111/j.1471-8286.2006.01256.x is OK
- 10.1016/j.ajhg.2016.05.020 is OK
- 10.2307/2409206 is OK
- 10.1093/bioinformatics/bts460 is OK

MISSING DOIs

- None

INVALID DOIs

- None

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

marcosvital commented 4 years ago

Dear @djmitche and @rabdill, thank you very much for accepting review this submission for JOSS.

Please check the instructions and checklists above (I see that @djmitche already started), and let me know if you need any assistance. You can also tag @rhagenson if you need to ask specific questions about the submission.

djmitche commented 4 years ago

I'm having a bit of trouble figuring out what this does, as someone not familiar with this kind of analysis. The README is very clear about input formats, but for example doesn't explain what "relatedness" means. Is there another resource that could help me understand this sort of thing?

rhagenson commented 4 years ago

@djmitche Great question and a bit of oversight on my part assuming relatedness is more widely known. In short, relatedness is the measure of how much two individuals overlap genetically -- ranging from 0 (no relation) to 1 (clones). relped began its lifecycle as a quasi-extension to a tool called ML-Relate, which is where the codes under Relatedness originated from.

Perhaps the best reference for understanding relatedness is "Estimating relatedness using genetic markers" by Queller and Goodnight.

Does that help? Do you believe I need a brief section explaining relatedness in the README?

marcosvital commented 4 years ago

@djmitche, I think that would be nice to explain that on the manuscript's summary.

It's actually on JOSS guidelines, the summary should "have a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided".

So a quick and simple explanation on relatedness should be there - and perhaps a slightly more detailed explanation could fit in the README.

rabdill commented 4 years ago

Thanks for clarifying, @rhagenson. I agree with @marcosvital about adding some of this to the project—my only note so far was that there was very little background for how this could be applied. The software documentation itself is helpful, and I'm sure someone already doing this work would see how it fits into their workflow, but the description assumes a lot of prior knowledge.

rhagenson commented 4 years ago

@rabdill, @djmitche, and @marcosvital I will add a briefing on relatedness to the paper and README to frame relped.

@rabdill, specifically, when you say "there was very little background for how this could be applied" do you mean you had difficulties identifying the usefulness of relped, or where it fits in a hypothetical project?

If it would be helpful, I can put a few sentences in the paper about where the methodology was required. We developed relped to build the pedigree network of hundreds of captive animals to inform national breeding programs. However, this is not the only use of relped so I did not force this context into either the paper or documentation -- where it might distract from the form and function of the software itself.

rhagenson commented 4 years ago

I have shared relped (post-entering review) with a few agriculture former colleagues, conservationists, and zoo/aquarium staff as I can see where it would be useful in those areas. Already have a couple Thank You's for building it so I am confident that when @rabdill said "someone already doing this work would see how it fits into their workflow" that is very much true.

rhagenson commented 4 years ago

@whedon generate pdf

@rabdill, @djmitche, and @marcosvital, the update to add a brief introduction to relatedness in both manuscript and documentation has been completed.

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

arfon commented 4 years ago

Dear authors and reviewers

We wanted to notify you that in light of the current COVID-19 pandemic, JOSS has decided to suspend submission of new manuscripts and to handle existing manuscripts (such as this one) on a "best efforts basis". We understand that you may need to attend to more pressing issues than completing a review or updating a repository in response to a review. If this is the case, a quick note indicating that you need to put a "pause" on your involvement with a review would be appreciated but is not required.

Thanks in advance for your understanding.

Arfon Smith, Editor in Chief, on behalf of the JOSS editorial team.

rhagenson commented 4 years ago

As far as I know, I am as available as usual so no immediate need to pause. However, have no expectation of the same reality among this submission's editor and reviewers.

Please be safe and protect yourselves; this submission can wait for everyone to be happy and healthy during these times. =)

On Fri, Mar 13, 2020, 7:22 PM Arfon Smith notifications@github.com wrote:

Dear authors and reviewers

We wanted to notify you that in light of the current COVID-19 pandemic, JOSS has decided to suspend submission of new manuscripts and to handle existing manuscripts (such as this one) on a "best efforts basis". We understand that you may need to attend to more pressing issues than completing a review or updating a repository in response to a review. If this is the case, a quick note indicating that you need to put a "pause" on your involvement with a review would be appreciated but is not required.

Thanks in advance for your understanding.

Arfon Smith, Editor in Chief, on behalf of the JOSS editorial team.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openjournals/joss-reviews/issues/2124#issuecomment-598981346, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJJ6PZWI7D6QGZ5GOYT46LRHLE4NANCNFSM4K5FODGA .

rabdill commented 4 years ago

Hi! Working my way through the rest of the review now, thank you for your patience. I have a few minor notes, but I ran into something that might be a bigger issue, so I'll just post that until I can make more progress.

I'm not clear on whether this is a problem or not, but I noticed that the inferred connections are not consistent between runs. I realize the layout of the diagrams may change, but the readme and manuscript state that all connections will be the same between runs. Not all of the differences I found could be fixed by just adding the "undirected" option, so I wanted to bring this up just in case.

I generated four diagrams using identical inputs with

relped build --relatedness="example-data/relatedness-nums-and-codes.csv" \
  --parentage="example-data/parentage.csv" --output "asdf.dot"

It generated output exactly as described, and I included the figures below with red circles around a few discrepancies I noticed:

iteration 1: connection from M2 to F3
- iteration 2: M2 connected to F3, then another connection from F3 back to M2 (should this happen?) not present in the other iterations
- iterations 3 and 4: no connection between M2 and F3, but both have a connection from F7 to F3 instead.
iteration 1 and 3: connection from O4 to O11
- iteration 2: no connection
- iteration 4: O11 connected to O4 in the opposite direction
iteration 1: no connection between M1 and F8
- iterations 2 and 3: connection from F8 to M1
- iteration 4: M1 connected to F8 in the opposite direction
iterations 1 and 2: connection from F2 to M2
- iterations 3 and 4: connection from M2 to F2 in the opposite direction

I thought maybe this would go away with more data, so I tried running it again including the demographics file, but I found several inconsistent connections between the first 2 iterations there as well. @rhagenson, would you mind clarifying if this is expected behavior? Please let me know if it'd be helpful to have this in a different format too; I'd be happy to send it some other way. Thanks.

rhagenson commented 4 years ago

@rabdill, I am investigating this right now one piece at a time (so expect follow-up messages).

On M2->F3 not appearing consistently: it appears what is happening is that when M2->F3 does not appear, F7->F3 appears. What I suspect is happening is that the quasi-random order of which edges are pruned is "deciding" whether M2->F3 or F7->F3 is "chosen" for the final plot. In the input that you are using, both M2 and F7 are listed as half-siblings (HS) of F3. Given it is expected behavior (in my opinion) that ALL codified entries appear in the output, I would say that this is a bug and I need to investigate when/where codified entries are being removed.

rhagenson commented 4 years ago

Loops, such as in iteration 2 should be removed already. That one is going to take some head work to solve as there is already a cycle removal step.

rhagenson commented 4 years ago

Loops, such as in iteration 2 should be removed already. That one is going to take some head work to solve as there is already a cycle removal step.

I think I know what I did here. The step did exist, but then at a later stage of development I removed it as the (mistakenly) thought the improved pruning covered it already so removed what was thought of as a unnecessary repetition of work.

rhagenson commented 4 years ago

On O4->O11: the direction switch is expected behavior. Without additional parentage or demographic information, a direction is chosen. In the input you use, although you provide parentage there is no way to infer the age of O4 and O11 to direct it the same way on every iteration. However, this relationship dropping out is strange. I will need to find a pattern to when it drops out to determine the solution. My initial suspicion is that either the relationship is not being added when it should, or it is being removed because there exists some other path that is, by chance, weighted higher.

rhagenson commented 4 years ago

On the M1 and F8 disconnect, I suspect the same programmatic disconnect as M2->F3 or F7->F3 because M1->F8 is a codified entry of half-sibling. Ensuring any codified entries cannot be dropped (as I would expect from the program as a user) will fix this issue.

rhagenson commented 4 years ago

One the F2 to M2 disconnect. See the above comment on M1 and F8.

Ensuring any codified entries cannot be dropped (as I would expect from the program as a user) will fix this issue

rhagenson commented 4 years ago

Overall, @rabdill, it seems you IDed some regressions in the program logic which predated effective testing for such regressions.

The two remedies that should fix all the ID problems are:

Fixing the loop-removal procedure to ensure no loops between different unknown/inferred individuals among known individuals
Specially mark any codified relationship and ensure those relationships cannot be removed

Thank you for catching these issues!

djmitche commented 4 years ago

I feel like I'm playing catch-up - thanks for your detailed analysis @rabdill! A few minor notes:

Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

I think 2 and 3 are covered, but 1 is not. GitHub has suggestsions for both a contributing guide and a CoC. The CoC probably isn't going to see any use at this stage, but the few times I have had to call on a CoC I've been glad it was already in place, so I'd suggest creating one.

Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).

I see that there is example data in the repo, but it's not clear which files belong together. Also, the only example usage is in the multiple-runs example. Is it expected that the software is always used in that mode, or would it make sense to give a single-run example that uses data from the example-data directory?

rhagenson commented 4 years ago

@djmitche

I can add a CONTRIBUTING doc and agree that a CoC is probably not going to be used at this stage, but could be useful to have in place well before it needs to be used.

The example-data files were named according to what option in the CLI they correspond to, therefore all the relatedness-* files go to --relatedness and same for parentage* and demographics*. As for multi versus single run, I would expect multi-run is the manner most users will run it in; I only have some much control over how GraphViz will layout a particular relped output so multi-run allows for rapidly building what should be the same semantic graph where connections are consistent, but layout changes (Rich pointed out this was not always the case). I will still add a single run example just above the multi run example as it shouldn't distract from usage by any means so is harmless at worst and helpful at best.

djmitche commented 4 years ago

If the multi-run option is the normal course, it's fine to list only that example. Perhaps a small rewording of the README would help others not wonder the same thing I did :)

rhagenson commented 4 years ago

I can work on the wording. Adding a single-run example for initial exploration of the tool for newcomers is probably more beneficial than not so will likely still change that part of the README.

djmitche commented 4 years ago

I'm sorry I've been slow to get back to this -- I really have no excuse. At this point I'm just waiting on CONTRIBUTING.

One thing in the paper that leaves me with a lingering uncertainty is "As more information is provided, connections are better resolved." Perhaps this is my ignorance of the subject matter, but given a body of relatedness information that does not uniquely identify a single pedigree, what does it mean to be "poorly resolved"? Or to ask another way, what is the prior state on which "better resolved' is based? I can imagine that a user of the software would want to know concretely, would it ever draw an edge between two individuals that would later be removed given additional (non-contradictory) information?

If that's already known to someone doing this kind of work, then please consider it a question for my own curiosity (and don't feel obliged to answer it!)

rhagenson commented 4 years ago

@djmitche I am not concerned about any delays right now. Not "normal" times and I would not be overly concerned during normal times anyhow. :smile:

As for the "better resolved" point, that is intentionally open wording that is understood by those doing the work already. In brief, because there is still some expert knowledge and outside information that can factor into building the "true pedigree" as opposed to the "pedigree network" that relped builds, I do not differentiate between any information known at the start of a project and information discovered during a project. For example, relped was written when we had to build the pedigree of captive animals, but despite best efforts we do not always "know" the relationship between captive animals (i.e., sometimes mistakes are made) so we started with relatedness scores built from genotypes then, as we confirmed the knowns we encoded those as parent-offspring (PO), full-sibling (FS), and half-sibling (HS). We built many pedigrees during the course of the project to balance those concerns which could not so readily be additional features of relped.

This progressive pedigree building is also the reason I was so shocked when Rich pointed out that some encoded input entries do not always appear in the final pedigree. This is a breaking bug which I still need to resolve.

Does that answer your inquiry?

I will get the CONTRIBUTING added later. I intend to "steal" the content from another project of mine here.

rhagenson commented 4 years ago

CONTRIBUTING has been added

djmitche commented 4 years ago

Thanks for the explanation. The key point for my review is "wording that is understood by those doing the work already" :)

djmitche commented 4 years ago

All of the checkboxes for my review are complete, so once @rabdill's concerns are addressed I think this is good to go.

arfon commented 4 years ago

:wave: all, just a friendly check-in to see how things are going with this review?

rabdill commented 4 years ago

Hi @arfon. I think my review is technically the missing piece. I'm not sure how this works within the JOSS review process, but I was hoping to wait for the completion of two issues that popped up during review: https://github.com/rhagenson/relped/issues/42 https://github.com/rhagenson/relped/issues/43

In short, one of the stated expectations of the software is that its conclusions are consistent across runs, but there are some cases where that isn't the case, and identical inputs can result in different outputs. I believe @rhagenson is working on it. Based on the functional claims of the software, I'm not comfortable checking the "Functionality" box until those things are fixed. I suppose an alternative would be to change the documentation to reflect a nondeterministic implementation, but it's my impression this would be a significant step backwards for the software itself.

djmitche commented 4 years ago

(I've marked my review as complete but I concur with Rich's concerns)

rhagenson commented 4 years ago

I agree that it should not be moved forward until I address the issues Rich noted. I apologize about any delays; I was correct that I am as available as ever, but that availability has not translated into productivity. My plan is to make significant progress on the aforementioned issues this week. Thank you to each of you for your patience with me.

kthyng commented 4 years ago

No problem that some items are being worked on, but I will mark this submission as "paused" in the meantime for our internal bookkeeping.

djmitche commented 4 years ago

(feel free to reassign me when this is again in motion -- un-assigning to remove it from my github dashboards)

djmitche commented 4 years ago

Well, so much for that idea...

arfon commented 3 years ago

I agree that it should not be moved forward until I address the issues Rich noted. I apologize about any delays; I was correct that I am as available as ever, but that availability has not translated into productivity. My plan is to make significant progress on the aforementioned issues this week. Thank you to each of you for your patience with me.

@rhagenson@ @marcosvital - it looks like this submission is very close to crossing the line here. Is there anything we can do to help you wrap this up :soon: ?

Well, so much for that idea...

@djmitche apologies for this. @whedon 'helpfully' goes around looking for missing assignments on JOSS issues every 24 hours so even if you unassigned yourself, you'll get reassigned by Whedon.

marcosvital commented 3 years ago

Hi, @rhagenson! Hope everything is fine with you. Did you made any progress on the two issues? Let us know if you have any news.

rhagenson commented 3 years ago

@marcosvital, I am still alive. I have many reasons for the delay, but the most informative is that in May I rapidly moved from my work in conservation (which this submission was a part of) to being a dedicated COVID-19 data analyst with my local health department. I have spent the last few months consuming COVID-19 literature, responding to outbreaks, and building computational scaffolding to help get us through this pandemic. I have no intention on abandoning this work, but I also cannot say for certain how soon I will have the time and energy to resolve the final two issues.

Context, I am dedicated to the largest city affecting the blue line below (source): Screenshot from 2020-11-12 12-22-51

marcosvital commented 3 years ago

Hi, @rhagenson! Thank you for the update, I'm glad to know that you are fine. Since you are not abandoning the project, we will keep this submission as paused, and wait for you.

@djmitche and @rabdill, thank you so much for all the time you dedicated into reviewing this submission. This is probably going to be the last notification for a while, until @rhagenson get to work on this again. When that happens, I can make the final checks if you happen to not be available anymore.

djmitche commented 3 years ago

Hi folks -- should we consider closing this? It's been in my list of assigned issues for a long time....

marcosvital commented 3 years ago

Hi, @rhagenson! I hope everything is fine with you. Let us know if you have any plans to work again on this submission, ok?

marcosvital commented 3 years ago

Hi, @djmitche! Thank you for all the effort you put so far on this submission. While it is possible that Ryan is ready to work on it again, it is also possible that we will let this paused for a while longer. If you feel that is better to step out of the process at this point, it will be no problem at all: I can take the role of a reviewer as we move on.

If you choose to do that, my suggestion would be for you to just turn off the notifications from this issue - this way you will stay listed as a reviewer (I think this would be the best, since you already did a lot of work here).

arfon commented 3 years ago

:wave: @rhagenson – just checking in here to see if your intent is still to complete this process?

djmitche commented 3 years ago

My task-tracker says it's been 1.7 years here and no response. I think it's time to close this?

rhagenson commented 3 years ago

You know there is still a pandemic going on, right? And as mentioned above, I now work in epidemiology? Your task-tracker is my lowest priority.

@arfon My intent is still to complete this; that has not changed. However, nothing about my ability to reprioritize this has change either.

djmitche commented 3 years ago

Great, thanks for the reply!

arfon commented 3 years ago

@rhagenson – I appreciate that your attention is on other (very important) work, but JOSS does not have an open-ended commitment to you to wait for your updates (this submission is coming up to 2 years old now). As such, I'm going to go ahead and close this review. Should you find time to make these updates in the future we'd welcome a resubmission.

@djmitche @rabdill @marcosvital – many thanks for all of your efforts here.

openjournals / joss-reviews