[REVIEW]: molic: An R package for multivariate outlier detection in contingency tables

whedon commented 5 years ago

Submitting author: @mlindsk (Mads Lindskou) Repository: https://github.com/mlindsk/molic Version: v1.0.0 Editor: @csoneson Reviewer: @jdeligt, @jkanche Archive: 10.5281/zenodo.3475854

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/9fa65ced7bf3db01343d68b4488196d8"><img src="https://joss.theoj.org/papers/9fa65ced7bf3db01343d68b4488196d8/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/9fa65ced7bf3db01343d68b4488196d8/status.svg)](https://joss.theoj.org/papers/9fa65ced7bf3db01343d68b4488196d8)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@jdeligt & @jkanche, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @csoneson know.

✨ Please try and complete your review in the next two weeks ✨

Review checklist for @jdeligt

Conflict of interest

[x] As the reviewer I confirm that I have read the JOSS conflict of interest policy and that there are no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Version: v1.0.0
[x] Authorship: Has the submitting author (@mlindsk) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Authors: Does the paper.md file include a list of authors with their affiliations?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

Review checklist for @jkanche

Conflict of interest

[x] As the reviewer I confirm that I have read the JOSS conflict of interest policy and that there are no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Version: v1.0.0
[x] Authorship: Has the submitting author (@mlindsk) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Authors: Does the paper.md file include a list of authors with their affiliations?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

whedon commented 5 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @jdeligt, @jkanche it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 5 years ago

Attempting PDF compilation. Reticulating splines etc...

whedon commented 5 years ago

:point_right: Check article proof :page_facing_up: :point_left:

csoneson commented 5 years ago

@jdeligt, @jkanche - please perform your review here, following the instructions in the post above. Don't hesitate to ping me if you have questions. @mlindsk - please have a look at the points in the reviewer checklists above and make sure that all required parts are included in your submission, or add them already now :)

jdeligt commented 5 years ago

@mlindsk could you please confirm when you have gone through the checklist. I can't find a version/release; https://github.com/mlindsk/molic/issues/11.

mlindsk commented 5 years ago

@jdeligt I have now gone through the list and I have made a release v1.0.0.

jdeligt commented 5 years ago

@mlindsk I'll progress with the testing of the software but please update the version of the publication (v0.5.0) to match the github release

mlindsk commented 5 years ago

@whedon set v1.0.0 as version

whedon commented 5 years ago

I'm sorry @mlindsk, I'm afraid I can't do that. That's something only editors are allowed to do.

mlindsk commented 5 years ago

@csoneson Could you change the version as requested to v1.0.0. Thank you.

@mlindsk I'll progress with the testing of the software but please update the version of the publication (v0.5.0) to match the github release

ooo[bot] commented 5 years ago

:wave: Hey @mlindsk...

Letting you know, @csoneson is currently OOO until Monday, September 2nd 2019. :heart:

jdeligt commented 5 years ago

Hi @mlindsk, some small comments first; while you're on it please also update the version in your documentation to match. I also can't find a "Contributing" doc or text about community guidelines in the README. Could the proceeding reference point to this: https://arxiv.org/abs/1301.2267 to have a stable identifier?

Then a slightly bigger comment, I'm making some assumptions about what you mean please correct me if they are incorrect. To me both the example data and the suggested Genetics seem 'perfect world' examples. i.e. the 2 datasets are very clearly distinguishable (i.e. the minigaplotypes have been selected to distinguish these groups) and as such they are confidently identified as outliers. How does the algorithm perform on an individual that has mixed haplotypes? And something that I'm also wondering about (but this might be my unfamiliarity with the specific forensic implications), why are traditional approaches like phylogenetics / Hemming distance / PCA / kmers / ... not useful to distinguish the groups? I can see how the algorithm could offer a generalized approach to outlier detection for non-categorical data but I struggle to see the added value for the genetics case.

mlindsk commented 5 years ago

Hi @jdeligt

Version should now be updated (hopefully all places now). I am not sure, that adding a contributing doc would add much. Would it suffice if (in README) I include a section "Contribute, issues, and support" as done at https://github.com/mikldk/malan ?

According to the reference. Are you asking if I could skip the reference to https://doi.org/10.1109/ictai.2004.100 ? If so; it adds a crucial mistake from https://arxiv.org/abs/1301.2267 and cannot be left out. They complement each other.

It is true, that Europeans and Africans are easy to distinguish based on these haplotypes. However, it is not that easy for populations that are more "similar" as seen in Table 1 in the paper.md. Even though the haplotypes are selected for discriminatory purposes, we still need a (probabilistic) model to decide whether or not a profile is an outlier. The outlier model is only "as good as data" allow. What the model adds, is the possibility to take advantages of the mutual relationships among all variables. If one falsely assume independence between all variables the results are much worse (see Table 1).

Our method is an outlier test, which the models you refer to are not. Moreover, our model has the ability to investigate why a profile was declared an outlier since we have this nice graphical representation.

The method is really not for distinguishing groups but for outlier detection - a profile might not be declared as an outlier in several populations.

Within the genetic world our method is a direct generalization of the community accepted method https://www.fsigeneticssup.com/article/S1875-1768(17)30216-0/fulltext which only handles independent SNPs (a graph with no edges)

Also the software is intended for completely general purposes. The only requirement is data is categorical.

I hope this clarifies things - just ping me again if something is unclear.

jdeligt commented 5 years ago

Hi @mlindsk, I'll try to clarify and answer the questions in order.

(Contributing) That would be the minimal thing to do, it helps people in understanding what model you prefer and what to expect. Some guidelines/examples are here: https://help.github.com/en/articles/setting-guidelines-for-repository-contributors
(Reference) I was refeering to "deshpande01_efficient" which at the moment does not have a DOI in paper.bib
(Genomics) I'm not debating wether the algorithm is a usefull outlier detection tool. I'm debating 2 things, a. Do the provided examples sufficiently show that it is a sensitive / usefull outlier detection tool for complex data. b. Is the biological use case you describe apporiate to substantiate your claim of it being usefull (I'll ellaborate a bit more below).

To prefent the review from becoming a clah of fields I would suggest adding some text to clarify why this method is relevant for this problem within the particular setting and to put it in context of the wider population genetics field. And please review wether the comparison you make is fair/valid when taking into acocunt the biological constraints that informed the data selection. You could consider to just hava a statement about how this approach properly accounts for the underlying biology,

(Genomics extened) The loci you study have been selected to be in LD and therfore linked, this means that biology dictates that performance will be worse if not considered as such. Therefor an algorithm that takes this into account will always do better than one that is not. Based on the sole fact that it approriatly accounts for the relationships. This means that your algorithm is the apporiate choice for this problem, not that the performace is better. The other approach is simply not fit to do this. Therefore the comparison (to mee) is mute.

(Ancestry extended) As indicated I cannot speak from a forencics background but this method is not the accepted standerd in human genomics and population genetics (a few examples; https://www.nature.com/articles/nature18964, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4836868/, https://www.nature.com/articles/s41586-018-0579-z). This makes it difficult for me to see why an outlier detection would be the approriate choice for forencics. I.e. I would expect that forencics would need an assignment of each haplotype to its respective ancestry to support identification. I.e. a mixed haplotype being an outlier for all populations does not seem the most usefull (again I can be horribly wrong but from a biology perspective this would be what I would expect).

Hope this helps to clarify where I'm coming from and provide some more context to the suggested changes to the 'Statement of need'.

All the best, Joep

mlindsk commented 5 years ago

@whedon generate pdf

whedon commented 5 years ago

Attempting PDF compilation. Reticulating splines etc...

whedon commented 5 years ago

:point_right: Check article proof :page_facing_up: :point_left:

mlindsk commented 5 years ago

Hi @mlindsk, I'll try to clarify and answer the questions in order.

1. (Contributing) That would be the minimal thing to do, it helps people in understanding what model you prefer and what to expect. Some guidelines/examples are here: https://help.github.com/en/articles/setting-guidelines-for-repository-contributors

Ok. Agreed.

2. (Reference) I was refeering to "deshpande01_efficient" which at the moment does not have a DOI in paper.bib

Ok.

3. (Genomics) I'm not debating wether the algorithm is a usefull outlier detection tool. I'm debating 2 things, a. Do the provided examples sufficiently show that it is a sensitive / usefull outlier detection tool for complex data. b. Is the biological use case you describe apporiate to substantiate your claim of it being usefull (I'll ellaborate a bit more below).

We agree that data could be more complex. However, increasing the complexity in the examples does not necessarily help the reader in understanding how the software works.

To prefent the review from becoming a clah of fields I would suggest adding some text to clarify why this method is relevant for this problem within the particular setting and to put it in context of the wider population genetics field. And please review wether the comparison you make is fair/valid when taking into acocunt the biological constraints that informed the data selection. You could consider to just hava a statement about how this approach properly accounts for the underlying biology,

The DGMs are fitted separately within in each population and haplotype. For a specific population, the fitted DGMs are merged to a union graph. Thus, we take into account the biological nature. I.e. we exploit that haplotypes are independent between chromosomes.

(Genomics extened) The loci you study have been selected to be in LD and therfore linked, this means that biology dictates that performance will be worse if not considered as such. Therefor an algorithm that takes this into account will always do better than one that is not. Based on the sole fact that it approriatly accounts for the relationships. This means that your algorithm is the apporiate choice for this problem, not that the performace is better. The other approach is simply not fit to do this. Therefore the comparison (to mee) is mute.

(1)

Yes, our expectation was indeed that the model would outperform the one neglecting dependencies. But the extend to which it outperforms the model with independencies is not obvious before doing the tests. Also, currently there are no other outlier detection tool using decomposable graphical models (DGMs). And the use of DGMs allow for a solid and well understood statistical way for reading off conditional independencies among the variables. One model that we can compare to though is the saturated model (a complete graph) corresponding to using frequency estimates. However, for high-dimensional data the saturated model breaks down since you would need an impossible amount of observations to catch rare events. Our approach remedies this, by focusing on local information (cliques in the fitted graph) from which it is way easier to investigate the nature of an outlier.

(Ancestry extended) As indicated I cannot speak from a forencics background but this method is not the accepted standerd in human genomics and population genetics (a few examples; https://www.nature.com/articles/nature18964, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4836868/, https://www.nature.com/articles/s41586-018-0579-z). This makes it difficult for me to see why an outlier detection would be the approriate choice for forencics. I.e. I would expect that forencics would need an assignment of each haplotype to its respective ancestry to support identification. I.e. a mixed haplotype being an outlier for all populations does not seem the most usefull (again I can be horribly wrong but from a biology perspective this would be what I would expect).

(2)

What you refer to here is population structure which is not the same thing as an outlier test. In the forensic world, one could reach out for the STRUCTURE algorithm (https://www.ncbi.nlm.nih.gov/pubmed/10835412) for markers not in LD e.g. The problem here is, that if the (unknown) profile does not originate from any of the populations/components in the database, it will be assigned to the most probable one (under the model) which is simply wrong. In a court (forensic science), it is then better to state that the profile can not be assumed to originate from any of the populations in the reference database. To put it another way; we may have exclusive sample populations, but we do not have an exhaustive database of reference populations. This statement is crucial.

I have added the last part of (1) and a "Expert Knowledge" section paper.md to clarify things. Let me know if I have completely misunderstood you.

Hope this helps to clarify where I'm coming from and provide some more context to the suggested changes to the 'Statement of need'.

I would like to thank you for the time spend to reflect on the genetic applications. I hope, that we converge towards a common understanding all though it is a difficult task for two different backgrounds.

Regards, Mads

jkanche commented 5 years ago

Sorry for the delay in reviewing this, but i plan to finish this over the weekend

mlindsk commented 5 years ago

@jkanche Could you try again - I forgot to "suggest" the packages "dplyr" and "pander" in the vignette.

jkanche commented 5 years ago

@mlindsk let me know if i was unclear on something

devtools::check() fails, add pander package to either dependencies or suggests in the description file & require this package on the vignette
I'd recommend adding a class efs that extends list using a proper class definition. Then create instances of this class in efs_init and efs_step.
Also wondering if you have looked into the tbl_graph class from the tidygraph package. This would simplify a lot of things code wise and compatibility with other packages.
in the vignette, why not use all the haplotypes instead of the first 25 on the larger example ?
To make this reproducible, how did you process the 1000 genomes data to generate the datasets in the package ? may be add this preprocessing code as a vignette ?
I also agree with @jdeligt, I wonder if the example dataset is good to illustrate the model
did you validate the results from your model ?
would a more appropriate example be metagenomic datasets ? for example, If you have taxonomic counts from samples across disease and control groups, can you use this model to find the outliers ?

jkanche commented 5 years ago

@mlindsk i deleted my previous msg, to provide all my comments in one but thanks for fixing the issue quickly

mlindsk commented 5 years ago

@mlindsk let me know if i was unclear on something

* `devtools::check()` fails, add `pander` package to either dependencies or suggests in the description file & require this package on the vignette

Perfect you didn't have pander installed and cached this one. Should be fixed.

* I'd recommend adding a class `efs` that extends list using a proper class definition. Then create instances of this class in `efs_init` and `efs_step`.

Let me see if I get you right here. You want me to make a superclass (basically a constructor), efs, from which the subclasses efs_init and efs_step should inherit from? And also, the class of efs should include the list class? I can make this work if this is what you are referring to as "proper class definition".

* Also wondering if you have looked into the `tbl_graph` class from the `tidygraph` package. This would simplify a lot of things code wise and compatibility with other packages.

I was not aware of this package; it seems interesting. What efs does is really just outputting an adjacency list/adjacency matrix which tidygraph users can easily grab and explore. I

* in the vignette, why not use all the haplotypes instead of the first 25 on the larger example ?

The run time of the algorithm increases and it does not add more to the example.

* To make this reproducible, how did you process the 1000 genomes data to generate the datasets in the package ? may be add this preprocessing code as a vignette ?

Yes; if I keep data; see below.

* I also agree with @jdeligt, I wonder if the example dataset is good to illustrate the model

Since you both ( @jdeligt ) I might have to find another dataset and completely remove the old one. Sure it is interesting what you mention below about metagenomic datasets. But this could easily extend to a new paper using this outlier model (somewhat overkill to explain the software). I cant stress enough that the method is really generic, and is a general method for doing outlier detection in contingency tables. I guess, if I asked a third person, his or her opinion would be to include a dataset from his or her respective field of research. So maybe I should really just simulate data to be neutral. But that is really not interesting. On the other hand, I could also find a handful of real world examples to cover a diverse range of applications. I am not sure what the best approach is. Don´t take this the wrong way; your comments on this are truly welcome. I'm just not quite sure of to proceed in order to make it general as explained.

* did you validate the results from your model ?

Validate in what sense ?

* would a more appropriate example be metagenomic datasets ? for example, If you have taxonomic counts from samples across disease and control groups, can you use this model to find the outliers ?

Yes, i believe that the model would be ideal for such a use-case since what you refer to would classify as contingency data.

mlindsk commented 5 years ago

@csoneson, @jdeligt, @jkanche

Hi all,

Based on the comments I have now decided to make some major changes to the API. I have also implemented backward selection now, and I will try to make a much more clearer API using a common class for these methods.

According to the example data. Do you all agree that it should be wiped, and therefore cannot be modified to fully justify the outlier test? If not, how would you like it to be ideally presented then? If yes, besides @jkanche s suggestion about metagenomic data, do you have any suggestions?

@csoneson Is there a time limit I should be aware of. I'll of course strive to do things as quick as possible.

csoneson commented 5 years ago

@csoneson Is there a time limit I should be aware of. I'll of course strive to do things as quick as possible.

No, not really. It would be great to have an estimate of the time frame, and/or if you could update us in the process.

mlindsk commented 5 years ago

@csoneson Is there a time limit I should be aware of. I'll of course strive to do things as quick as possible.

No, not really. It would be great to have an estimate of the time frame, and/or if you could update us in the process.

Ok. I'll inform you concurrently.

jdeligt commented 5 years ago

@mlindsk I would say the example can be useful to illustrate the generality of the method. I just think the genetics example would be helped by putting it in context. I.e. why it is not a population genetics tool. Why outlier detection is relevant to forensics but not to pop gen. And how this model does account for the biology while the method compared to does not. If this context is not provided I would not be comfortable with this section or the comparison.

If you want to further demonstrate the generality maybe include another example, for me it would be a plus but not a requirement for publication.

All the best, Joep

mlindsk commented 5 years ago

@csoneson , @jdeligt , @jkanche

Update: Work is very much in progress. The new api design is being implemented. Hopefully, if everything goes as planned, I will have something ready within the next two weeks. Hopefully also sooner.

csoneson commented 4 years ago

Hi @mlindsk - just checking in here. Could you give us an update on the progress of the revision?

mlindsk commented 4 years ago

Hi @csoneson - The new API is now finished. I need to make a new vignette and some changes to the paper. I'm a little swamped regarding teaching, but hopefully it wont be long until I'm done now. I have merged the new code to the master branch.

mlindsk commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

Attempting PDF compilation. Reticulating splines etc...

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

mlindsk commented 4 years ago

@jkanche I will provide the preprocessing code of 1000GP soon @jdeligt I hope I have accomodated your suggestion about generality/forensics/pop gen. According to DOIs; I don't see any DOI for "Efficient stepwise selection in decomposable models" which is why I still haven't included it (did I miss something?).

jdeligt commented 4 years ago

@mlindsk thanks for the update and considering our suggestions. I'm happy with how you've now framed the example and the statement of need. DOI's are indeed in order, sorry for confusion. I'm happy to sign off on it.

mlindsk commented 4 years ago

@jkanche Preprocessing script is now included along with the original 1000GP data

jkanche commented 4 years ago

@mlindsk Thanks for making the changes and adding the script. Looks great!

mlindsk commented 4 years ago

Good morning all @jdeligt @jkanche @csoneson

It seems that we are in a final stage. @jkanche could you check for installation? I believe this is the only checkbox left. @csoneson would you consider changing the version number to v.1.0.0?

csoneson commented 4 years ago

@jdeligt, @jkanche - thanks for your reviews! And thanks @jdeligt for letting us know that you are happy with the current state.

@mlindsk - I have sent a PR with some fixes to the references in the paper, and a couple of additional comments. The version number will be set once the submission is accepted and archived in Zenodo (it will not change in this issue).

mlindsk commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

Attempting PDF compilation. Reticulating splines etc...

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

mlindsk commented 4 years ago

@csoneson

@mlindsk - I have sent a PR with some fixes to the references in the paper, and a couple of additional comments.

The saturated model is now motivated - it is a natural "go to model", but I argue that one shouldn't use it.

mlindsk commented 4 years ago

@jkanche Is it possible for you to make the installation check? I believe this is one of last things in the process now.

jkanche commented 4 years ago

@mlindsk Sorry about that, I thought I already checked it off. I did test the package and it works! Thanks for all the work you put in!

mlindsk commented 4 years ago

@jkanche No problem. And thanks for pushing things in a better direction.

@csoneson FYI - the reviewers are now finished and all checkboxes are checked.

csoneson commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

Attempting PDF compilation. Reticulating splines etc...

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

csoneson commented 4 years ago

@whedon check references

openjournals / joss-reviews