openjournals / joss-reviews

Reviews for the Journal of Open Source Software

[REVIEW]: Contextualized: Heterogeneous Modeling Toolbox #6469

Closed · editorialbot closed this 1 month ago

editorialbot commented 3 months ago

Submitting author: @cnellington (Caleb Ellington)
Repository: https://github.com/cnellington/Contextualized
Branch with paper.md (empty if default branch): joss
Version: v0.2.8
Editor: @fabian-s
Reviewers: @holl-, @pescap
Archive: 10.5281/zenodo.11130703

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/eff44b075cc108df6151e5123975115f"><img src="https://joss.theoj.org/papers/eff44b075cc108df6151e5123975115f/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/eff44b075cc108df6151e5123975115f/status.svg)](https://joss.theoj.org/papers/eff44b075cc108df6151e5123975115f)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@holl- & @pescap, your review will be checklist-based. Each of you will have a separate checklist that you should update when carrying out your review. First of all, you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. For any questions/concerns, please let @fabian-s know.

Please start your review when you are able, and be sure to complete it within the next six weeks, at the very latest.

Checklists

📝 Checklist for @holl-

📝 Checklist for @pescap

editorialbot commented 3 months ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 3 months ago

Software report:

github.com/AlDanial/cloc v 1.90  T=0.12 s (680.9 files/s, 263179.1 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          44            903           1525           3951
Jupyter Notebook                16              0          22839           1264
CSS                              1             23             17            510
HTML                             1              7            110            348
Markdown                         7             60              0            196
TeX                              2             12              0            172
YAML                             5             18             33            121
CSV                              1              0              0            109
JavaScript                       1              9              2            109
TOML                             1              4              0             43
reStructuredText                 2             16             52              8
Bourne Shell                     2              0              0              4
SVG                              1              0              0              1
-------------------------------------------------------------------------------
SUM:                            84           1052          24578           6836
-------------------------------------------------------------------------------

Commit count by author:

   210  Caleb Ellington
   142  Ben Lengerich
    52  cnellington
    16  wtt102
     9  Andrea Rubbi
     5  juwayni lucman
     3  aaron10l
     2  PaulCCCCCCH
     1  Jannik Deuschel
editorialbot commented 3 months ago

Paper file info:

📄 Wordcount for paper.md is 862

✅ The paper includes a Statement of need section

editorialbot commented 3 months ago

License info:

🟡 License found: GNU General Public License v3.0 (Check here for OSI approval)

editorialbot commented 3 months ago

📄 Download article proof · 📄 View article proof on GitHub

editorialbot commented 3 months ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.48550/arXiv.2310.07918 is OK
- 10.48550/arXiv.2310.11340 is OK
- 10.1101/2023.12.01.569658 is OK
- 10.1016/j.jbi.2022.104086 is OK
- 10.1609/aaai.v34i03.5693 is OK

MISSING DOIs

- No DOI given, and none found for title: Varying-coefficient models
- No DOI given, and none found for title: Contextual Explanation Networks
- No DOI given, and none found for title: NOTMAD: Estimating Bayesian Networks with Sample-S...
- 10.1101/2020.06.25.20140053 may be a valid DOI for title: Discriminative Subtyping of Lung Cancers from Hist...
- No DOI given, and none found for title: Personalized Survival Prediction with Contextual E...

INVALID DOIs

- None
fabian-s commented 3 months ago

👋🏼 hi @cnellington @holl- @pescap -- this is the review thread for the paper. All of our communications will happen here from now on.

As a reviewer, the first step is to create a checklist for your review by entering

@editorialbot generate my checklist

at the top of a new comment in this thread.

These checklists contain the JOSS requirements. As you go over the submission, please check any items that you feel have been satisfied. The first comment in this thread also contains links to the JOSS reviewer guidelines.

The JOSS review is different from most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues and pull requests on the software repository. When doing so, please mention openjournals/joss-reviews#6469 so that a link is created to this thread (and I can keep an eye on what is happening). Please also feel free to comment and ask questions on this thread. In my experience, it is better to post comments/questions/suggestions as you come across them instead of waiting until you've reviewed the entire package.

We aim for reviews to be completed within about 2-4 weeks. Please let me know if any of you require some more time. We can also use EditorialBot (our bot) to set automatic reminders if you know you'll be away for a known period of time.

Please feel free to ping me (@fabian-s) if you have any questions/concerns.

fabian-s commented 3 months ago

@cnellington: while we wait for the reviews, please take a look at the missing DOIs above and see what's what.

cnellington commented 3 months ago

@editorialbot generate pdf

editorialbot commented 3 months ago

📄 Download article proof · 📄 View article proof on GitHub

cnellington commented 3 months ago

@editorialbot commands

editorialbot commented 3 months ago

Hello @cnellington, here are the things you can ask me to do:


# List all available commands
@editorialbot commands

# Get a list of all editors' GitHub handles
@editorialbot list editors

# Adds a checklist for the reviewer using this command
@editorialbot generate my checklist

# Set a value for branch
@editorialbot set joss-paper as branch

# Run checks and provide information on the repository and the paper file
@editorialbot check repository

# Check the references of the paper for missing DOIs
@editorialbot check references

# Generates the pdf paper
@editorialbot generate pdf

# Generates a LaTeX preprint file
@editorialbot generate preprint

# Get a link to the complete list of reviewers
@editorialbot list reviewers
cnellington commented 3 months ago

@editorialbot check references

editorialbot commented 3 months ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1016/j.jbi.2022.104086 is OK
- 10.1609/aaai.v34i03.5693 is OK
- 10.48550/arXiv.2310.11340 is OK
- 10.1101/2023.12.01.569658 is OK
- 10.48550/arXiv.2111.01104 is OK
- 10.1111/j.2517-6161.1993.tb01939.x is OK
- 10.48550/arXiv.1705.10301 is OK
- 10.1101/2020.06.25.20140053 is OK
- 10.48550/arXiv.1801.09810 is OK
- 10.48550/arXiv.2310.07918 is OK

MISSING DOIs

- None

INVALID DOIs

- None
holl- commented 3 months ago

Review checklist for @holl-

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

holl- commented 3 months ago

@cnellington From the GitHub statistics, Ben Lengerich @blengerich has contributed the most to the library, at least in terms of lines of code. Why isn't he first author? Also, Eric Xing is not on the list of contributors on GitHub. Did he actually contribute to the software?

cnellington commented 3 months ago

@holl- I authored the majority of our core Python packages under contextualized/, but our codebase also contains many .ipynb files for our documentation, website, and tutorials built with jupyter-book. @blengerich, in addition to making many contributions to our codebase, spearheaded our website/documentation efforts and did a large amount of standardized style formatting with black. The line count differences are mainly due to the differences between the .ipynb and .py file types, a few reorganizations of large notebook files into new directories, and automated formatting of large swathes of the codebase.

Notebook inflation examples:

Style formatting inflation:

Eric P. Xing is my PhD advisor, and Manolis Kellis is @blengerich's postdoc advisor. While they have not made direct contributions to the code, they have been essential in supporting and directing the software development.

holl- commented 3 months ago

@cnellington Okay, that's fine, then. 👍

holl- commented 3 months ago

@cnellington I have some comments on your easy demo notebook; see this issue.

holl- commented 3 months ago

@cnellington Looking at your test setup, it seems that baselines/tests.py is not being run. The following files have less than 85% coverage. Is there a good reason for this? Otherwise, could you add more tests?

Module                                          Stmts   Miss  Cover
contextualized\analysis\effects.py                206    190     8%
contextualized\analysis\embeddings.py              67     59    12%
contextualized\analysis\accuracy_split.py          39     33    15%
contextualized\baselines\networks.py              178    135    24%
contextualized\easy\ContextualizedNetworks.py      97     71    27%
contextualized\analysis\pvals.py                   31     22    29%
contextualized\analysis\utils.py                    6      3    50%
contextualized\dags\losses.py                      32     14    56%
contextualized\easy\ContextualGAM.py                9      4    56%
contextualized\modules.py                         105     43    59%
contextualized\dags\graph_utils.py                102     27    74%
contextualized\utils.py                            22      5    77%
contextualized\dags\lightning_modules.py          152     27    82%
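
For reference, a rough sketch of how coverage numbers like these can be collected with coverage.py, assuming the package's unittest-style test modules (the project's actual runner setup may differ):

import unittest

import coverage

# Measure statement coverage over the contextualized package only.
cov = coverage.Coverage(source=["contextualized"])
cov.start()

# Discover and run every in-package test module named like tests.py.
suite = unittest.defaultTestLoader.discover("contextualized", pattern="*tests.py")
unittest.TextTestRunner().run(suite)

cov.stop()
cov.report(show_missing=True)  # prints a per-module Stmts/Miss/Cover table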
holl- commented 3 months ago

@cnellington Your paper is missing an overview of the State of the field. In the 'Statement of need', could you add more concrete motivation for using this library and relate it to existing software solutions (incl. appropriate references)?

pescap commented 3 months ago

Review checklist for @pescap

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

fabian-s commented 3 months ago

> @cnellington Your paper is missing an overview of the State of the field. In the 'Statement of need', could you add more concrete motivation for using this library and relate it to existing software solutions (incl. appropriate references)?

I would like to strongly reinforce this, and to point out a missing reference from the stats literature that seems highly relevant in this context:

partykit (https://www.jmlr.org/papers/v16/hothorn15a.html) implements, inter alia, trees and random forests with regression/classification models in each terminal node.

cnellington commented 3 months ago

@fabian-s @holl- Thanks for the clarification. We previously had a more detailed and formal comparison of related work, but cut it after seeing that the example "Gala" submission provides only a high-level overview of the field by citing current usage. Most work related to Contextualized can be described by the generic method categories mentioned in our Statement of Need (Section 1); e.g. partykit and regression trees operate by data partitioning and are very similar in spirit to mixture modeling. I will re-add our formal problem definition and related works sections to clarify. Right now I am traveling, but you can expect an updated manuscript by next week at the latest.

cnellington commented 3 months ago

@editorialbot generate pdf

editorialbot commented 3 months ago

📄 Download article proof · 📄 View article proof on GitHub

editorialbot commented 3 months ago

📄 Download article proof · 📄 View article proof on GitHub

cnellington commented 3 months ago

@editorialbot check references

editorialbot commented 3 months ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1016/j.jbi.2022.104086 is OK
- 10.1609/aaai.v34i03.5693 is OK
- 10.48550/arXiv.2310.11340 is OK
- 10.1101/2023.12.01.569658 is OK
- 10.48550/arXiv.2111.01104 is OK
- 10.1111/j.2517-6161.1993.tb01939.x is OK
- 10.48550/arXiv.1705.10301 is OK
- 10.1101/2020.06.25.20140053 is OK
- 10.48550/arXiv.1801.09810 is OK
- 10.48550/arXiv.2310.07918 is OK
- 10.1214/aos/1017939139 is OK
- 10.1016/j.isci.2019.03.021 is OK
- 10.1080/01621459.2021.2000866 is OK
- 10.1093/bioinformatics/btr239 is OK
- 10.1214/09-AOAS308 is OK
- 10.1198/106186008X319331 is OK

MISSING DOIs

- None

INVALID DOIs

- None
cnellington commented 3 months ago

@editorialbot generate pdf

editorialbot commented 3 months ago

📄 Download article proof · 📄 View article proof on GitHub

cnellington commented 3 months ago

@editorialbot check references

cnellington commented 3 months ago

@editorialbot generate pdf

editorialbot commented 3 months ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1016/j.jbi.2022.104086 is OK
- 10.1609/aaai.v34i03.5693 is OK
- 10.48550/arXiv.2310.11340 is OK
- 10.1101/2023.12.01.569658 is OK
- 10.48550/arXiv.2111.01104 is OK
- 10.1111/j.2517-6161.1993.tb01939.x is OK
- 10.48550/arXiv.1705.10301 is OK
- 10.1101/2020.06.25.20140053 is OK
- 10.48550/arXiv.1801.09810 is OK
- 10.48550/arXiv.2310.07918 is OK
- 10.1214/aos/1017939139 is OK
- 10.1016/j.isci.2019.03.021 is OK
- 10.1080/01621459.2021.2000866 is OK
- 10.1093/bioinformatics/btr239 is OK
- 10.1214/09-AOAS308 is OK
- 10.1198/106186008X319331 is OK

MISSING DOIs

- No DOI given, and none found for title: partykit: A Modular Toolkit for Recursive Partytio...

INVALID DOIs

- None
editorialbot commented 3 months ago

📄 Download article proof · 📄 View article proof on GitHub

cnellington commented 3 months ago

@fabian-s @holl- I have updated the statement of need to include a formal problem definition and an overview of related work. The only missing DOI is for partykit, where the JMLR page did not have a DOI and the ACM DOI was invalid.

We will continue improving test coverage where possible while we wait for your feedback.

fabian-s commented 3 months ago

thx, statement of need / related work section seems fine to me now.

holl- commented 3 months ago

@cnellington The paper looks good now.

Could you explain the reason for having an easy module in your API? This seems to imply that everything else is hard to do. Having an easy start with a library is good but people shouldn't have to learn a different system for more involved tasks.

Assuming you have written an analysis using easy, is it possible to obtain the same results without using the easy API (possibly with more code)?

cnellington commented 3 months ago

@holl- The easy and analysis modules contain our user-facing APIs. The other modules are developer-facing, mainly written with PyTorch Lightning API patterns. The easy module internalizes training/evaluation of these PyTorch modules and exposes them through an SKLearn-like API wrapper. By definition, anything done in easy or analysis can be done with the developer-facing modules, but we do not expect users to know or use the other modules. Among developers, we make a point of keeping our internal codebase in line with the PyTorch ecosystem so that contextualized benefits internally from tools produced there.
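
For concreteness, a minimal sketch of the SKLearn-style workflow exposed by easy, using the ContextualizedRegressor wrapper (signatures abbreviated here; see the docs for the full interface):

import numpy as np

from contextualized.easy import ContextualizedRegressor

# Toy data: contexts C, predictors X, outcomes Y.
C = np.random.normal(size=(100, 2))
X = np.random.normal(size=(100, 5))
Y = np.random.normal(size=(100, 1))

model = ContextualizedRegressor()
model.fit(C, X, Y)  # training/validation loop handled internally

Y_hat = model.predict(C, X)           # predictions from context-specific models
betas, mus = model.predict_params(C)  # per-sample regression parameters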

holl- commented 3 months ago

@cnellington Okay, then how about calling that module model / models instead of easy?

Also I noticed that some classes have a plural name, e.g. ContextualizedBayesianNetworks. This is confusing. Does instantiating this class create multiple networks? If so, how about calling it a forest instead?

The documentation does not really explain the parameters, e.g. what are the bootstraps and archetypes? Could you add explanations or link to a page that introduces these terms?

pescap commented 2 months ago

> @holl- The easy and analysis modules contain our user-facing APIs. The other modules are developer-facing, mainly written with PyTorch Lightning API patterns. The easy module internalizes training/evaluation of these PyTorch modules and exposes them through an SKLearn-like API wrapper. By definition, anything done in easy or analysis can be done with the developer-facing modules, but we do not expect users to know or use the other modules. Among developers, we make a point of keeping our internal codebase in line with the PyTorch ecosystem so that contextualized benefits internally from tools produced there.

I had the same question.

Then I read in easy_regression.ipynb:

> Note: These contextualized.easy interfaces don't try to use best PyTorch practices (prioritizing easy use over computational efficiency), so it's recommended to use contextualized models for any heavy lifting (see this demo).

It would be beneficial to add this note (or a similar explanation) earlier in the docs or README.md.

What do you think about that?

cnellington commented 2 months ago

@holl- @pescap @fabian-s We've expanded our test coverage, including new tests for basic functionality and performance (e.g. contextualized.analysis.tests.TestPvals, contextualized.easy.tests.TestEasyRegression).

The most recent coverage metrics are below. All the files below 85% coverage can be justified: embeddings, accuracy_split, and effects are all visualization/reporting utilities that can't be covered by unit tests, but we do verify their functionality in the model analysis demo. The only other files below 85% are contextualized.dags.graph_utils, which has several options for noise type that don't need to be tested, and contextualized.functions, where we do not test constant-valued functions (i.e. f(x) = c).

Module                                              Stmts   Miss  Excl  Cover
contextualized/analysis/embeddings.py                 70     62     0    11%
contextualized/analysis/accuracy_split.py             39     33     0    15%
contextualized/analysis/effects.py                   208    143     0    31%
contextualized/dags/graph_utils.py                   102     23     0    77%
contextualized/functions.py                           32      6     0    81%
contextualized/regression/regularizers.py             19      2     0    89%
contextualized/analysis/bootstraps.py                 10      1     0    90%
contextualized/easy/wrappers/SKLearnWrapper.py        218     17     0    92%
contextualized/regression/datasets.py                 86      6     0    93%
contextualized/baselines/networks.py                 171     10     0    94%
contextualized/easy/ContextualizedRegressor.py         17      1     0    94%
contextualized/dags/lightning_modules.py             152      7     0    95%
contextualized/easy/ContextualizedNetworks.py          97      5     0    95%
contextualized/baselines/tests.py                     36      1     0    97%
contextualized/regression/lightning_modules.py       303     10     0    97%
contextualized/analysis/pvals.py                      51      1     0    98%
contextualized/analysis/tests.py                      86      1     0    99%
contextualized/dags/tests.py                         131      1     0    99%
contextualized/easy/tests.py                         224      1     0    99%
contextualized/regression/tests.py                   144      1     0    99%
contextualized/tests.py                              165      1     0    99%
contextualized/__init__.py                             7      0     0   100%
contextualized/analysis/__init__.py                    5      0     0   100%
contextualized/baselines/__init__.py                   1      0     0   100%
contextualized/dags/__init__.py                        7      0     0   100%
contextualized/dags/losses.py                         26      0     0   100%
contextualized/dags/trainers.py                        6      0     0   100%
contextualized/easy/ContextualGAM.py                   9      0     0   100%
contextualized/easy/ContextualizedClassifier.py        14      0     0   100%
contextualized/easy/__init__.py                        4      0     0   100%
contextualized/easy/wrappers/__init__.py               1      0     0   100%
contextualized/modules.py                             65      0     0   100%
contextualized/regression/__init__.py                 10      0     0   100%
contextualized/regression/losses.py                    7      0     0   100%
contextualized/regression/metamodels.py               76      0     0   100%
contextualized/regression/trainers.py                 19      0     0   100%
contextualized/utils.py                               21      0     0   100%
Total                                               2639    333     0    87%
cnellington commented 2 months ago

@holl- @pescap You can expect an update to the website/documentation in the next few days to better introduce common use cases and model parameters.

Regarding specific comments about naming conventions and contextualized.easy, we currently have a substantial user base and cannot rename these modules or classes, but we will explain the modules and their functionality better in our initial demos.

This naming convention is a result of our development process over the past few years. Originally, we planned to make the PyTorch-native API user-facing, with the SKLearn-style easy module as a complementary tool, but the easy modules got a much better reception from our user base. We've consolidated around this direction, so the easy module and related analysis utilities have become our public-facing API, but the naming conventions have stuck.

holl- commented 2 months ago

@cnellington Good to see the progress! While I still don't like the naming convention, I see your point and I'll accept the current namespaces if they are sufficiently documented.

fabian-s commented 2 months ago

@cnellington @holl- @pescap : thanks for these updates and comments, JOSS greatly appreciates your transparency and responsiveness.

@holl- @pescap: please remember to update the checkmarks in your checklist as issues are being solved. If necessary, please briefly clarify which points remain open and why / what still needs to be addressed.

cnellington commented 2 months ago

@holl- @pescap Documentation has been completely updated with intro tutorials, parameter descriptions, motivations and use cases, and API reference. https://contextualized.ml/docs/intro.html

I believe this takes care of all outstanding reviewer requests. Please let us know if you have any other comments or questions.

holl- commented 2 months ago

@cnellington Check out my comment on the Easy Demo thread! Otherwise, I'm happy with the changes!

pescap commented 2 months ago

> @holl- @pescap Documentation has been completely updated with intro tutorials, parameter descriptions, motivations and use cases, and API reference. https://contextualized.ml/docs/intro.html
>
> I believe this takes care of all outstanding reviewer requests. Please let us know if you have any other comments or questions.

Great! I have no further comments. Thank you!

holl- commented 1 month ago

@fabian-s All my concerns have been addressed. I've completed my checklist and have no further comments. I recommend accepting the paper.

fabian-s commented 1 month ago

@editorialbot generate pdf