[REVIEW]: EspressoDB: A scientific database for managing high-performance computing workflows

whedon commented 4 years ago

Submitting author: @ckoerber (Christopher Körber) Repository: https://github.com/callat-qcd/espressodb Version: v1.1.0 Editor: @gkthiruvathukal Reviewer: @remram44, @ixjlyons Archive: 10.5281/zenodo.3677432

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/d0342f15684b9a464faed7c59784f734"><img src="https://joss.theoj.org/papers/d0342f15684b9a464faed7c59784f734/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/d0342f15684b9a464faed7c59784f734/status.svg)](https://joss.theoj.org/papers/d0342f15684b9a464faed7c59784f734)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@remram44 & @ixjlyons, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

- Make sure you're logged in to your GitHub account
- Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @gkthiruvathukal know.

✨ Please try and complete your review in the next two weeks ✨

Review checklist for @remram44

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@ckoerber) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[ ] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @ixjlyons

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@ckoerber) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

whedon commented 4 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @remram44, @ixjlyons it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 4 years ago

Reference check summary:

OK DOIs

- 10.1109/sc.2018.00054 is OK
- 10.1109/SC.2018.00060 is OK
- 10.1038/s41586-018-0161-8 is OK
- 10.1103/PhysRevLett.121.172501 is OK
- 10.1051/epjconf/201817509007 is OK
- 10.1103/PhysRevD.82.094502 is OK

MISSING DOIs

- https://doi.org/10.1109/sc.2018.00058 may be missing for title: Simulating the weak death of the neutron in a femtoscale universe with near-Exascale computing

INVALID DOIs

- None

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

remram44 commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

gkthiruvathukal commented 4 years ago

@remram44, @ixjlyons: Just checking on progress with this review.

remram44 commented 4 years ago

@gkthiruvathukal underway, sorry about the delay! I'm confident I can get it done next week.

gkthiruvathukal commented 4 years ago

@remram44 Not a problem! This was intended as a gentle nudge. Thanks for your help!

ixjlyons commented 4 years ago

Same here -- I've made progress but haven't updated my checklist. I should be able to finish in the next week or so as well. Thanks for the reminder.

gkthiruvathukal commented 4 years ago

Thanks for the updates, @ixjlyons and @remram44!

remram44 commented 4 years ago

The software side looks fine to me! There are automated tests, extensive documentation including examples.

One missing item is the "community guidelines" explaining how to contribute, but that's easy to add. (I don't know if you need one, but it's on the JOSS checklist :wink:)

Here are my remarks looking around the documentation and code:

- There is a branch v1.0.0 with the same name as the tag v1.0.0, which chokes Git sometimes (it doesn't know which one you mean, even though they point to the same commit). You should probably remove the branch (to avoid an error message when deleting it, use the GitHub UI or the command git push origin :refs/heads/v1.0.0).
- Git tag names are inconsistent (0.1.0, 0.2.1, v1.0.0). You should stick with either using the prefix or not. The other tags were < 1.0, so if you're just using the prefix from now on I don't think you absolutely have to fix the older tags.
- The blackmagicsorcery module stands out a bit, especially now that it just re-exports Python's standard re module (since 31c69150). You should probably just import re directly.
- The docs use both RST and MD files (but they seem to render correctly on ReadTheDocs).
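
To make the first two points concrete, here is a minimal sketch of how one might check a list of ref names for branch/tag collisions and for mixed tag prefixes. The helper names are hypothetical, and the ref list is typed in by hand from the names mentioned above; in practice it could come from the second column of git ls-remote --heads --tags origin.

```python
def find_ambiguous_names(refs):
    """Return names that exist both as a branch and as a tag."""
    branches, tags = set(), set()
    for ref in refs:
        if ref.startswith("refs/heads/"):
            branches.add(ref[len("refs/heads/"):])
        elif ref.startswith("refs/tags/"):
            name = ref[len("refs/tags/"):]
            # Annotated tags are listed a second time with a "^{}" suffix.
            if name.endswith("^{}"):
                name = name[:-3]
            tags.add(name)
    return sorted(branches & tags)


def has_mixed_prefix(tag_names):
    """Return True if some tags start with 'v' and others do not."""
    return len({name.startswith("v") for name in tag_names}) > 1


# Hand-written example input (assumption: not read from the real repository).
refs = [
    "refs/heads/master",
    "refs/heads/v1.0.0",
    "refs/tags/0.1.0",
    "refs/tags/0.2.1",
    "refs/tags/v1.0.0",
]

ambiguous = find_ambiguous_names(refs)
mixed = has_mixed_prefix(["0.1.0", "0.2.1", "v1.0.0"])
```

Any name reported by find_ambiguous_names is one where Git has to guess between refs/heads/ and refs/tags/, which is the situation described in the first remark.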

Regarding the paper markup, it looks fine, although you probably don't need to monospace "Django" since it's the official name of the project, not just a package import name. "Leadership computing facilities" should be "Leading computing facilities".

My main concerns are about the motivation for the project. In particular, I am not sure of the value added compared to Django, which provides a lot of the claimed features of EspressoDB, some of which are just re-exported, such as database management, signals, and automatically creating views from models (Django has the "admin" pages).

- It seems to me that it might be more useful if it was easier to integrate with existing data and processes? Creating a full Django project is heavy (project folder, app folder, migrations, ... are still there and need to be dealt with manually).
- Maybe it should have more features to keep track of experiment runs and data dependencies (e.g. which data was created from which version of which data), kind of like a workflow system? The way data is added in the example seems kind of dangerous, with no versioning. This example update script has to manually check the data, counting the records to figure out if the data is up to date or not.
- It's not clear how EspressoDB helps "centralize and guarantee" data integrity from my reading of the docs. When multiple versions of the code are around with multiple people working on it, how does this code interact with a centralized database? How does the system deal with revisions of the code? Can it guarantee integrity in the presence of buggy code, which might for example attempt to delete everything?
- Also note that LatteDB is the only live example in the paper, and it doesn't let me see any data (I get a login form), though it lets me open a submission form. This doesn't seem in line with "supports [...] open-data oriented projects", although I'm sure you might have your reasons for this configuration.

There is also no related work in the paper, although data management and workflow management are big fields. Some things that come to mind:

- data management systems, such as DVC, which people might use to fill the same need
- workflow management systems (Taverna, Galaxy, VisTrails, ...) which seem very relevant after reading your Summary section
  - full disclosure: I worked on VisTrails, though the project is now dormant
- data repositories: if hosting data is the primary function, adding features to an extensible data repository system naively seems like a good approach, compared to building from a generic web framework like Django (even with the added features of EspressoDB)?

ixjlyons commented 4 years ago

I've completed my review. Overall, EspressoDB provides a fairly rich set of features for relatively little programming effort. It seems to fulfill an important role in managing complex datasets while encouraging documentation (by making it simple to generate), which is often overlooked. The documentation of EspressoDB itself is fairly thorough and provides a straightforward path to getting started and moving on to more advanced usage. I think it fits into the scope of JOSS and needs just a few improvements in my opinion.

I ended up filing one issue regarding some development installation instructions in the README and I also made a pull request with some typo/grammar fixes in the docs. Aside from that, the following are minor issues I noticed that I didn't feel warranted actual issues against the repository:

- Links to Django docs are inconsistent, sometimes using dev docs and sometimes a specific version (e.g. v2.2 in Usage.md). May want to update for consistency.
- In the example/project-creation/class-interface doc, python manage.py shell is introduced and indicates you'll get an ipython shell, however ipython isn't in the example requirements.txt. Consider adding a note in the doc and/or adding ipython to the example dependencies.
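
As a rough illustration of the second point, a reader can check which interactive shell is actually available in their environment with a few lines of standard-library Python. The function name and the preference order are assumptions made for this sketch, not something taken from the EspressoDB docs.

```python
import importlib.util


def available_shell(preferred=("IPython", "bpython")):
    """Return the first importable shell package, or 'python' as a fallback."""
    for module in preferred:
        # find_spec returns None when the top-level package is not installed.
        if importlib.util.find_spec(module) is not None:
            return module
    return "python"


shell = available_shell()
```

If this returns "python" in the example environment, the doc's promise of an ipython shell will not hold until ipython is added to requirements.txt.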

Minor grammatical issues in the paper:

- First line of the "Use case" section is a fragment. Consider using something like: "LatteDB, an application of ... calculations and analysis, is currently being..." or "LatteDB is an application of ..."
- Also in use case section: "ultimately processed down to hundred of..." -> "hundreds of".
- Consider: "...status of these files in real-time (identify corrupt..." -> "...status of these files in real-time to identify corrupt..."

I've so far left a few checklist items un-checked. Here is some rationale for these:

- Statement of need (both the documentation and paper): I think this is mostly in place, but I'm not sure I completely understand the use case(s) for EspressoDB. See below for a bit more on this.
- Community guidelines: this seems to be missing. A few sentences in the README or a contributing doc would suffice just to point out how to get help, report issues, and contribute.
- State of the field: I'm not all that familiar with related tools, but I expected some discussion of other data management solutions (e.g. Data Version Control). If it's only to mention why existing tools don't suit the needs of computational physics and such, that's fine.

Finally, I have a couple more open-ended thoughts you might consider.

There seems to be a pretty heavy reliance on users coming in with some understanding of Django to do anything nontrivial. This isn't necessarily a bad thing, but it could be stated early in the documentation that this is the case. EspressoDB (probably rightly) doesn't attempt to abstract away from Django to avoid this, so I think users should be informed up front.

A potential issue I see with this framework is that it seems to bring together system administration and data processing in a way that I'm not sure is ideal. Perhaps some explanation of how LatteDB (or a hypothetical example instead) is implemented could be sufficient. I'm left wondering, for example, who manages the system and who uses it to do computational work. If everyone using the system (i.e. writing data processing code) needs to know Django and/or web development, the applicability of the framework may be somewhat limited. The paper and/or docs might benefit from some explanation of a reasonable workflow involving multiple team members with different roles (e.g. non-scientific administrator, computation-focused programmers, scientists pulling data to do local analyses, etc.).

ckoerber commented 4 years ago

Hello @gkthiruvathukal, @ixjlyons, and @remram44,

Thank you for your time and the feedback you have provided. We believe that your comments help to improve EspressoDB. To keep updates transparent, we filed issues for suggested changes and intend to merge them into the new version v1.1.0. The filed issues are collected in a new project on EspressoDB. We believe that we should be able to finalize the new features by the end of next week.

In the next few days, we will also address the points you have made in more detail.

General statements regarding the open checkboxes:

- We will add community guidelines in the next release.
- We will add a statement addressing the state of the field in both documentation and paper.
- The publication of EspressoDB goes hand-in-hand with LatteDB. Because LatteDB is domain-specific (Lattice Quantum Chromodynamics), we decided to submit EspressoDB as it might be more beneficial to a broader audience. In our paper, we rather address the need for EspressoDB from our science domain by pointing out the challenges of Lattice computations.

For example

... CalLat creates petabytes of temporary files that are written to the scratch file system, used for subsequent computations and ultimately processed down to hundred of tera-bytes that are saved for analysis. It is essential to track the status of these files in real-time (identify corrupt, missing, or purgeable files).

To address the statement of need, should we be more explicit about how LatteDB (and thus EspressoDB) helps to track files or rather approach it from a more general point of view?

gkthiruvathukal commented 4 years ago

@gkthiruvathukal Just letting @ckoerber and all know that I'm keeping an eye on the thread. It sounds like the feedback from @ixjlyons and @remram44 has been well received and there is a plan to work on the issues raised during review.

After the issues are addressed, I'll have the reviewers take another look.

I'd like to ask for everyone's help here. I have no reason to doubt that all authors of the software are represented on the paper submission, but can each of you (authors and reviewers) please confirm for me?

"Does the full list of paper authors seem appropriate and complete?"

Yes, I know the checkbox has been checked but am asking you to check once more. I am in the middle of dealing with another submission (not edited by me) where the answer is "no" so I am now checking every one of my editorial assignments to make sure there are no authors--or potential authors--who are not listed. If you can do a brief follow-up, this would be much appreciated.

remram44 commented 4 years ago

I confirm that @ckoerber made major contributions to the software, and that the list of authors matches the major contributors of the software.

It seems that @cchang5 goes by "Jason Chang" on his GitHub profile and "Chia Cheng Chang" on the paper, but it looks like this is intended, since he personally committed to the paper itself.

cchang5 commented 4 years ago

@gkthiruvathukal I'd like to ask for everyone's help here. I have no reason to doubt that all authors of the software are represented on the paper submission, but can each of you (authors and reviewers) please confirm for me?

I am confirming that I am an author of the software.

ckoerber commented 4 years ago

I can confirm that @cchang5, @walkloud, and I are the authors of EspressoDB (and LatteDB).

ixjlyons commented 4 years ago

Confirming I've re-checked the author list against the contributors based on git history.

gkthiruvathukal commented 4 years ago

Thanks for all responses!

gkthiruvathukal commented 4 years ago

@ckoerber Just checking on the status to address the review feedback. The next step will be for the reviewers to confirm that the feedback has been addressed to their satisfaction. Then I will be in a position to make my recommendation.

ckoerber commented 4 years ago

Hello @gkthiruvathukal, we expect to respond in detail at the beginning of next week.

ckoerber commented 4 years ago

Hello @gkthiruvathukal, @ixjlyons, and @remram44,

We would like to thank you for your feedback and patience. We have posted detailed responses to both referee replies as issues on the EspressoDB repo.

Furthermore, we have created a Pull Request which contains the updates to the paper, documentation, and new features we have introduced to address concerns made by the referees.

We intend to merge this branch into master once the second review iteration is finalized. Is this in accordance with the JOSS guidelines?

Feel free to contact us if you have any questions.

Best regards,

@cchang5, @ckoerber, @walkloud

gkthiruvathukal commented 4 years ago

@ckoerber, @cchang5, and @walkloud, thank you for responding to the review feedback.

@ixjlyons and @remram44, please let me know if the team has addressed your feedback in a satisfactory manner. Then I can proceed to the next phase (acceptance).

ixjlyons commented 4 years ago

@whedon generate pdf from branch v1.1.0

whedon commented 4 years ago

Attempting PDF compilation from custom branch v1.1.0. Reticulating splines etc...

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

ixjlyons commented 4 years ago

The authors have addressed my comments thoroughly. I looked over the updated and rendered paper from the v1.1.0 branch and updated my checklist. I recommend the submission be accepted.

gkthiruvathukal commented 4 years ago

@ckoerber, I think I am ready to move toward acceptance. Can you please do the following? Please just follow up with comments for each item. I will then check off the boxes.

[x] Make a tagged release of your software, and list the version tag of the archived version here.
[x] Archive the reviewed software in Zenodo
[x] Check the Zenodo deposit has the correct metadata, this includes the title (should match the paper title) and author list (make sure the list is correct and people who only made a small fix are not on it); you may also add the authors' ORCID.
[x] List the Zenodo DOI of the archived version here.

remram44 commented 4 years ago

I completed my checklist after checking out https://github.com/callat-qcd/espressodb/pull/48, thanks!

Note that the list of workflow systems I knew off the top of my head might not be the most relevant systems: as I said, I used to work on one of them, but I've been out of this space for a few years. In particular, I don't think those tools are tailored to HPC. REANA and Galaxy are other names I remember.

Also there are probably academic publications for some systems which might be better citations than the URLs (Taverna has this page for example).

ckoerber commented 4 years ago

Hello @gkthiruvathukal, @ixjlyons, and @remram44,

We are happy that EspressoDB was accepted for publication and would like to thank you for your time and efforts.

The tagged release is v1.1.0. The Zenodo DOI is 10.5281/zenodo.3676074 for all versions and 10.5281/zenodo.3677432 for the one associated with the tag. We have verified that the Zenodo metadata matches the JOSS metadata and added ORCIDs.

gkthiruvathukal commented 4 years ago

Thanks, @ckoerber, for the follow up. I will work on the final steps and signal the EICs when done.

gkthiruvathukal commented 4 years ago

@whedon set v1.1.0 as release

whedon commented 4 years ago

I'm sorry human, I don't understand that. You can see what commands I support by typing:

@whedon commands

gkthiruvathukal commented 4 years ago

@whedon commands

whedon commented 4 years ago

Here are some things you can ask me to do:

# List all of Whedon's capabilities
@whedon commands

# Assign a GitHub user as the sole reviewer of this submission
@whedon assign @username as reviewer

# Add a GitHub user to the reviewers of this submission
@whedon add @username as reviewer

# Remove a GitHub user from the reviewers of this submission
@whedon remove @username as reviewer

# List of editor GitHub usernames
@whedon list editors

# List of reviewers together with programming language preferences and domain expertise
@whedon list reviewers

# Change editorial assignment
@whedon assign @username as editor

# Set the software archive DOI at the top of the issue e.g.
@whedon set 10.0000/zenodo.00000 as archive

# Set the software version at the top of the issue e.g.
@whedon set v1.0.1 as version

# Open the review issue
@whedon start review

EDITORIAL TASKS

# Compile the paper
@whedon generate pdf

# Compile the paper from alternative branch
@whedon generate pdf from branch custom-branch-name

# Remind an author or reviewer to return to a review after a
# certain period of time (supported units days and weeks)
@whedon remind @reviewer in 2 weeks

# Ask Whedon to do a dry run of accepting the paper and depositing with Crossref
@whedon accept

# Ask Whedon to check the references for missing DOIs
@whedon check references

# Ask Whedon to check repository statistics for the submitted software
@whedon check repository

EiC TASKS

# Invite an editor to edit a submission (sending them an email)
@whedon invite @editor as editor

# Reject a paper
@whedon reject

# Withdraw a paper
@whedon withdraw

# Ask Whedon to actually accept the paper and deposit with Crossref
@whedon accept deposit=true

gkthiruvathukal commented 4 years ago

@whedon set v1.1.0 as version

whedon commented 4 years ago

OK. v1.1.0 is the version.

gkthiruvathukal commented 4 years ago

@whedon set 10.5281/zenodo.3677432 as archive

whedon commented 4 years ago

OK. 10.5281/zenodo.3677432 is the archive.

gkthiruvathukal commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

gkthiruvathukal commented 4 years ago

@whedon generate pdf from branch v1.1.0

whedon commented 4 years ago

Attempting PDF compilation from custom branch v1.1.0. Reticulating splines etc...

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

gkthiruvathukal commented 4 years ago

@openjournals/joss-eics I'm recommending this paper for acceptance.

kyleniemeyer commented 4 years ago

OK, everything looks good to me!

kyleniemeyer commented 4 years ago

@whedon accept

whedon commented 4 years ago

Attempting dry run of processing paper acceptance...

whedon commented 4 years ago

Reference check summary:

OK DOIs

- 10.1109/sc.2018.00054 is OK
- 10.1109/SC.2018.00060 is OK
- 10.1038/s41586-018-0161-8 is OK
- 10.1103/PhysRevLett.121.172501 is OK
- 10.1051/epjconf/201817509007 is OK
- 10.1103/PhysRevD.82.094502 is OK
- 10.1093/nar/gkt328 is OK

MISSING DOIs

- https://doi.org/10.1109/sc.2018.00058 may be missing for title: Simulating the weak death of the neutron in a femtoscale universe with near-Exascale computing

INVALID DOIs

- None

whedon commented 4 years ago

Check final proof :point_right: https://github.com/openjournals/joss-papers/pull/1330

If the paper PDF and Crossref deposit XML look good in https://github.com/openjournals/joss-papers/pull/1330, then you can now move forward with accepting the submission by compiling again with the flag deposit=true e.g.

@whedon accept deposit=true
