openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
703 stars 36 forks source link

[PRE REVIEW]: mmh3: A Python extension for MurmurHash3 #5487

Closed editorialbot closed 9 months ago

editorialbot commented 1 year ago

Submitting author: !--author-handle-->@hajimes<!--end-author-handle-- (Hajime Senuma) Repository: https://github.com/hajimes/mmh3 Branch with paper.md (empty if default branch): paper Version: v4.0.1 Editor: !--editor-->@drvinceknight<!--end-editor-- Reviewers: @mrshu, @JPenuchot Managing EiC: Daniel S. Katz

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/e56d0df84eb6f090c7ec821a21f3ab44"><img src="https://joss.theoj.org/papers/e56d0df84eb6f090c7ec821a21f3ab44/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/e56d0df84eb6f090c7ec821a21f3ab44/status.svg)](https://joss.theoj.org/papers/e56d0df84eb6f090c7ec821a21f3ab44)

Author instructions

Thanks for submitting your paper to JOSS @hajimes. Currently, there isn't a JOSS editor assigned to your paper.

@hajimes if you have any suggestions for potential reviewers then please mention them here in this thread (without tagging them with an @). You can search the list of people that have already agreed to review and may be suitable for this submission.

Editor instructions

The JOSS submission bot @editorialbot is here to help you find and assign reviewers and start the main review. To find out what @editorialbot can do for you type:

@editorialbot commands
editorialbot commented 1 year ago

Hello human, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 1 year ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.04 s (576.5 files/s, 94705.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                                2            253             92           1246
Python                           6            151             21            484
Markdown                         4             69              0            259
C/C++ Header                     2             74             96            238
YAML                             5             11             21            195
TeX                              1             21              0            177
TOML                             1              4              0             38
-------------------------------------------------------------------------------
SUM:                            21            583            230           2637
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 1 year ago

Wordcount for paper.md is 932

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.18653/v1/2020.findings-emnlp.445 is OK
- 10.5281/zenodo.7882688 is OK
- 10.1145/362686.362692 is OK
- 10.1109/SEQUEN.1997.666900 is OK

MISSING DOIs

- 10.1145/1553374.1553516 may be a valid DOI for title: Feature Hashing for Large Scale Multitask Learning
- 10.1201/9781003080046 may be a valid DOI for title: Probabilistic Data Structures for Blockchain-Based Internet of Things Applications

INVALID DOIs

- https://doi.org/10.1016/j.cose.2021.102209 is INVALID because of 'https://doi.org/' prefix
editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

danielskatz commented 1 year ago

👋 @hajimes - thanks for your submission to JOSS. Can you explain a bit about what part of this is your contribution vs what is the original MurmurHash3 from by Austin Appleby? I assume all the Python code is yours, but am unsure about the C code.

danielskatz commented 1 year ago

In addition, you could work on the possibly missing and incorrect DOIs that editorialbot suggests, but note that some may be incorrect. Please feel free to make changes to your .bib file, then use the command @editorialbot check references to check again, and the command @editorialbot generate pdf when the references are right to make a new PDF. editorialbot commands need to be the first entry in a new comment.

hajimes commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1016/j.cose.2021.102209 is OK
- 10.1145/362686.362692 is OK
- 10.1109/SEQUEN.1997.666900 is OK
- 10.5281/zenodo.7882688 is OK
- 10.18653/v1/2020.findings-emnlp.445 is OK
- 10.1201/9781003080046 is OK
- 10.1145/1553374.1553516 is OK

MISSING DOIs

- None

INVALID DOIs

- 10.5555/1577069.1755873 is INVALID
hajimes commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1016/j.cose.2021.102209 is OK
- 10.1145/362686.362692 is OK
- 10.1109/SEQUEN.1997.666900 is OK
- 10.5281/zenodo.7882688 is OK
- 10.18653/v1/2020.findings-emnlp.445 is OK
- 10.1201/9781003080046 is OK
- 10.1145/1553374.1553516 is OK

MISSING DOIs

- None

INVALID DOIs

- None
hajimes commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

hajimes commented 1 year ago

@danielskatz - thanks for managing my submission.

As for contributions about the C code,

mmh3module.c (1051 loc): reviewers can assume this file is basically my own contribution. It contains hasher objects (that allow incremental updating using inner buffers), as well as wrappers (interfaces) to simple hash functions. I also made a great effort to make the original library portable to be used in various platforms, even though the visible results of this effort are tiny in terms of lines.

murmurhash3.c (287 loc) and murmurhash3.h (253 loc) contain the core logic of hashing. While the attribution of these two files should be almost entirely directed to Austin Appleby, as he invented the algorithms of MurmurHash3, I also made several tweaks to these code (e.g., conversion from C++ to C99, support for big-endian platforms, and addition of helper functions for hashers).

hashlib.h (81 loc) was just taken from the repository of CPython (https://github.com/python/cpython).

hugovk commented 1 year ago

Hello! Thanks for citing https://github.com/hugovk/top-pypi-packages!

Please could you adjust my name? "Van Kemenade" is my surname:

-29 that only 0.23% of the remaining packages in the PyPI ecosystem are more popular (Kemenade
+29 that only 0.23% of the remaining packages in the PyPI ecosystem are more popular (Van Kemenade
 30 et al., 2023). According to PePy, as of May 22, 2023, the total downloads of this
-101 Kemenade, H. van, Si, R., & Dollenstein, Z. (2023). Hugovk/top-pypi-packages: Release
+101 Van Kemenade, H., Si, R., & Dollenstein, Z. (2023). Hugovk/top-pypi-packages: Release
 102 2023.05 (Version 2023.05). Zenodo. https://doi.org/10.5281/zenodo.7882688

Thank you!

danielskatz commented 1 year ago

@hajimes - thanks for your reply. Due to the relatively small amount of code, the editors will now discuss if it meets the substantial scholarly effort criterion for review by JOSS. You should hear back in a week or so.

Also, you could make the change mentioned above by @hugovk

danielskatz commented 1 year ago

@editorialbot query scope

editorialbot commented 1 year ago

Submission flagged for editorial review.

hajimes commented 1 year ago

@editorialbot check references

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1016/j.cose.2021.102209 is OK
- 10.1145/362686.362692 is OK
- 10.1109/SEQUEN.1997.666900 is OK
- 10.5281/zenodo.7994944 is OK
- 10.18653/v1/2020.findings-emnlp.445 is OK
- 10.1201/9781003080046 is OK
- 10.1145/1553374.1553516 is OK

MISSING DOIs

- None

INVALID DOIs

- None
hajimes commented 1 year ago

@editorialbot generate pdf

editorialbot commented 1 year ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

hajimes commented 1 year ago

@hugovk Hello! I apologize for writing your name incorrectly. I just blindly used the BibTeX entry provided by Zenodo, but I should have checked the entry before citation.

Anyway, thank you so much for providing valuable statistics!

@danielskatz Thank you for your reply and reminder. I fixed the name of @hugovk and also updated lines 27-31 from

As of May 1, 2023, mmh3 was being downloaded more than 2 million times per month, and it ranks as the 1,027th most downloaded PyPI package (of around 450,000 projects), showing that only 0.23% of the remaining packages in the PyPI ecosystem are more popular (Kemenade et al., 2023). According to PePy, as of May 22, 2023, the total downloads of this library exceeded 78 millions.

to

As of June 1, 2023, mmh3 was being downloaded more than 2 million times per month, and it ranks as the 970th most downloaded PyPI package (of around 458,000 projects), showing that only 0.22% of the remaining packages in the PyPI ecosystem are more popular (Van Kemenade et al., 2023). According to PePy, as of June 2, 2023, the total downloads of this library exceeded 79 millions.

hugovk commented 1 year ago

No problem, thank you for updating, and I'm happy the statistics were useful!

danielskatz commented 1 year ago

@hajimes - Should Austin Appleby also be an author of this submission?

hajimes commented 1 year ago

Perhaps not? (However, if you think he should be, I'll ask Austin Appleby whether he can join as a co-author)

I mean, it's true that the current mmh3 repository contains some code that was ultimately derived from Austin Appleby's original SMHasher library (i.e., murmurhash3.c and murmurhash3.h which were derived from MurmurHash3.cpp and MurmurHash3.h, respectively)

But probably a better solution is to exclude these files, use a git submodule to refer to Appleby's repo, and then write a script that converts the original C++ files to more portable C code at compile time. In this way I'll be able to clarify the contribution of each engineer, as well as increase the total amount of my own code.

If you think it's a good idea, I'll refactor mmh3 in a few days or so, though I'm not sure updates after submission are subject to evaluation by the review committee.

danielskatz commented 1 year ago

@hajimes - please go ahead and do this, then we'll review it

danielskatz commented 1 year ago

👋 @hajimes - note that I'm waiting for your response

hajimes commented 1 year ago

I sincerely apologize for my late response.

I'll try to implement the modified version of my library soon, hopefully tomorrow, at least within this week.

danielskatz commented 1 year ago

@hajimes - Can you let us know here where this is now? I'm unsure from your issue above if you are done or still working, and if you are done, you could summarize here what you did and how things have changed from the POV of understanding your contributions and what JOSS will review.

hajimes commented 1 year ago

I am very sorry for the repeated delay. As a progress report, I uploaded the result of refactoring to the feature/refactor-submodule branch that should be near-final, where ./src/mmh3/_mmh3/refresh.py handles the conversion of the original code by Austin Appleby.

I'll recheck my code later and merge it into the master branch, and then explain the changes in detail.

Again, so sorry for causing you trouble.

hajimes commented 1 year ago

I finally uploaded the update as v4.0.1 yesterday.

Files that are mostly dependent on the original now reside in https://github.com/hajimes/mmh3/tree/master/src/mmh3/_mmh3

The project structure is roughly based on the _hacl module in CPtyhon's repo, which is used by hashlib, a part of the Python standard library. https://github.com/python/cpython/tree/main/Modules/_hacl

That is,

  1. Appleby's original C++ repo is referred by git submodule.
  2. refresh.py (that is my contribution) transforms original files into new forms to be used by this project.
  3. Newly generated C files (concretely, murmurhash3.c and murmurhash3.h) stay as a part of my repository, rather than are treated as temporary files during the build time. I chose this design because mmh3module.c depends on murmurhash3.h.

Each transform is represented by a string-to-string function, and the building process of new files is represented by a list of code fragments each of which is accompanied by these transforms.

The transforms can be loosely classified into four categories:

I am really thankful to editors and reviewers for giving me an opportunity to rethink the structure of my project.

danielskatz commented 1 year ago

@editorialbot set v4.0.1 as version

editorialbot commented 1 year ago

Done! version is now v4.0.1

danielskatz commented 1 year ago

👋 @drvinceknight - can you edit this JOSS submission?

danielskatz commented 1 year ago

@editorialbot invite @drvinceknight as editor

editorialbot commented 1 year ago

Invitation to edit this submission sent!

drvinceknight commented 1 year ago

@editorialbot set me as editor

editorialbot commented 1 year ago

I'm sorry human, I don't understand that. You can see what commands I support by typing:

@editorialbot commands

drvinceknight commented 1 year ago

@editorialbot assign me as editor

editorialbot commented 1 year ago

Assigned! @drvinceknight is now the editor

drvinceknight commented 1 year ago

@hajimes - I'll be taking this submission on as your editor. Could you take a look a this list of potential reviewers and identify a few people who would be good candidates to review this submission?

hajimes commented 1 year ago

@drvinceknight - Thank you, I'm glad you took the time to be my editor!

Considering programming languages (Python and C/C++) and topic areas (high performance computing, natural language processing, machine learning, internet of things, cybersecurity, etc.), I'd suggest the following people as good candidates to review my paper:

drvinceknight commented 1 year ago

@proycon @mrshu @jarrah42 would any of you be willing to review this submission for JOSS? The submission under consideration is mmh3: A Python extension for MurmurHash3.

The review process at JOSS is unique: it takes place in a GitHub issue, is open, and author-reviewer-editor conversations are encouraged. You can learn more about the process in these guidelines: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html

Based on your experience, we think you might be able to provide a great review of this submission. Please let me know if you think you can help us out!

mrshu commented 1 year ago

I would be happy to help with the review.

drvinceknight commented 1 year ago

Thank you @mrshu and thank you for your patience everyone.

drvinceknight commented 1 year ago

@editorialbot add @mrshu as reviewer

editorialbot commented 1 year ago

@mrshu added to the reviewers list!

drvinceknight commented 1 year ago

@innat @sergioburdisso would either of you be willing to review this submission for JOSS? The submission under consideration is mmh3: A Python extension for MurmurHash3.

The review process at JOSS is unique: it takes place in a GitHub issue, is open, and author-reviewer-editor conversations are encouraged. You can learn more about the process in these guidelines: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html

Based on your experience, we think you might be able to provide a great review of this submission. Please let me know if you think you can help us out!

danielskatz commented 1 year ago

@drvinceknight - are you able to continue looking for reviewers?