
[REVIEW]: Diart: A Python Library for Real-Time Speaker Diarization #5266

Closed - editorialbot closed this issue 2 days ago

editorialbot commented 1 year ago

Submitting author: @juanmc2005 (Juan Manuel Coria)
Repository: https://github.com/juanmc2005/StreamingSpeakerDiarization
Branch with paper.md (empty if default branch): joss
Version: v0.9.1
Editor: @Fei-Tao
Reviewers: @sneakers-the-rat, @mensisa
Archive: 10.5281/zenodo.12510278

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/cc9807c6de75ea4c29025c7bd0d31996"><img src="https://joss.theoj.org/papers/cc9807c6de75ea4c29025c7bd0d31996/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/cc9807c6de75ea4c29025c7bd0d31996/status.svg)](https://joss.theoj.org/papers/cc9807c6de75ea4c29025c7bd0d31996)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@sneakers-the-rat & @mensisa, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @Fei-Tao know.

Please start on your review when you are able, and be sure to complete your review within the next six weeks at the very latest.

Checklists

📝 Checklist for @mensisa

📝 Checklist for @sneakers-the-rat

editorialbot commented 1 year ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 1 year ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.05 s (552.7 files/s, 70207.4 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          23            459            689           1882
Markdown                         2            102              0            366
TeX                              1              3              0             27
YAML                             1              1              4             18
TOML                             1              0              0              6
-------------------------------------------------------------------------------
SUM:                            28            565            693           2299
-------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 1 year ago

Wordcount for paper.md is 459

editorialbot commented 1 year ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1109/ASRU51503.2021.9688044 is OK

MISSING DOIs

- None

INVALID DOIs

- None
editorialbot commented 1 year ago

📄 Download article proof · 📄 View article proof on GitHub

mensisa commented 1 year ago

Review checklist for @mensisa

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

sneakers-the-rat commented 1 year ago

(pinging to say sorry i am so delayed, working on a time-sensitive writing project and this is on my list for when i finish that this week)

Fei-Tao commented 1 year ago

Hi @sneakers-the-rat, any updates on the review? Please feel free to let us know if you need any help. Thanks again for your time.

sneakers-the-rat commented 1 year ago

Sorry, have been having a rough few months. just getting back up to speed with my work responsibilities, and I think i'll be able to get to this next week. Sorry for the extreme delay

sneakers-the-rat commented 1 year ago

Again extremely sorry, working on this now.

sneakers-the-rat commented 1 year ago

Review checklist for @sneakers-the-rat

Conflict of interest

Code of Conduct

General checks

General comments: One minor nitpicky thing I would say here is that it would be good to get the images and paper pdf out of the root of the repo, as it makes the package feel less tidy than it actually is! Other than that, looks good!!!

Functionality

General comments: Code runs! joy!

Documentation

Collected docs issue here - https://github.com/juanmc2005/diart/issues/176

Additional thoughts on docs:

Software paper

Summary

Thanks for the very lovely package! Code structure is great, clearly written and commented, and all the CLI entrypoints run out of the box with very little fuss (aside from what's described in the README for installation, but that's totally fine).

I get the feeling that this is an extension of a prior package, so I'm not quite sure what the expectations for documentation and testing should be - the code itself is readable, the examples seem to cover major functionality, and there are some docstrings, but there isn't anything you would call "API documentation" here.

The most glaring thing missing here is a test suite. The package seems to run fine from what I could tell, but I am unable to evaluate the accuracy of the results because I don't know that the intermediate functions/methods are well tested. I can sympathize with the author in not wanting to write them (again, particularly if this is an extension package to a well-tested prior package), but requiring tests is about more than keeping the code working: it's about keeping the package maintainable and open to contributions, for the overall health of the scientific programming ecosystem. If the author gets busy with life and has to drop maintenance, would anyone else be able to pick it up? Will anyone be able to run this in 5 years? Tests make those questions a lot easier to answer. I'm not sure how the other reviewer checked this box, because the tests aren't there!

It doesn't necessarily bear on the review process at JOSS, but I do feel like I need to say that I don't love needing to create an account on huggingface.co to test the package, and I am skeptical about making packages dependent on some cloud service existing for free indefinitely (i am aware GitHub is also a platform, but hopefully git makes redundant enough copies of things, and archiveteam is on that), so i would love to see the authors host the model weights elsewhere as another option - even someplace like Zenodo would be great. The models are quite small and would even fit in this repository; not sure why they aren't there. Again, this doesn't bear on the review process at all.

That said, there is a lot to love here (one minor UX note: it's rare to see progress bars for multiprocessing tasks done well. The model provisioning worked smoothly once I installed the huggingface-cli, which was lovely (i usually hate that part). I also loved the live plot of the results when I was using my mic). Overall this is a well-executed tool that does exactly what it describes. I wish I had more time to spend with the code (i know this will read as ironic given how long it took me to actually get to the review), as I usually like to hunt around for code improvements that the author can make to provide whatever training or tips i can, but it's also relatively clear they don't need it as this package is very well written. I don't believe in reviewing-as-gatekeeping, so every review is a "pass" from me. My comments here, and boxes left unchecked above are to indicate to the authors where things could be improved and to the editor if they need to uphold the checklist.

Issues Raised

Pull requests

sneakers-the-rat commented 1 year ago

stalled on https://github.com/juanmc2005/diart/issues/158 - can't run program

arfon commented 9 months ago

@juanmc2005 @Fei-Tao – it looks like @sneakers-the-rat is blocked by https://github.com/juanmc2005/diart/issues/158 – could you agree on a plan to move this forward, please?

sneakers-the-rat commented 9 months ago

I am very sorry for the even more extreme delay than last time. Still on my TODO list, but it keeps getting shunted down while I prep for a conference. I'll be releasing a package early next week, after which I'll make time, because this has gone on too long already!

Fei-Tao commented 9 months ago

Hi @arfon, thanks for following up on this submission. Since @sneakers-the-rat plans to finish reviewing this submission, can we wait for his response?

@sneakers-the-rat No worries, that's understandable. Thanks for your time in reviewing this submission. I'm looking forward to hearing your comments and suggestions. Please feel free to let us know if you need any help.

arfon commented 9 months ago

Hi @arfon, thanks for following up on this submission. Since @sneakers-the-rat plans to finish reviewing this submission, can we wait for his response?

Of course! Just wanted to prompt to see what is happening here.

Fei-Tao commented 9 months ago

@arfon Great. Thanks for your prompt response!

sneakers-the-rat commented 9 months ago

OK! finished my review, hopefully my general review at the bottom answers any questions, and open issues and pull requests are suggestions! I'll leave any further decision up to the editor here, i'm a "pass" on this one since i view my role as a reviewer as helping the authors improve the package, not determining whether it should be in or out of an exclusive club <3. Lovely package, thanks for the opportunity to look at it, and again sorry it took so long. Cheers to the authors!

juanmc2005 commented 9 months ago

@sneakers-the-rat thank you for your invaluable feedback!

sneakers-the-rat commented 9 months ago

btw i don't mean to presume but if you need any help setting up docs, tests, or want examples just lmk!

juanmc2005 commented 9 months ago

@sneakers-the-rat some help with the docs/tests would be very much appreciated if you have the time!

Fei-Tao commented 8 months ago

Hi @sneakers-the-rat, thanks again for taking the time to review this submission. Would you please complete the review at your convenience? Please let us know if you need any help.

sneakers-the-rat commented 8 months ago

I did complete the review!

My comments here, and boxes left unchecked above are to indicate to the authors where things could be improved and to the editor if they need to uphold the checklist.

Review passes from me, but the boxes left unchecked just indicate things that are not technically met; I was leaving it up to you and the author to decide what to do about them.

Fei-Tao commented 8 months ago

Hi @sneakers-the-rat, sorry for the confusion, and thanks for your prompt response! @juanmc2005 would you please address the above issues?

juanmc2005 commented 8 months ago

Hi @Fei-Tao @sneakers-the-rat, yes I will prioritize those issues.

I'm copying the list here so I can update the progress:

sneakers-the-rat commented 8 months ago

I'm conference prepping atm, but when i return i can help get these last things in place.

Docs - est. time ~1h if you've never done it. You already have docstrings, so to avoid needing to copy/paste those into the README, i think the simplest thing would be a minimal Sphinx boilerplate docs site that takes what you've already written in the README (which is totally sufficient for user docs) and then adds autodoc calls to document the classes/functions. See here for the Sphinx quickstart, here for Read the Docs, here for an example of literal-including a README, and here for an example docs repo with API docs.
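
For reference, a minimal `docs/source/conf.py` for that kind of site might look roughly like this (the extension list and theme are just one reasonable choice on my part, not a requirement):

```python
# docs/source/conf.py - a minimal sketch; project metadata, extensions,
# and theme are placeholders to adapt, not requirements.
project = "diart"
author = "Juan Manuel Coria"

extensions = [
    "sphinx.ext.autodoc",   # generate API pages from the existing docstrings
    "sphinx.ext.napoleon",  # parse Google/NumPy-style docstring sections
    "myst_parser",          # lets the docs include the existing README.md
]

html_theme = "sphinx_rtd_theme"  # pairs well with Read the Docs hosting
```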

Tests - est. time probably ~2h if you've never done it. JOSS doesn't require CI, just the existence of tests, so rather than requiring you to write full unit tests for every part of the package, why not just wrap your existing benchmark function in a pytest test? See here where i've made a simple example to test the config - https://github.com/sneakers-the-rat/diart/commit/b1a0ccaa35f8b36aa30f978a4bcb16db69652a42 - then, since you already have your expected outputs in the repo, just run the benchmark and check that its outputs match the expected output. That's not as sensitive a test as you would want from a full test suite, but it will tell you if something somewhere in the whole pipeline is broken. A rough sketch of what that could look like is below.
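
This is only a sketch under some assumptions: i'm assuming `Benchmark` is constructed with an audio directory and a reference RTTM directory (as in the README), and the data paths here are hypothetical.

```python
# tests/test_benchmark_runs.py - a rough sketch, not diart's actual test suite.
from pathlib import Path

import pytest

from diart import SpeakerDiarization, SpeakerDiarizationConfig
from diart.inference import Benchmark
from diart.models import SegmentationModel

DATA_DIR = Path(__file__).parent / "data"  # hypothetical location of sample audio + RTTM


@pytest.mark.skipif(not (DATA_DIR / "audio").exists(), reason="benchmark data not available")
def test_benchmark_runs_end_to_end():
    segmentation = SegmentationModel.from_pretrained("pyannote/segmentation@Interspeech2021")
    config = SpeakerDiarizationConfig(segmentation=segmentation, step=0.5)
    # Assumed constructor: audio directory first, reference RTTM directory second
    benchmark = Benchmark(str(DATA_DIR / "audio"), str(DATA_DIR / "rttm"))
    # Running the whole pipeline end to end is the test itself; comparing the
    # report against the expected outputs already in the repo makes it stronger.
    report = benchmark(SpeakerDiarization, config)
    assert report is not None
```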

Contributor docs - est. time ~15m. These are an important part of setting the social climate of a package, but at this stage it's not mission critical and you have plenty of time to establish norms as you go. Reading and including e.g. the Contributor Covenant, plus some simple instructions on how you'd like to receive contributions (e.g. search existing issues first, open a discussion, raise an issue, make PRs against the dev branch, etc.), would totally suffice for the sake of completing the checklist.

So close! Sorry for these last little things, but they will hopefully help keep the project maintainable in the long term, and they're good practice for any package!

juanmc2005 commented 7 months ago

@sneakers-the-rat the documentation page is now live at https://diart.readthedocs.io/

juanmc2005 commented 7 months ago

@sneakers-the-rat concerning the tests for "benchmark" with expected outputs, it's not that straightforward. The expected outputs match the implementation used in the paper, which is not exactly the same as diart (improvements have been made since then). Even then, future changes might break such a test without that being a bad thing.

Instead, I would focus on covering (at the very least) all the main functionality in `diart.blocks` (i.e. the `__call__` method of each block class).

sneakers-the-rat commented 7 months ago

You know what would be better to test than I would! Lmk if you need any guidance writing those :)

Kevin-Mattheus-Moerman commented 5 months ago

@Fei-Tao please check in here about twice a week to keep the review going.

Fei-Tao commented 4 months ago

Hi @juanmc2005, I hope this message finds you well. Do you have any updates on this submission?

juanmc2005 commented 4 months ago

Hi @Fei-Tao, I started writing some tests but I've been extremely busy at work so I haven't had the time to open that last PR. I will try to prioritize it so the submission can move forward

crvernon commented 2 months ago

@Fei-Tao please see if we can get this one moving along, otherwise we may need to consider a pause or withdraw option. Thank you!

sneakers-the-rat commented 1 month ago

Reiterating my willingness to help with tests if the authors let me know what the barriers are here :). Basically, I think a simple set of end-to-end tests that validate the core behavior of the software would be enough for me; I'm not asking the authors to write an entire unit testing suite from the ground up. To me, reviewing is about both ensuring some minimum standard of maintainability and functionality for scientific software and teaching and sharing best practices, so I am more than happy to help!

Fei-Tao commented 1 month ago

Hi guys, I was traveling for the past two weeks and had limited access to my laptop, so I'm sorry for the late reply. @sneakers-the-rat thanks for your comments and advice on this submission; they are very helpful for improving it. @juanmc2005 would you please address the issues raised above at your convenience? This review has already taken quite a long time; if we do not receive your feedback, we may need to consider pausing or withdrawing the submission. Thanks for your time.

juanmc2005 commented 1 month ago

@Fei-Tao @crvernon I was initially thinking of adding unit tests for the entire library, which is why it's been taking me so long. However, as @sneakers-the-rat pointed out, it seems that what's required is a simple suite of end-to-end tests for the main functionality.

As I mentioned before, I think this end-to-end testing is covered by the `diart.benchmark` CLI, the hyper-parameter values we provide for each dataset, and the expected performance that should be observed. It is not possible to do this automatically for most of the datasets because they're either not freely available or too big to be practical. However, I could write a document with step-by-step instructions for testing the library on AMI (including how to get the data) and showing the expected performance, or even add an option `--test AMI` to download the data automatically before running.

Once again, the core of this testing is already implemented by `diart.benchmark`, and the required configuration is given in the README under the reproducibility section, with the exception of downloading the data.

Would that be an acceptable solution?

sneakers-the-rat commented 1 month ago

I think you should add some sample data that has a known output value and add a simple test with pytest that confirms the benchmark function reproduces that output :). If you want to add unit tests, that is only to your benefit of course, and I would love to see that, but I wouldn't hold the paper up over it. Ideally you would also have CI, but the checklist doesn't require that.

like i mentioned here: https://github.com/juanmc2005/diart/issues/175 and in the checklist above, i was able to get output values close to those in the paper for one of the datasets, but it required a decent amount of digging around to find the dataset. So for the sake of testing, you definitely want something small that exercises the basic functionality of the package without needing a bunch of manual steps to run the test.

So i would do something like find a 5-10 second sample of particularly challenging audio, run the benchmark over it with all the permutations of the settings you want to test for, confirm that those are correct, and then save those as expected outputs in the tests directory as well.

e.g. (some pseudocode, because i'm not going to bother running this right now): assuming the sample is in tests/data/audio/sample.mp3, the expected outputs are in tests/data/rttm, and this test lives in tests/test_benchmark.py, here is how you would test a range of audio samples and latencies:

```python
import pytest
from pathlib import Path

from pyannote.database.util import load_rttm

import diart.sources as src
from diart import SpeakerDiarization, SpeakerDiarizationConfig
from diart.inference import StreamingInference
from diart.models import SegmentationModel

DATA_DIR = Path(__file__).parent / 'data'
BATCH_SIZE = 32


@pytest.fixture(scope="session")
def model():
    model_name = "pyannote/segmentation@Interspeech2021"
    return SegmentationModel.from_pretrained(model_name)


@pytest.fixture(scope="session")
def make_config(model):
    # Fixture factory: build a config for a given latency
    def _config(latency):
        return SpeakerDiarizationConfig(
            segmentation=model,
            step=0.5,
            latency=latency,
            tau_active=0.555,
            rho_update=0.422,
            delta_new=1.517,
        )
    return _config


@pytest.mark.parametrize('source_file', sorted((DATA_DIR / 'audio').glob('*.mp3')))
@pytest.mark.parametrize('latency', [0.5, 1, 2, 3, 4, 5])
def test_benchmark(make_config, source_file, latency):
    config = make_config(latency)
    pipeline = SpeakerDiarization(config)
    padding = pipeline.config.get_file_padding(source_file)
    source = src.FileAudioSource(
        source_file,
        pipeline.config.sample_rate,
        padding,
        pipeline.config.step,
    )
    pipeline.set_timestamp_shift(-padding[0])
    inference = StreamingInference(
        pipeline,
        source,
        BATCH_SIZE,
        do_profile=False,
        do_plot=False,
        show_progress=False,
    )
    pred = inference()

    expected_file = (DATA_DIR / 'rttm' / str(latency) / source_file.name).with_suffix('.rttm')
    # load_rttm returns a {uri: Annotation} mapping; take the single annotation
    expected = next(iter(load_rttm(str(expected_file)).values()))
    assert pred == expected
```

juanmc2005 commented 1 month ago

@sneakers-the-rat thanks for your answer! I can add a test case like this. My only concern is that I don't necessarily want to check for "expected" outputs, because what counts as expected is unclear. In practice there are no "correct" outputs; an output would only be "correct" if it were perfect, like a human annotation, and no diarization system can produce human-level outputs yet.

In this context, it's unclear to me what the purpose of the test would be. As soon as an improvement is made to the core algorithm (e.g. clustering) to make it more accurate, the test becomes useless, because the outputs might be better this time, or they might be worse for the test file but better overall on a benchmark dataset (which is what users actually care about). This is what already happened with the expected outputs from the paper that I provided at the very beginning, which were later improved upon. The end result is that each time I upgrade the library I would also have to update the expected outputs, so it feels like writing a test just so it passes.

I understand it could be useful to verify that small changes outside the core features don't break a specific version. But a more robust test is to actually run a benchmark on one of the freely available target datasets like AMI, because the outputs might be identical for one file but break on others.

In any case, I'll go ahead and work on this end-to-end test next weekend. I'll keep you updated about the progress.

juanmc2005 commented 1 month ago

@sneakers-the-rat concerning CI, I already include GitHub Actions for linting and for quickly checking that the main operations run on both Python 3.8 and 3.10 on every PR. There's also a GitHub Action for automatic deployment to PyPI when creating a release.

sneakers-the-rat commented 1 month ago

This is totally fair and a very good point!

I feel multiple ways about this: on the one hand, I don't want to police what "counts" as a test, and the JOSS reviewer guidelines don't describe any standard for coverage, robustness, or completeness of tests (they probably should, but they don't). I also don't want to demand you write a whole ton of code for the sake of this review, which has already gone on long enough. On the other hand, I do feel like the purpose of review is to help the software reach some minimal state of maintainability and reliability, and that includes testing for core functionality.

It is very true that snapshot-based end-to-end tests are some of the weakest tests you can write, and that they are mostly useful when accompanied by unit and other more specific tests. They are not entirely pointless, though, as they do signal that something about the program has changed - e.g. if your upstream model changes in a way that breaks it (instead of improving it), you would want to know that even if nothing changed in your code. So while snapshot-style tests are not ideal, they are relatively near at hand and still cover core functionality, without asking you to write a ton of unit tests.

I did see the CI action that runs the benchmark, but testing only for execution is a few steps weaker still (though still not nothing!), such that I don't think it helps the package be a maintainable and reliable part of open source tooling - e.g. if you accept a large PR and it accidentally starts adding 10 seconds to all timestamps, those tests would still pass even though the tool is broken.

My example was just a quick illustration, and I take your point about the moving target - the key point is having a known input and output. So the ideal case is one where you definitely know the correct times (and the audio signal is still naturalistic, so it's not "too easy" - because you're right, you do want reasonable coverage of the input space, since some samples may fail while others still work), and then you set a threshold, something like "do the estimated times fall within some window that I would still call correct?" If the model gets more accurate, great - that should just push the values closer to the center of that expected window, but you wouldn't have to fiddle with the expected values for some trivial millisecond change.
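
For example (just a sketch: the RTTM paths and the 10% DER threshold are illustrative assumptions of mine, and `load_rttm`/`DiarizationErrorRate` come from pyannote), the comparison could look something like this:

```python
# A sketch of a tolerance-based check instead of an exact-match snapshot.
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate


def test_prediction_within_tolerance():
    # load_rttm returns a {uri: Annotation} mapping; grab the single file's annotation
    reference = next(iter(load_rttm("tests/data/rttm/sample.rttm").values()))
    hypothesis = next(iter(load_rttm("tests/output/sample.rttm").values()))
    der = DiarizationErrorRate()(reference, hypothesis)
    # Pass as long as the error stays inside the window we would still call
    # "correct"; model improvements just push the DER further below the bar.
    assert der <= 0.10
```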

Re: the form of the tests, I do think that using a tool like pytest is preferable to a simple CLI test because it gives you room to grow for future testing, makes for a better developer experience for potential contributors, and allows you to do more complex tests, e.g. on the product of conditions (audio samples × configuration options) as I demo in the comment above. The CI action would then just run pytest. If you prefer to have the tests just as a CLI call I definitely wouldn't force you to use pytest, so consider this a strong but friendly recommendation to bring the package in line with best practices ♥

To avoid cluttering the review issue, if there's more discussion to be had let's move it into an issue on diart. Like I said, good and fair points! Not trying to shut down conversation, just keeping things tidy given GitHub's lack of threading on issues.

juanmc2005 commented 1 month ago

@sneakers-the-rat I just opened the following PR with some unit tests and the end-to-end test we discussed: https://github.com/juanmc2005/diart/pull/237

I also added a GitHub action to automatically run pytest on all PRs to main and develop

sneakers-the-rat commented 1 month ago

Fabulous, that looks great!!!! I think it will pay off - nice attention to detail on the fixtures and such. I'm good to go on tests. Let me finish my review this evening :)

juanmc2005 commented 1 month ago

thanks! I'll go ahead and merge the PR. It will be included in the next release. I think all review boxes have been checked now 🙂

juanmc2005 commented 1 month ago

@sneakers-the-rat have you had some time to finish the review?

sneakers-the-rat commented 1 month ago

I'm on strike at the moment. Review passes from me, but I'm not doing any work like reviewing until UCLA and UC resolve the strike - I hope you understand.

crvernon commented 1 month ago

@Fei-Tao - it looks like @sneakers-the-rat has approved the review and @mensisa has completed the majority of their checklist. Please check in on this one regularly to get it across the finish line as it has been in review for a while. Thank you!

Fei-Tao commented 1 month ago

@crvernon Thanks for your reminder! @mensisa @sneakers-the-rat Thanks for your constructive comments and suggestions for this submission! They are helpful in improving this submission.

Fei-Tao commented 1 month ago

Post-Review Checklist for Editor and Authors

Additional Author Tasks After Review is Complete

Editor Tasks Prior to Acceptance

Fei-Tao commented 1 month ago

@editorialbot check references

editorialbot commented 1 month ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1109/ASRU51503.2021.9688044 is OK

MISSING DOIs

- No DOI given, and none found for title: pyannote.audio: neural building blocks for speaker...
- No DOI given, and none found for title: Optuna: A Next-generation Hyperparameter Optimizat...
- No DOI given, and none found for title: Ego4D: Around the World in 3,000 Hours of Egocentr...

INVALID DOIs

- None