scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
577 stars 152 forks source link

[JOSS] Comments #769

Closed rcannood closed 10 months ago

rcannood commented 2 years ago

Hi all!

I'm currently reviewing anndata's submission to JOSS (openjournals/joss-reviews#4371). I think the paper is very well written and is worthy of publication at JOSS. However, in going over the reviewer checklist provided by JOSS, there are some minor outstanding issue which I cannot check off at the moment.

Below is an overview of all my relatively minor comments. I'm happy to discuss them in this thread or in a separate issue, if need be.

Robrecht


General issues

Functionality

Documentation

Software paper

falexwolf commented 2 years ago

Thank you very much for the review, @rcannood! We'll get back to you!

falexwolf commented 2 years ago

I'm sorry it takes us so long.

We need a bit more time to compile the benchmarks we have. We'll get back in a week.

rcannood commented 2 years ago

Hi Alex et al.! How are things going? Let me know if anything is unclear or a misinterpretation from my end. I'd be happy to discuss! :)

falexwolf commented 2 years ago

Robrecht!

Apologies for the slow response!

Thank you very much for the in-depth review. It definitely helped us improve the paper!

General issues

Functionality → performance → benchmarks

Documentation

Software paper

All changes discussed in this response can be viewed together in this PR: https://github.com/scverse/anndata/pull/779. The up-to-date version of the paper is in the paper branch https://github.com/scverse/anndata/pull/666, as required by JOSS.

We'd be grateful for guidance on the question regarding Figure 1! We're also happy to further adapt the paper to your suggestions!

rcannood commented 2 years ago

Thanks for your hard work @falexwolf and colleagues! I ticked off the check boxes for which I have no further questions or comments. I also updated the progress of my review checklist at https://github.com/openjournals/joss-reviews/issues/4371#issuecomment-1117684175 accordingly :)

My responses to your comments:

falexwolf commented 2 years ago

Thanks a lot for this second round, @rcannood! (Why am I so slow in responding? Sorry!)

I pinged the editor re the repo location and am talking to @ivirshup about the rest!

ivirshup commented 2 years ago

Hey @rcannood, sorry also missed the update here!

"Increasing scale of data" → the dataset used in this benchmark only contains 2300 cells

Looking into making something more up to date here!

It's very cool that you're able to track the performance of anndata in this way! However, I'm having a hard time browsing the results from this workflow. Are the results of the ASV benchmarks the one that are published in readthedocs?

They are not.

At the moment, we just treat this as tooling for development, and don't publish results. We need to get dedicated hardware set up to have meaningful historical results (since performance can be quite variable between machines otherwise).

Additionally, the benchmarks here can be quite specific, and I'm not sure they would be informative by themselves. Here is an example of output: https://www.astropy.org/astropy-benchmarks/

nlhepler commented 2 years ago

Hello all!

Thank you for the very well-written paper and similarly well-built and well-maintained anndata package. I concur with @rcannood that it is worthy of publication, after a few things have been addressed:

A couple remarks I think warrant some interaction/debate:

I also have a few other comments that I would like to share that (I do not think block publication but) might be helpful if you would like to further polish this submission or if you intend to submit a follow-up paper in the future:

Thank you all again for the opportunity to review this paper and render remarks!

Best,

Lance Hepler

falexwolf commented 2 years ago

Thank you for this in-depth review! It provides very valuable feedback!

We'll thoroughly respond, but it's likely gonna take a little time.

rcannood commented 2 years ago

Hi @falexwolf ! Any updates on this issue? (I feel Lance should have created a separate issues, so the topics can be tackled separately instead of being intermingled, but ok.) Many of my comments are relatively minor, so it'd be a pity of them to hold up the submission of AnnData to JOSS.

falexwolf commented 2 years ago

Hi @rcannood, @ivirshup and I made a list and we worked out many of them (e.g., benchmarking). We'll definitely get back.

I'm sorry that I on my end got much busier with @laminlabs in the past months, but we'll definitely not leave this unfinished.

I'll ping Isaac to have another meeting to finalize this.

falexwolf commented 2 years ago

@ivirshup & I are scheduled to meet today for the first time after 2 months. We split up the work. I'll add my responses here for now and very much hope we can finalize with another comment in this thread later today.

I'll address Lance's points in the same order and grouping as he provided them.

Regarding the text:

  1. "repo location": Fixed in all places where we could.
  2. "figure text": @ivirshup, next comment.
  3. "layers": https://github.com/scverse/anndata/pull/825/commits/15160220279e9d297824e24a2ebea8937521c437
  4. "uns": https://github.com/scverse/anndata/pull/825/commits/e9a5b02ff5b69f13ef8007d9f6391f24206dc9e2
  5. "Why are one- and multi-dimensional annotations separated in two fields?": Two reasons are a. Usage patterns: One-dimensional annotations are used differently than multi-dimensional annotations. They always serve as labels with some semantic content (even if numeric), while multi-dimensional annotations store representations of the data that need additional algorithms for interpretation (assigning semantic content). Having two slots for grouping things that are used for similar things seemed like a good UX (making a few canonical things easy, like, "benchmark these alternative latent representations", "combine these labels into one", "fit a model on these conditions"). b. One-dimensional annotations have a simple data type and a canonical storage format for grouping them (a DataFrame). But back in 2017, even this couldn't easily be stored persistently in an HDF5. So, when writing the on-disk format, I had to write a persistent serialization of a DataFrame and of multi-dimensional annotations. Both was feasible without too much work when done separately as each of these had simple data types, but I wouldn't have wanted to mix it (allowing for multi-dimensional annotations with mixed data types). @ivirshup implemented the latter years later.
  6. Typo: https://github.com/scverse/anndata/pull/825/commits/24895a857a5335801146654115998a3a9825eca0
  7. "h5ad": https://github.com/scverse/anndata/pull/825/commits/66962c58c43de10b09d1dbfc03c3a2aeee993482
  8. "schema tracking": @ivirshup, next comment.

Debate:

  1. "I think it is maybe inaccurate to imply AnnData is a data structure library (The AnnData object section).": @ivirshup & I discussed this remark in early August and thought we'd prefer to keep the term "data structure". While we agree that AnnData is based on existing data structures, it composes these to define a new data structure. We think this is a bit similar to how pandas stores data internally in numpy arrays, but composes them to give rise to a much richer structure, the DataFrame. Like pandas and other "data structure libraries", anndata offers APIs to conveniently interact with the new data structure.
  2. "Clearer statement about filling a gap". We agree: https://github.com/scverse/anndata/pull/825/commits/4bbf466bbb14916a1714b3ac70465c690e9e2a39

Outlook:

  1. "Lessons learned & benchmarks". @ivirshup will respond re benchmarking as he has spent much over the past years. I'm not sure whether there are very specific "lessons learned" for anndata. Clearly, today, a lot could be done differently. But back then, certain requirements made the adoption of say xarray hard to impossible. Isaac may have more things to say.
  2. "Benchmarking": @ivirshup will address this.
  3. "Self-comparative benchmarks": We agree!
  4. "Low-level API": @ivirshup will address this.

Let us follow up with another post to address all outstanding remarks.

All changes discussed in this response can be viewed together in this PR: https://github.com/scverse/anndata/pull/825. They follow on the changes made in response to @rcannood (https://github.com/scverse/anndata/pull/779 / https://github.com/scverse/anndata/issues/769#issuecomment-1144614058).

We'll also address @rcannood's outstanding points beyond https://github.com/scverse/anndata/issues/769#issuecomment-1160600641.

The up-to-date version of the paper is in the paper branch https://github.com/scverse/anndata/pull/666, as required by JOSS.

nlhepler commented 2 years ago

Thank you for the responses! A remaining nitpick.

Text:

  1. I presume I'm waiting on @ivirshup?
  2. "which allows to store" is awkward phrasing --> "which allows one to store" or "which allows storing", etc.
  3. Thank you for this explanation! It is probably too much detail to put into the main text, I wish there was a way to include supplemental information like this.

Debate:

  1. I will concede. It is likely more sensible than any other description.
rcannood commented 2 years ago

@nlhepler Next time, please create a separate issue for your own comments. It's becoming hard to track what still needs to be done :)

rcannood commented 2 years ago

To recap, I think my remaining issues are:

General issues

Functionality

Documentation

Let's make one more push towards getting AnnData published in JOSS. Let's go @falexwolf @ivirshup !

falexwolf commented 2 years ago

I'm fully onboard, @rcannood - I think the remaining points are what you point out + the font sizes of the figures. The repo location can't be fixed within the paper itself, there is no metadata section. So, I'll clarify that with the editors. Unfortunately, I can't do the other things independently but need @ivirshup for this.

@ivirshup, let's go! We've formally worked on this for more than 2 years (https://github.com/ivirshup/anndata-paper/graphs/contributors) and had prepared it for even longer. 😅 It'd be a shame if we didn't finish it after so much time.

falexwolf commented 1 year ago

@ivirshup - Trying to ping you here to get this back on your radar. 😅 Let me know if I can help in any way!

rcannood commented 1 year ago

I met up with @ivirshup during the scverse hackathon. He mentioned that there are benchmarks on ±100'000 cells performed by a CI. I'm eager to see the results :)

falexwolf commented 1 year ago

Great that you caught @ivirshup, @rcannood! I'm eager to see the results, too!

rcannood commented 1 year ago

Hi @ivirshup! Do you have any updates on the availability of the benchmarks on decently sized datasets?

rcannood commented 1 year ago

Having passed the 1 year anniversary for the AnnData JOSS submission, I'd really like to have this submission wrapped up and dealt with. AnnData is a great resource to researchers all over the world, and I think it'd be a pity if this work doesn't get published at JOSS.

@falexwolf The paper mentions:

Furthermore, anndata is systematically benchmarked for performance using airspeed velocity [@Droettboom13], with the results linked from the docs.

@falexwolf Can you remind me where these results are published? The Benchmarks page only links to a simple benchmark and the ivirshup/anndata-benchmarks repository, but I don't see any actual results. Could you help me find them?

falexwolf commented 1 year ago

All benchmarking referenced here is by @ivirshup, I'm not in the loop on the details. @ivirshup, could you help out?

falexwolf commented 10 months ago

Downtuned performance claims as requested by the editor: https://github.com/scverse/anndata/pull/1267

See discussion in the JOSS review: https://github.com/openjournals/joss-reviews/issues/4371

falexwolf commented 10 months ago

This completes a sequence of 3 PRs to the paper branch

For reference, the paper branch is here:

rcannood commented 10 months ago

Thanks @falexwolf! This resolves all of my outstanding comments :) Closing this issue.