openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
696 stars 36 forks source link

[REVIEW]: cerebra: A tool for fast and accurate summarizing of variant calling format (VCF) files #2432

Closed whedon closed 3 years ago

whedon commented 4 years ago

Submitting author: @lincoln-harris (Lincoln Harris) Repository: https://github.com/czbiohub/cerebra Version: v1.2.0 Editor: @lpantano Reviewer: @betteridiot, @afrubin Archive: 10.5281/zenodo.4050557

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/44041ee73f22ed846090242bdcda616a"><img src="https://joss.theoj.org/papers/44041ee73f22ed846090242bdcda616a/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/44041ee73f22ed846090242bdcda616a/status.svg)](https://joss.theoj.org/papers/44041ee73f22ed846090242bdcda616a)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@betteridiot & @afrubin, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @lpantano know.

Please try and complete your review in the next six weeks

Review checklist for @betteridiot

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @afrubin

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 4 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @betteridiot, @afrubin it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

  1. You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 4 years ago
Reference check summary:

OK DOIs

- None

MISSING DOIs

- https://doi.org/10.1093/bioinformatics/btl647 may be missing for title: Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases
- https://doi.org/10.1093/bioinformatics/btp324 may be missing for title: Fast and accurate short read alignment with Burrows-Wheeler Transform
- https://doi.org/10.1016/j.cell.2017.09.004 may be missing for title: Single-Cell Analysis of Human Pancreas Reveals Transcriptional Signatures of Aging and Somatic Mutation Patterns
- https://doi.org/10.1038/s41587-019-0074-6 may be missing for title: An open resource for accurately benchmarking small variant and reference calls
- https://doi.org/10.1038/nbt.2835 may be missing for title: Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls
- https://doi.org/10.1101/201178 may be missing for title: Scaling accurate genetic variant discovery to tens of thousands of samples
- https://doi.org/10.1186/s13059-017-1248-5 may be missing for title: BRIE: transcriptome-wide splicing quantification in single cells

INVALID DOIs

- None
whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

lpantano commented 4 years ago

Hi @betteridiot, @afrubin!, this is the review issue, please accept the invitation if you haven't done it so you can start the review here.

@lincoln-harris, can you work on adding DOI to your references?

Thanks all!

lincoln-harris commented 4 years ago

sure! i can do that

afrubin commented 4 years ago

It looks like the references aren't formatting properly, possibly due to missing commas in the author names in the .bib file. @lincoln-harris would you please have a look?

afrubin commented 4 years ago

Review

This certainly looks like a useful and performant tool for transforming VCF files into a format that is easier to work with and enables researchers to make biological inferences. I've opened some issues (referenced above) but thought it better to put high-level feedback and comments on the paper in this thread.

I have not run the software yet, but plan to follow the updated installation instructions and run some test data after https://github.com/czbiohub/cerebra/issues/83 is addressed.

Tests

Test coverage is low (81%) and excludes quite a few cases that seem like they might come up when used on diverse data. Error messages/logging may also need to be improved. For example, this section is not covered by tests and includes output to stdout with limited context: https://github.com/czbiohub/cerebra/blob/3553f707290e57440ca9c2ab617667ab7cc4323a/cerebra/find_peptide_variants.py#L65-L69

Several failure conditions (e.g. https://github.com/czbiohub/cerebra/blob/3553f707290e57440ca9c2ab617667ab7cc4323a/cerebra/germline_filter.py#L114-L115) are not covered by tests, although adding test cases for these ought to be trivial. The multiprocessor use case for germline-filter is also not tested.

Documentation

Most usable documentation is in the README. This would be easier to navigate if it were split into multiple files, but another option is to focus on improving the README's content and structure, and dropping the docs/ directory altogether.

Many parts of the documentation seem to be lacking updates and consistency. For example, the instructions in docs/installation.rst are incomplete and different to what's in the README. The docs/authors.rst references a top-level AUTHORS.rst, which doesn't exist (although there is an AUTHORS.md that doesn't include all the contributors). There are similar issues affecting other files.

The repository is in need of a "polish pass" to clean up these kinds of issues and remove unnecessary files (such as .editorconfig in the top level).

Paper

The motivation section should make it clear that an intended use case is single-cell RNA-seq data. This isn't introduced until the find-peptide-variants section, despite being the research example cited by the authors. The motivation section would also benefit from being written in a more scholarly way. The likely audience of this package are bioinformaticians working with VCFs and single-cell data, so the "typo" metaphor is unnecessary.

The authors also use the term "functional predictions" to mean predicted amino-acid changes. This package does not attempt to predict the "functional consequences of the variant", which is a much more ambitious problem. The authors should avoid using the term "functional" throughout the text and focus on what the tool is actually doing - inferring amino acid consequences given variant calls and gene annotations.

There are also numerous missing citations, across packages used (ncls is not cited despite having citation information in its GitHub README), important genomic formats (VCF and HGVS are not cited), and databases (COSMIC is not cited). I should emphasize that this is a partial list of missing citations.

Regarding the use of databases, the authors make the nonsensical claim that "variants that are not found in the COSMIC database [are] less likely to be pathogenic". While there are good reasons to filter for known variants (e.g. focusing on clinical actionability), this isn't one. COSMIC includes a tiny fraction of pathogenic variants, most of which are rare.

For variant discovery, users may want to filter for variants that are not found in COSMIC, and the authors should consider implementing this as an option. This would be especially useful if support for variant databases like gnomAD is added in the future.

Also, the paper mentions dbSNP filtering but it doesn't look like this is implemented based on the documentation.

One small nomenclature issue is that the transcripts in Fig 2 are referred to as "spliceforms" in the text but when alternative start sites are included (like t3) the more general term "isoforms" should be used.

Summary

This is a useful and interesting tool that would be appropriate for JOSS but additional work is needed across the project before it fulfills the criteria.

lincoln-harris commented 4 years ago

thanks for the comments, these are currently being addressed with https://github.com/czbiohub/cerebra/pull/84

betteridiot commented 4 years ago

@lpantano For some reason, GitHub is not allowing me to check the boxes (like in previous reviews). I may need another invitation.

lpantano commented 4 years ago

@whedon re-invite @betteridiot as reviewer

whedon commented 4 years ago

OK, the reviewer has been re-invited.

@betteridiot please accept the invite by clicking this link: https://github.com/openjournals/joss-reviews/invitations

betteridiot commented 4 years ago

Thank you. Apparently it was just a browser extension that was causing the problem.

betteridiot commented 3 years ago

For transparency's sake, I am still actively working on this review. However, I have a thesis committee meeting this week, and will be not be able to get back to the review process until then. Sorry for the delay.

lincoln-harris commented 3 years ago

the majority of reviewer comments have been addressed with the latest release. looking forward to additional feedback!

afrubin commented 3 years ago

@lincoln-harris I have a few additional comments on the writing in the germline-filter section of the paper and the README.

A more typical phrasing from cancer genomics would be "tumor vs. normal" rather than "tumor/pathogenic vs control."

The text says that germline-filter can be used "so as to not bias the results by including non-pathogenic variants." While I understand the motivation for this, you can't say that germline variants are non-pathogenic - they may well be strongly pathogenic for another disease!

Instead of "control and the experimental tissue" the text should stick with the word "sample" (and tumor/normal) for clarity.

lpantano commented 3 years ago

Hi @lincoln-harris,

Any update on the latest @afrubin's comment?

@betteridiot, any update for this review?

Thanks!

betteridiot commented 3 years ago

Yes, tomorrow I will drop all my notes on the Issues page.

betteridiot commented 3 years ago

My notes have been posted @lpantano and @lincoln-harris. If they are addressed, then my review will be complete. Sorry for the delay.

lincoln-harris commented 3 years ago

i believe everything has now been addressed.

most changes were made in this PR, which I've since integrated into master. that PR is annotated with a pretty thorough checklist of everything that I improved upon.

if I'm missing anything or there are any further comments, please let me know!

lincoln-harris commented 3 years ago

@lpantano any updates?

lpantano commented 3 years ago

@betteridiot, @afrubin, do you think you can update on the final version? thanks

lpantano commented 3 years ago

@lincoln-harris, can you comment on the open issues opened by @betteridiot? if they are addressed or not?

betteridiot commented 3 years ago

Only issues of mine not addressed yet is https://github.com/czbiohub/cerebra/issues/108. It is only in regard to community guidelines. However, this is a very minor issue as a CONTRIBUTING.md is already provided. Therefore, this paper is ready for publication on my end.

lpantano commented 3 years ago

Thank you @betteridiot. @afrubin, Did the authors addressed all your comments? Thanks.

lpantano commented 3 years ago

Hey @afrubin, any final comments for this paper? Thanks!

afrubin commented 3 years ago

Hi @lpantano, there are still a few open issues (such as https://github.com/czbiohub/cerebra/issues/83#issuecomment-681311968) and I feel like I should read the final version of the manuscript but I can't find a current PDF. I think it is very close though!

lincoln-harris commented 3 years ago

sorry I've been away for a few weeks. @afrubin I believe https://github.com/czbiohub/cerebra/issues/83#issuecomment-681311968 has now been addressed.

the paper.md on the master branch is fully updated: https://github.com/czbiohub/cerebra/tree/master/paper. would you like me to include a PDF in that folder, or perhaps compile using Whedon and send to you?

afrubin commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

afrubin commented 3 years ago

Looks good to me. My only remaining comment is that there's no API-level documentation for the code itself, and it looks like substantial extra work would be needed to generate the docstrings sphinx.ext.autodoc or similar tools would require.

However, given the focused nature of the software package (being an analysis tool not a library) and the detail in the usage guide, this might be OK. Happy to hear @lpantano weigh in on this.

lpantano commented 3 years ago

Thanks everybody! I think they didn't aim the tool as a library, and I will feel ok not to mention the API. Although it could be a goal in the long term.

@afrubin, @betteridiot can you check all the items in the review list in the first comment of this review if you agree all has been addressed?

If they check all the items, @lincoln-harris, can you confirm the latest version, and prepare an archive for that version in zenodo where the authors list and title matches the paper and send me the DOI?

Thanks!

lincoln-harris commented 3 years ago

here is the DOI: http://doi.org/10.5281/zenodo.4037272

lpantano commented 3 years ago

Hey @afrubin, @betteridiot, can you make sure all the items in the checklist are done? I see a couple of items unchecked and I want to make sure you are fine with them.

lpantano commented 3 years ago

@lincoln-harris, thank you for the DOI. We need the archive to contain all the code as well, so it should be a copy of the repository at the time of publication. Can you update it, please? Thank you!!!

lpantano commented 3 years ago

@whedon set v1.2.0 as version

whedon commented 3 years ago

OK. v1.2.0 is the version.

afrubin commented 3 years ago

@lpantano All items are checked now. Thanks for the reminder!

betteridiot commented 3 years ago

All checked

lincoln-harris commented 3 years ago

@lpantano requesting zenodo access from the git organization. will be able to upload the code archive as soon as that happens.

lincoln-harris commented 3 years ago

ok the git archive has now been added to zenodo. here is the DOI, for latest version of the code, v1.2.0: DOI

lpantano commented 3 years ago

@lincoln-harris thanks for updating this, could you update the title to match the paper’s title? That would be the last thing missing. Thanks!

lincoln-harris commented 3 years ago

done, the titles now match!

lpantano commented 3 years ago

@whedon set 10.5281/zenodo.4050557 as archive

whedon commented 3 years ago

OK. 10.5281/zenodo.4050557 is the archive.

lpantano commented 3 years ago

@whedon accept

whedon commented 3 years ago
Attempting dry run of processing paper acceptance...
whedon commented 3 years ago

PDF failed to compile for issue #2432 with the following error:

/app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-364ded062842/lib/whedon/bibtex_parser.rb:77:in doi_citation': undefined methodencode' for nil:NilClass (NoMethodError) from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-364ded062842/lib/whedon/bibtex_parser.rb:64:in make_citation' from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-364ded062842/lib/whedon/bibtex_parser.rb:50:inblock in generate_citations' from /app/vendor/bundle/ruby/2.4.0/gems/bibtex-ruby-5.1.4/lib/bibtex/bibliography.rb:149:in each' from /app/vendor/bundle/ruby/2.4.0/gems/bibtex-ruby-5.1.4/lib/bibtex/bibliography.rb:149:ineach' from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-364ded062842/lib/whedon/bibtex_parser.rb:43:in generate_citations' from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-364ded062842/lib/whedon/compilers.rb:246:incrossref_from_markdown' from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-364ded062842/lib/whedon/compilers.rb:21:in generate_crossref' from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-364ded062842/lib/whedon/processor.rb:95:incompile' from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-364ded062842/bin/whedon:82:in compile' from /app/vendor/bundle/ruby/2.4.0/gems/thor-0.20.3/lib/thor/command.rb:27:inrun' from /app/vendor/bundle/ruby/2.4.0/gems/thor-0.20.3/lib/thor/invocation.rb:126:in invoke_command' from /app/vendor/bundle/ruby/2.4.0/gems/thor-0.20.3/lib/thor.rb:387:indispatch' from /app/vendor/bundle/ruby/2.4.0/gems/thor-0.20.3/lib/thor/base.rb:466:in start' from /app/vendor/bundle/ruby/2.4.0/bundler/gems/whedon-364ded062842/bin/whedon:119:in<top (required)>' from /app/vendor/bundle/ruby/2.4.0/bin/whedon:23:in load' from /app/vendor/bundle/ruby/2.4.0/bin/whedon:23:in

'

lpantano commented 3 years ago

ops, @lincoln-harris do you think is related to your paper? did you change something from the last time it compiled here?

lincoln-harris commented 3 years ago

I dont think so? paper.md hasnt been changed since Aug 26, and it seemed to compile fine Sept 15th

it seems to be compiling fine with the whedon paper preview tool

Kevin-Mattheus-Moerman commented 3 years ago

@whedon accept