nf-core / kmermaid

k-mer similarity analysis pipeline
https://nf-co.re/kmermaid
MIT License

Pseudo PR - community review for initial release #112

Closed ewels closed 3 years ago

ewels commented 4 years ago

Do not merge - pseudo PR to give review interface for entire pipeline

Community review for initial release. Aim is to get at least two PR approvals from people in the nf-core community who are not the main developers of this pipeline ✅

https://nf-co.re/developers/adding_pipelines#making-the-first-release

ewels commented 4 years ago

Before a more in-depth look at the code, here are a few things that need resolving from a first glance:

Template update

You’re a bit behind the main nf-core template. You created the TEMPLATE branch with version 1.9 of nf-core/tools and we’re on version 1.10.2 (about to release 1.10.3). The new minor patch release might come very soon so hopefully you’ll get an automated sync PR if that happens.

Either way, you need to update the TEMPLATE branch with the latest version and then bring those updates across to the main pipeline code. The update should hopefully not be too tough, mostly additions. For example, you're missing a bunch of GitHub actions files (kmermaid vs template) which will be added.

Test warnings

Generally the tests are looking great - no failures ✅ , however there are a few warnings. These are mostly about updating a bunch of conda packages (no need to do all of these, but some look like they should be easy). The others should be fixed by the template update.

nf-core lint overall result: Passed with warnings ❗

Posted for pipeline commit e1f736c

+| ✅ 109 tests passed       |+
!| ❗  21 tests had warnings |!
-| ❌   0 tests failed       |-
All lint test results

### :heavy_exclamation_mark: Test warnings:

* [Test #1](https://nf-co.re/errors#1) - File not found: `.github/workflows/awstest.yml`
* [Test #1](https://nf-co.re/errors#1) - File not found: `.github/workflows/awsfulltest.yml`
* [Test #6](https://nf-co.re/errors#6) - Found a bioconda environment.yml file but no badge in the README
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `conda-forge::python=3.7.3`, `3.9.0` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `conda-forge::markdown=3.1.1`, `3.3.2` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `conda-forge::pymdown-extensions=6.0`, `8.0.1` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `conda-forge::pygments=2.5.2`, `2.7.1` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `conda-forge::tqdm=4.43.0`, `4.50.2` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `conda-forge::gxx_linux-64=7.3.0`, `9.3.0` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `conda-forge::s3fs=0.4.2`, `0.5.1` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `bioconda::samtools=1.10`, `1.11` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `bioconda::pysam=0.16.0`, `0.16.0.1` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `fastp=0.20.0`, `0.20.1` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `matplotlib=3.1.1`, `3.3.2` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `multiqc=1.8`, `1.9` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `numpy=1.17.5`, `1.19.2` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `pathos=0.2.5`, `0.2.6` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `pip=20.0.2`, `20.2.4` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `pytest=5.3.4`, `6.1.1` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `sphinx=2.3.1`, `3.2.1` available
* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: `sortmerna=2.1b`, `4.2.0` available

### :white_check_mark: Tests passed:

* [Test #1](https://nf-co.re/errors#1) - File found: `nextflow.config`
* [Test #1](https://nf-co.re/errors#1) - File found: `nextflow_schema.json`
* [Test #1](https://nf-co.re/errors#1) - File found: `Dockerfile`
* [Test #1](https://nf-co.re/errors#1) - File found: `LICENSE` or `LICENSE.md` or `LICENCE` or `LICENCE.md`
* [Test #1](https://nf-co.re/errors#1) - File found: `README.md`
* [Test #1](https://nf-co.re/errors#1) - File found: `CHANGELOG.md`
* [Test #1](https://nf-co.re/errors#1) - File found: `docs/README.md`
* [Test #1](https://nf-co.re/errors#1) - File found: `docs/output.md`
* [Test #1](https://nf-co.re/errors#1) - File found: `docs/usage.md`
* [Test #1](https://nf-co.re/errors#1) - File found: `.github/workflows/branch.yml`
* [Test #1](https://nf-co.re/errors#1) - File found: `.github/workflows/ci.yml`
* [Test #1](https://nf-co.re/errors#1) - File found: `.github/workflows/linting.yml`
* [Test #1](https://nf-co.re/errors#1) - File found: `main.nf`
* [Test #1](https://nf-co.re/errors#1) - File found: `environment.yml`
* [Test #1](https://nf-co.re/errors#1) - File found: `conf/base.config`
* [Test #1](https://nf-co.re/errors#1) - File not found check: `Singularity`
* [Test #1](https://nf-co.re/errors#1) - File not found check: `parameters.settings.json`
* [Test #1](https://nf-co.re/errors#1) - File not found check: `.travis.yml`
* [Test #3](https://nf-co.re/errors#3) - Licence check passed
* [Test #2](https://nf-co.re/errors#2) - Dockerfile check passed
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `manifest.name`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `manifest.nextflowVersion`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `manifest.description`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `manifest.version`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `manifest.homePage`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `timeline.enabled`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `trace.enabled`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `report.enabled`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `dag.enabled`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `process.cpus`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `process.memory`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `process.time`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `params.outdir`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `params.input`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `manifest.mainScript`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `timeline.file`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `trace.file`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `report.file`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `dag.file`
* [Test #4](https://nf-co.re/errors#4) - Config variable found: `process.container`
* [Test #4](https://nf-co.re/errors#4) - Config variable (correctly) not found: `params.version`
* [Test #4](https://nf-co.re/errors#4) - Config variable (correctly) not found: `params.nf_required_version`
* [Test #4](https://nf-co.re/errors#4) - Config variable (correctly) not found: `params.container`
* [Test #4](https://nf-co.re/errors#4) - Config variable (correctly) not found: `params.singleEnd`
* [Test #4](https://nf-co.re/errors#4) - Config variable (correctly) not found: `params.igenomesIgnore`
* [Test #4](https://nf-co.re/errors#4) - Config `timeline.enabled` had correct value: `true`
* [Test #4](https://nf-co.re/errors#4) - Config `report.enabled` had correct value: `true`
* [Test #4](https://nf-co.re/errors#4) - Config `trace.enabled` had correct value: `true`
* [Test #4](https://nf-co.re/errors#4) - Config `dag.enabled` had correct value: `true`
* [Test #4](https://nf-co.re/errors#4) - Config `manifest.name` began with `nf-core/`
* [Test #4](https://nf-co.re/errors#4) - Config variable `manifest.homePage` began with https://github.com/nf-core/
* [Test #4](https://nf-co.re/errors#4) - Config `dag.file` ended with `.svg`
* [Test #4](https://nf-co.re/errors#4) - Config variable `manifest.nextflowVersion` started with >= or !>=
* [Test #4](https://nf-co.re/errors#4) - Config `process.container` looks correct: `nfcore/kmermaid:dev`
* [Test #4](https://nf-co.re/errors#4) - Config `manifest.version` ends in `dev`: `'1.0.0dev'`
* [Test #5](https://nf-co.re/errors#5) - GitHub Actions 'branch' workflow is triggered for PRs to master: `./.github/workflows/branch.yml`
* [Test #5](https://nf-co.re/errors#5) - GitHub Actions 'branch' workflow looks good: `./.github/workflows/branch.yml`
* [Test #5](https://nf-co.re/errors#5) - GitHub Actions CI is triggered on expected events: `./.github/workflows/ci.yml`
* [Test #5](https://nf-co.re/errors#5) - CI is building the correct docker image: `docker build --no-cache . -t nfcore/kmermaid:dev`
* [Test #5](https://nf-co.re/errors#5) - CI is pulling the correct docker image: docker pull nfcore/kmermaid:dev
* [Test #5](https://nf-co.re/errors#5) - CI is tagging docker image correctly: docker tag nfcore/kmermaid:dev nfcore/kmermaid:dev
* [Test #5](https://nf-co.re/errors#5) - Continuous integration checks minimum NF version: `./.github/workflows/ci.yml`
* [Test #5](https://nf-co.re/errors#5) - GitHub Actions linting workflow is triggered on PR and push: `./.github/workflows/linting.yml`
* [Test #5](https://nf-co.re/errors#5) - Continuous integration runs Markdown lint Tests: `./.github/workflows/linting.yml`
* [Test #5](https://nf-co.re/errors#5) - Continuous integration runs nf-core lint Tests: `./.github/workflows/linting.yml`
* [Test #6](https://nf-co.re/errors#6) - README Nextflow minimum version badge matched config. Badge: `20.07.1`, Config: `20.07.1`
* [Test #8](https://nf-co.re/errors#8) - Conda environment name was correct (nf-core-kmermaid-1.0.0dev)
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `conda-forge::python=3.7.3`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `conda-forge::markdown=3.1.1`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `conda-forge::pymdown-extensions=6.0`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `conda-forge::pygments=2.5.2`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `conda-forge::tqdm=4.43.0`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `conda-forge::gxx_linux-64=7.3.0`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `conda-forge::s3fs=0.4.2`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `bioconda::sourmash=3.5.0`
* [Test #8](https://nf-co.re/errors#8) - Conda package is latest available: `bioconda::sourmash=3.5.0`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `bioconda::samtools=1.10`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `bioconda::screed=1.0.4`
* [Test #8](https://nf-co.re/errors#8) - Conda package is latest available: `bioconda::screed=1.0.4`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `bioconda::khmer=3.0.0a3`
* [Test #8](https://nf-co.re/errors#8) - Conda package is latest available: `bioconda::khmer=3.0.0a3`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `bioconda::pysam=0.16.0`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `anaconda::make=4.2.1`
* [Test #8](https://nf-co.re/errors#8) - Conda package is latest available: `anaconda::make=4.2.1`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `alabaster=0.7.12`
* [Test #8](https://nf-co.re/errors#8) - Conda package is latest available: `alabaster=0.7.12`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `fastp=0.20.0`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `fastqc=0.11.9`
* [Test #8](https://nf-co.re/errors#8) - Conda package is latest available: `fastqc=0.11.9`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `matplotlib=3.1.1`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `multiqc=1.8`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `numpy=1.17.5`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `pathos=0.2.5`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `pip=20.0.2`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `pytest=5.3.4`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `seqtk=1.3`
* [Test #8](https://nf-co.re/errors#8) - Conda package is latest available: `seqtk=1.3`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `ska=1.0`
* [Test #8](https://nf-co.re/errors#8) - Conda package is latest available: `ska=1.0`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `sphinx=2.3.1`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `jupyter=1.0.0`
* [Test #8](https://nf-co.re/errors#8) - Conda package is latest available: `jupyter=1.0.0`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `sortmerna=2.1b`
* [Test #8](https://nf-co.re/errors#8) - Conda dependency had pinned version number: `ripgrep=12.1.1`
* [Test #8](https://nf-co.re/errors#8) - Conda package is latest available: `ripgrep=12.1.1`
* [Test #8](https://nf-co.re/errors#8) - Pip dependency had pinned version number: bam2fasta==1.0.8
* [Test #8](https://nf-co.re/errors#8) - PyPi package is latest available: 1.0.8
* [Test #8](https://nf-co.re/errors#8) - Pip dependency had pinned version number: sencha==1.0.3
* [Test #8](https://nf-co.re/errors#8) - PyPi package is latest available: 1.0.3

### Run details:

* nf-core/tools version 1.11.dev0
* Run at `2020-10-22 09:36:48`
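
To get a quick list of which pins to bump, the "not latest available" warnings can be pulled out of the report text. The following is only a rough sketch: the function name `outdated_packages` is made up for illustration and is not part of nf-core/tools, and the pattern is written against the wording of the warnings shown in this report.

```python
import re

# Matches warnings of the form:
#   Conda package is not latest available: `channel::name=1.0`, `1.1` available
OUTDATED = re.compile(
    r"Conda package is not latest available: "
    r"`(?P<spec>[^`]+)`, `(?P<latest>[^`]+)` available"
)


def outdated_packages(report: str) -> dict[str, tuple[str, str]]:
    """Map package name -> (pinned version, latest version). Hypothetical helper."""
    found = {}
    for match in OUTDATED.finditer(report):
        # Split "conda-forge::python=3.7.3" into name part and pinned version.
        name, _, pinned = match.group("spec").rpartition("=")
        # Drop an optional "channel::" prefix.
        name = name.split("::")[-1]
        found[name] = (pinned, match.group("latest"))
    return found


sample = (
    "* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: "
    "`conda-forge::python=3.7.3`, `3.9.0` available "
    "* [Test #8](https://nf-co.re/errors#8) - Conda package is not latest available: "
    "`multiqc=1.8`, `1.9` available"
)

for name, (pinned, latest) in outdated_packages(sample).items():
    print(f"{name}: {pinned} -> {latest}")
```

As noted above, there is no need to update every package; the output is just a convenient checklist.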

Files that shouldn't be there

I can see a few files that can be / should be removed I think:

olgabot commented 4 years ago

Hello everyone! Thank you for the detailed comments. Responding to @MaxUlysse's comments below (which address @ewels's concerns and more!):

Switching to a single comment instead of commenting into files, not to spam everyone.

  • .gitignore seems to be a bit overkill to me.

Not sure that this is a major problem? There are Python files in the bin and when I edit with PyCharm, then excess files get created. It seems to be fine to specify all the possible files that may want to be ignored, but open to suggestions!

  • Dockerfile I guess that the last lines are to test that tools are working. Is it really necessary?

Yes, it helps to make sure everything got installed properly in the Docker image. I've had issues with the Docker image even working in the first place without having those checks in the past.
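
For illustration only, the same kind of "did everything install?" check could also be written as a small script. This is not what the Dockerfile actually contains: the helper name `missing_tools` and the tool list are assumptions, with the tool names taken from packages that appear in the lint report.

```python
import shutil


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumed list of executables; adjust to whatever the image should provide.
REQUIRED = ["sourmash", "samtools", "fastp", "fastqc", "multiqc"]

missing = missing_tools(REQUIRED)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All tools found")
```

Run during the image build, a script like this only confirms that each executable is on `PATH`; it does not confirm that the tool starts correctly.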

  • Makefile I don't think it's actually useful in the repo

This speeds up my local workflow to build the docker image locally and not have to look up the docker build command every time. I can remove the docker push rule if that would help.

  • README:

    • There are still some TODO comments left.

👍

    • Usage should be made less specific

👍

    • I think you can already set up the Zenodo since you have a tag

👍

  • There are two assets/email_template files; not sure which one we should keep, html or txt

According to the latest nf-core/tools template, both are needed? https://github.com/nf-core/tools/tree/master/nf_core/pipeline-template/%7B%7Bcookiecutter.name_noslash%7D%7D/assets

  • assets/rrna-db-defaults.txt what is this file used for?

It's an input to SortMeRNA for removing ribosomal reads.

https://github.com/nf-core/kmermaid/blob/e1f736c5c694037d548af53e8d1d1cacf120327f/nextflow.config#L60

  • Remove bin/markdown_to_html.r, conf/awsbatch.config, docs/configuration/adding_your_own.md, docs/installation.md, docs/troubleshooting.md

👍

  • docker/sysctl.conf what is this file used for?

It was needed to deal with some temporary directory issues in Docker, and I think to help with some error messages.

  • params

    • --csv_pairs and --csv_singles could both be replaced by a single --input. Since you have a header, you could check the number of columns to assess whether the input is single- or paired-end.

I would like users to be able to provide both --csv_pairs and --csv_singles at once, rather than a single --input. There are cases where one would want to run the pipeline, which does an end-to-end k-mer similarity comparison, and get a matrix of sample-sample similarities at the end. Unlike nf-core/rnaseq, where the matrices from separate runs can simply be concatenated, here all the samples need to be provided at once.
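
For reference, the column-count check suggested in the quoted comment could look roughly like the sketch below. The column layouts (`sample_id,read1` for single-end and `sample_id,read1,read2` for paired-end) and the function name `read_layout` are assumptions for illustration, not the pipeline's actual CSV format.

```python
import csv
import io


def read_layout(csv_text: str) -> str:
    """Guess 'single' or 'paired' from the number of header columns."""
    header = next(csv.reader(io.StringIO(csv_text)))
    # Assumed layouts: sample_id,read1 (2 columns)
    #              or  sample_id,read1,read2 (3 columns).
    if len(header) == 2:
        return "single"
    if len(header) == 3:
        return "paired"
    raise ValueError(f"Unexpected number of columns: {len(header)}")


print(read_layout("sample_id,read1\nheart,heart.fastq.gz\n"))
print(read_layout("sample_id,read1,read2\nbrain,brain_R1.fastq.gz,brain_R2.fastq.gz\n"))
```

A check like this tells the two layouts apart, but on its own it does not cover the case described above of supplying single-end and paired-end samples in the same run.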

    • --fastas and --protein_fastas why plural?

They are not a reference genome or proteome like in other pipelines (the only user-provided reference data is --reference_proteome_fasta). They are input options to the pipeline, so they are plural. --fastas is for nucleotide sequence input, e.g. for a microbial genome or a fasta of human transcripts, and --protein_fastas is for protein sequence input, e.g. for protein-coding transcripts.

This pipeline takes several possible kinds of sequence as input, optionally removes ribosomal RNA, optionally translates nucleotide -> protein, then subsamples k-mers and computes an all-by-all k-mer similarity. The diversity of input options is intentional, to provide the ability to input:

  1. Paired end reads, e.g. from Brawand2011 brain samples which were all paired end
  2. Single end reads, e.g. the remaining non-brain tissues (heart, liver, kidney, etc) samples from Brawand2011
  3. Paired end reads as a CSV with user-defined sample IDs that don't have to match the fastq names
  4. Single end reads as a CSV with user-defined sample IDs that don't have to match the fastq names
  5. Nucleotide fastas with no sequencing quality scores, e.g. "compute the k-mer similarity of these assembled transcriptomes"
  6. Protein fastas with no sequencing quality scores, e.g. "compute the k-mer similarity of these translated assembled transcriptomes"
  7. SRA ids of a newly-uploaded dataset

And all of the above at once! This pipeline is supposed to be as flexible as possible to compare the maximum number of samples' k-mer similarities. It's typically not a "one-and-done" pipeline - I often run it many times on different configurations of the same samples.
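
To make "all-by-all k-mer similarity" concrete, here is a toy sketch that computes a Jaccard similarity between every pair of samples. It is not the pipeline's implementation: it uses every k-mer rather than a subsample, works on plain strings, and the function names are invented for illustration.

```python
from itertools import combinations


def kmers(sequence: str, k: int) -> set[str]:
    """All distinct substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared k-mers divided by total distinct k-mers (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(samples: dict[str, str], k: int) -> dict[str, dict[str, float]]:
    """Symmetric sample-by-sample matrix of Jaccard similarities."""
    sets = {name: kmers(seq, k) for name, seq in samples.items()}
    # Each sample is defined as fully similar to itself.
    matrix = {name: {name: 1.0} for name in sets}
    for x, y in combinations(sets, 2):
        matrix[x][y] = matrix[y][x] = jaccard(sets[x], sets[y])
    return matrix


matrix = similarity_matrix({"a": "ACGTACGT", "b": "ACGTTTTT"}, k=3)
print(matrix["a"]["b"])
```

Because every sample is compared with every other sample, the matrix for a combined run cannot be assembled by stacking the matrices of separate runs, which is why all inputs go into one run.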

    • --removeRiboRNA, --saveNonRiboRNAReads, --rNA_database_manifest are not snake_cased.

This was copied from nf-core/rnaseq with abandon and lazily not changed. :)
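
A quick way to spot the remaining offenders is a small name check like the sketch below. The regular expression is one reasonable definition of snake_case (lowercase letters and digits separated by single underscores), not an official nf-core rule, and `is_snake_case` is a made-up helper name.

```python
import re

# Lowercase words made of letters/digits, joined by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(param: str) -> bool:
    """True if the parameter name (leading dashes ignored) is snake_case."""
    return SNAKE_CASE.fullmatch(param.lstrip("-")) is not None


for param in ["--removeRiboRNA", "--saveNonRiboRNAReads",
              "--rNA_database_manifest", "--csv_pairs"]:
    print(param, is_snake_case(param))
```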

    • --molecule or --molecules, I've seen both in the docs and configs so far.

👍

    • --save_fastas should be --save_fasta_dir

👍

  • Why the environment_hulk.yml file?

We have a local machine that needs a lot of love and care when installing kmermaid so we made it a special environment file.

  • nextflow.config not up to date with TEMPLATE

These PRs didn't do it?

  • there should not be any testing data in the repo, but in the nf-core/test-datasets one, on the kmermaid branch

👍

Apart from these tiny comments, I am quite happy with the pipeline, I think you're mainly missing updates from the TEMPLATE, and maybe some input params improvement.

🎉 Thanks! Glad to hear! 🎉

maxulysse commented 4 years ago

Thanks for your reply and comments. I'm just replying to the ones that need it, as you were very clear on all the others.

  • .gitignore seems to be a bit overkill to me.

Not sure that this is a major problem? There are Python files in the bin and when I edit with PyCharm, then excess files get created. It seems to be fine to specify all the possible files that may want to be ignored, but open to suggestions!

Not a problem at all, I agree, it was more a remark.

  • Dockerfile I guess that the last lines are to test that tools are working. Is it really necessary?

Yes, it helps to make sure everything got installed properly in the Docker image. I've had issues with the Docker image even working in the first place without having those checks in the past.

OK, then I have no issue keeping that. I'm pretty sure that with DSL2 we're moving away from Dockerfiles anyway.

  • Makefile I don't think it's actually useful in the repo

This speeds up my local workflow to build the docker image locally and not have to look up the docker build command every time. I can remove the docker push rule if that would help.

You said it yourself: This speeds up my local workflow, I don't think it belongs here.

  • There are two assets/email_template files; not sure which one we should keep, html or txt

According to the latest nf-core/tools template, both are needed? https://github.com/nf-core/tools/tree/master/nf_core/pipeline-template/%7B%7Bcookiecutter.name_noslash%7D%7D/assets

Totally right, my bad...

  • Why the environment_hulk.yml file?

We have a local machine that needs a lot of love and care when installing kmermaid so we made it a special environment file.

You said it yourself: We have a local machine, I don't think it belongs here.

BTW, we do have a Hulk as well at NGI ;-)

  • nextflow.config not up to date with TEMPLATE

These PRs didn't do it?

* #93

* #110

At least these lines are not in your config file: https://github.com/nf-core/tools/blob/f14c7a5692c96896c2ed1b730bc355093dca2247/nf_core/pipeline-template/%7B%7Bcookiecutter.name_noslash%7D%7D/nextflow.config#L82-L87

(That is what made me think it might not have been up to date to start with.)

I will take a more in-depth look to make sure.

Apart from these tiny comments, I am quite happy with the pipeline, I think you're mainly missing updates from the TEMPLATE, and maybe some input params improvement.

🎉 Thanks! Glad to hear! 🎉

Looking forward to the big release!!!

ewels commented 4 years ago

Regarding the template merge, I guessed you used version 1.9 because of this line: https://github.com/nf-core/kmermaid/blob/7907ca5dc827fd69895a6aeea8d17c39f72184d7/Dockerfile#L1

But if you don't think that you did, then it probably warrants a closer comparison between the template files and the pipeline files...

+1 for removing institute specific stuff. I know it's a hassle, we had to do similar things for nearly all of our pipelines when we ported them to be nf-core instead of SciLifeLab (it wasn't immediate, an uppmax config profile shipped with all pipelines for a long time). But this is really the heart of nf-core: that pipelines are generalised as much as is possible so as to be first-class pipelines for anyone anywhere.

ewels commented 4 years ago

Stupid "Close with comment" button is too close to the "Comment" button, sorry..