sanger-tol / blobtoolkit

Nextflow pipeline for BlobToolKit for Sanger ToL production suite
https://pipelines.tol.sanger.ac.uk/blobtoolkit
MIT License
10 stars 0 forks source link

Busco subworkflow #28

Closed alxndrdiaz closed 1 year ago

alxndrdiaz commented 1 year ago

PR checklist

This PR is part of the Google Summer of Code 2022 project: Conversion of the BlobToolKit pipeline to Nextflow, GitHub repository: https://github.com/sanger-tol/blobtoolkit

alxndrdiaz commented 1 year ago

The subworflow stops at the BUSCO step:

 busco \
      --cpu 2 \
      --in "$INPUT_SEQS" \
      --out GCA_922984935.2.subset.unmasked-[carnivora_odb10, laurasiatheria_odb10, eutheria_odb10, mammalia_odb10, tetrapoda_odb10, vertebrata_odb10, metazoa_odb10, eukaryota_odb10, bacteria_odb10, archaea_odb10]-busco \
      --lineage_dataset [carnivora_odb10, laurasiatheria_odb10, eutheria_odb10, mammalia_odb10, tetrapoda_odb10, vertebrata_odb10, metazoa_odb10, eukaryota_odb10, bacteria_odb10, archaea_odb10] \
       \
       \
      --mode genome

Command error:
  usage: busco -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]
  busco: error: unrecognized arguments: laurasiatheria_odb10, eutheria_odb10, mammalia_odb10, tetrapoda_odb10, vertebrata_odb10, metazoa_odb10, eukaryota_odb10, bacteria_odb10, archaea_odb10]-busco laurasiatheria_odb10, eutheria_odb10, mammalia_odb10, tetrapoda_odb10, vertebrata_odb10, metazoa_odb10, eukaryota_odb10, bacteria_odb10, archaea_odb10]

From the error description it seems that the method used to convert the text file input to a list of values is not working as expected see line 57 in the subworkflow script.

alxndrdiaz commented 1 year ago

Had to restore the branch using git-repair then push a single commit that includes all changes (git interruption due to laptop power off), so these are the two commits associated to the recovery that were pushed:

fbef6f - alxndrdiaz, 3 minutes ago : fixed conflicts
0e1b1df - alxndrdiaz, 15 minutes ago : restored busco_subworkflow branch

Old single commits associated to requested changes:

f391868 - alxndrdiaz, 53 seconds ago : added BUSCO lineages path
1813171 - alxndrdiaz, 4 minutes ago : updated BUSCO input uusing single channel
809a1e2 - alxndrdiaz, 22 minutes ago : removed goat_taxosearch parameterd from subworkflow input
f17ba13 - alxndrdiaz, 26 minutes ago : fasta updated to tuple containing [meta, fasta]
9252c48 - alxndrdiaz, 32 minutes ago : added input tuple forBUSCO_DIAMOND which includes meta
af0fffa - alxndrdiaz, 38 minutes ago : restored meta to generate prefix
afb0cd4 - alxndrdiaz, 81 minutes ago : restored tuple input
4f9d553 - alxndrdiaz, 2 hours ago : removed SAMTOOLS_VIEW module
a70da26 - alxndrdiaz, 2 hours ago : removed samtools_view.nf module
9530327 - alxndrdiaz, 2 hours ago : moved module parameters from test.config to modules.config
priyanka-surana commented 1 year ago

The subworflow stops at the BUSCO step:

 busco \
      --cpu 2 \
      --in "$INPUT_SEQS" \
      --out GCA_922984935.2.subset.unmasked-[carnivora_odb10, laurasiatheria_odb10, eutheria_odb10, mammalia_odb10, tetrapoda_odb10, vertebrata_odb10, metazoa_odb10, eukaryota_odb10, bacteria_odb10, archaea_odb10]-busco \
      --lineage_dataset [carnivora_odb10, laurasiatheria_odb10, eutheria_odb10, mammalia_odb10, tetrapoda_odb10, vertebrata_odb10, metazoa_odb10, eukaryota_odb10, bacteria_odb10, archaea_odb10] \
       \
       \
      --mode genome

Command error:
  usage: busco -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]
  busco: error: unrecognized arguments: laurasiatheria_odb10, eutheria_odb10, mammalia_odb10, tetrapoda_odb10, vertebrata_odb10, metazoa_odb10, eukaryota_odb10, bacteria_odb10, archaea_odb10]-busco laurasiatheria_odb10, eutheria_odb10, mammalia_odb10, tetrapoda_odb10, vertebrata_odb10, metazoa_odb10, eukaryota_odb10, bacteria_odb10, archaea_odb10]

From the error description it seems that the method used to convert the text file input to a list of values is not working as expected see line 57 in the subworkflow script.

Have you tried fromList or something similar. Not sure if it already solved or not.

alxndrdiaz commented 1 year ago

The subworflow stops at the BUSCO step:

 busco \
      --cpu 2 \
      --in "$INPUT_SEQS" \
      --out GCA_922984935.2.subset.unmasked-[carnivora_odb10, laurasiatheria_odb10, eutheria_odb10, mammalia_odb10, tetrapoda_odb10, vertebrata_odb10, metazoa_odb10, eukaryota_odb10, bacteria_odb10, archaea_odb10]-busco \
      --lineage_dataset [carnivora_odb10, laurasiatheria_odb10, eutheria_odb10, mammalia_odb10, tetrapoda_odb10, vertebrata_odb10, metazoa_odb10, eukaryota_odb10, bacteria_odb10, archaea_odb10] \
       \
       \
      --mode genome

Command error:
  usage: busco -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]
  busco: error: unrecognized arguments: laurasiatheria_odb10, eutheria_odb10, mammalia_odb10, tetrapoda_odb10, vertebrata_odb10, metazoa_odb10, eukaryota_odb10, bacteria_odb10, archaea_odb10]-busco laurasiatheria_odb10, eutheria_odb10, mammalia_odb10, tetrapoda_odb10, vertebrata_odb10, metazoa_odb10, eukaryota_odb10, bacteria_odb10, archaea_odb10]

From the error description it seems that the method used to convert the text file input to a list of values is not working as expected see line 57 in the subworkflow script.

Have you tried fromList or something similar. Not sure if it already solved or not.

I tried it but got another error. As far as I understand I should pass a list of lineages in this case. So, still unsolved.

muffato commented 1 year ago

I managed to Busco work. I think the problem was that each, which is used in BUSCO to unroll the list of lineages https://github.com/nf-core/modules/blob/master/modules/busco/main.nf#L12 doesn't work on dynamic lists, cf https://github.com/nextflow-io/nextflow/issues/1566 and the green box at https://www.nextflow.io/docs/latest/process.html#input-repeaters But also, when given a channel, each will run each element in it against the other inputs, which is actually not great, because it could potentially match a fasta file against a lineage coming from another fasta file (if the sub-workflow is called multiple times).

My solution, which I have pushed in the commit 178a8f2, is to turn the each into a regular val in the busco module, make goat_taxon_search output meta in the lineages output channel, and some Nextflow channel manipulation.

I'll raise an issue on nf-core/modules about this each in the Busco module.

muffato commented 1 year ago

✌🏼

Input_file      Dataset Complete        Single  Duplicated      Fragmented      Missing n_markers       Scaffold N50    Contigs N50     Percent gaps    Number of scaffolds
GCA_927399515.1.fa      agaricomycetes_odb10    96.4    95.3    1.1     0.5     3.1     2898    2582225 2582225 0.002%  14
GCA_927399515.1.fa      eukaryota_odb10 95.7    93.3    2.4     3.1     1.2     255     2582225 2582225 0.002%  14
GCA_927399515.1.fa      polyporales_odb10       95.8    94.8    1.0     0.8     3.4     4464    2582225 2582225 0.002%  14
GCA_927399515.1.fa      bacteria_odb10  7.3     7.3     0.0     16.9    75.8    124     2582225 2582225 0.002%  14
GCA_927399515.1.fa      archaea_odb10   5.7     5.2     0.5     13.4    80.9    194     2582225 2582225 0.002%  14
GCA_927399515.1.fa      fungi_odb10     97.7    96.2    1.5     0.3     2.0     758     2582225 2582225 0.002%  14
GCA_927399515.1.fa      basidiomycota_odb10     96.4    94.9    1.5     0.7     2.9     1764    2582225 2582225 0.002%  14
alxndrdiaz commented 1 year ago

✌🏼

Input_file      Dataset Complete        Single  Duplicated      Fragmented      Missing n_markers       Scaffold N50    Contigs N50     Percent gaps    Number of scaffolds
GCA_927399515.1.fa      agaricomycetes_odb10    96.4    95.3    1.1     0.5     3.1     2898    2582225 2582225 0.002%  14
GCA_927399515.1.fa      eukaryota_odb10 95.7    93.3    2.4     3.1     1.2     255     2582225 2582225 0.002%  14
GCA_927399515.1.fa      polyporales_odb10       95.8    94.8    1.0     0.8     3.4     4464    2582225 2582225 0.002%  14
GCA_927399515.1.fa      bacteria_odb10  7.3     7.3     0.0     16.9    75.8    124     2582225 2582225 0.002%  14
GCA_927399515.1.fa      archaea_odb10   5.7     5.2     0.5     13.4    80.9    194     2582225 2582225 0.002%  14
GCA_927399515.1.fa      fungi_odb10     97.7    96.2    1.5     0.3     2.0     758     2582225 2582225 0.002%  14
GCA_927399515.1.fa      basidiomycota_odb10     96.4    94.9    1.5     0.7     2.9     1764    2582225 2582225 0.002%  14

I tried to reproduce the results with another dataset (mMelMel3), but it seems it only runs for the first lineage:

GCA_922984935.2.subset.unmasked-carnivora_odb10-busco/
GCA_922984935.2.subset.unmasked-carnivora_odb10-busco.batch_summary.txt

So, I will try again. I did not change anything except for removing samtools/view module and also updated lineages_path to a channel:

// Channel of busco ch_lineages
ch_lineages_path = Channel.fromPath(params.busco_lineages_path)
muffato commented 1 year ago

This is the command that I run

nextflow run src/blobtoolkit -profile singularity --fasta ~mm49/nfs/scratch123/nextflow/GCA_927399515.1.fa --outdir btk_out --busco_lineages_path /lustre/scratch123/tol/resources/busco/v5/ --taxon 'Laetiporus sulphureus'

As you can see, I'm using a Fasta file directly. This is because I've locally removed the input_check subworkflow (I only wanted to test BUSCO). You can see my changes at https://github.com/sanger-tol/blobtoolkit/commit/52196f103411d1597983857e56ff75e1ea80fbd8 . You can check this branch out, it's exactly what I used (and it worked). Note: this commit is based off https://github.com/sanger-tol/blobtoolkit/pull/28/commits/afd328b6082db6e4639321f7b9a73cb55a5bda19 and doesn't include https://github.com/sanger-tol/blobtoolkit/pull/28/commits/905139b7f0beff99cd4df8002ab1ae05c3e246eb

alxndrdiaz commented 1 year ago

This is the command that I run

nextflow run src/blobtoolkit -profile singularity --fasta ~mm49/nfs/scratch123/nextflow/GCA_927399515.1.fa --outdir btk_out --busco_lineages_path /lustre/scratch123/tol/resources/busco/v5/ --taxon 'Laetiporus sulphureus'

As you can see, I'm using a Fasta file directly. This is because I've locally removed the input_check subworkflow (I only wanted to test BUSCO). You can see my changes at 52196f1 . You can check this branch out, it's exactly what I used (and it worked). Note: this commit is based off afd328b and doesn't include 905139b

I think the issue was that I was using an incorrect busco_lineages_path = './', I will try again using:

busco_lineages_path = '/lustre/scratch123/tol/resources/busco/v5/'
muffato commented 1 year ago

Besides making this subworkflow work, there are two outstanding tasks:

muffato commented 1 year ago

The BUSCO module has been fixed !

priyanka-surana commented 1 year ago

I am a bit concerned about the current commit history. Seems like the rebase did not happen correctly. @zb32 is working on it again.

alxndrdiaz commented 1 year ago

I am a bit concerned about the current commit history. Seems like the rebase did not happen correctly. @zb32 is working on it again.

I was not sure about to push the most recent changes, but I created a local branch and will include the changes that I have made once the rebase is fixed.

alxndrdiaz commented 1 year ago

I am a bit concerned about the current commit history. Seems like the rebase did not happen correctly. @zb32 is working on it again.

Steps for busco_ar34 branch rebase, following git-rebase documentation.

from branch busco_ar34 create a new branch locally (and on GitHub):

git checkout -b busco_rebase
git pull
git branch --set-upstream-to=origin/busco_rebase busco_rebase

create local fix_dev branch

git checkout --track origin/fix_dev

go back to busco_rebase branch

git checkout busco_rebase

rebase

git rebase fix_dev busco_rebase

it fails with error:

error: Failed to merge in the changes.
Patch failed at 0001 restored busco_subworkflow branch
hint: Use 'git am --show-current-patch' to see the failed patch
Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort". 

use hint

git am --show-current-patch

hint output

commit 0e1b1df030e6849d95325ba186233b2083a722b5
Author: alxndrdiaz <ra.ramos.diaz@gmail.com>
Date:   Thu Sep 8 15:34:56 2022 -0600
    restored busco_subworkflow branch
.....

then check status

git status

rebase in progress; onto da56a84
You are currently rebasing branch 'busco_rebase' on 'da56a84'.
  (fix conflicts and then run "git rebase --continue")
  (use "git rebase --skip" to skip this patch)
  (use "git rebase --abort" to check out the original branch)

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    new file:   modules/local/goat_taxon_search.nf
    new file:   modules/nf-core/modules/busco/main.nf
    new file:   modules/nf-core/modules/busco/meta.yml
    new file:   modules/nf-core/modules/custom/dumpsoftwareversions/main.nf
    new file:   modules/nf-core/modules/custom/dumpsoftwareversions/meta.yml
    new file:   modules/nf-core/modules/custom/dumpsoftwareversions/templates/dumpsoftwareversions.py
    new file:   modules/nf-core/modules/multiqc/main.nf
    new file:   modules/nf-core/modules/multiqc/meta.yml
    new file:   modules/nf-core/modules/samtools/view/main.nf
    new file:   modules/nf-core/modules/samtools/view/meta.yml
    new file:   subworkflows/local/busco_diamond_blastp.nf

Unmerged paths:
  (use "git restore --staged <file>..." to unstage)
  (use "git add <file>..." to mark resolution)
    both added:      .github/PULL_REQUEST_TEMPLATE.md
    both added:      .github/workflows/linting.yml
    both added:      .prettierignore
    both added:      README.md
    both added:      assets/email_template.txt
    both added:      assets/samplesheet.csv
    both added:      assets/schema_input.json
    both added:      bin/tol_input.sh
    both added:      conf/modules.config
    both added:      conf/test.config
    both added:      docs/usage.md
    both added:      lib/NfcoreTemplate.groovy
    both added:      modules.json
    both added:      nextflow_schema.json
    both added:      subworkflows/local/input_check.nf
    both added:      workflows/blobtoolkit.nf

Then used git rebase --continue to go through each failing patch and fix conflict files, until git status output is:

On branch busco_rebase
Your branch and 'origin/busco_rebase' have diverged,
and have 122 and 116 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)
nothing to commit, working tree clean

Then used git pull, but found more conflicts:

On branch busco_rebase
Your branch and 'origin/busco_rebase' have diverged,
and have 122 and 116 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)

You have unmerged paths.
  (fix conflicts and run "git commit")
  (use "git merge --abort" to abort the merge)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
    both modified:   conf/test.config
    added by us:     modules/nf-core/samtools/view/main.nf
    added by us:     modules/nf-core/samtools/view/meta.yml
    both added:      subworkflows/local/busco_diamond_blastp.nf

no changes added to commit (use "git add" and/or "git commit -a")

Fixed all conflicts then used git push to update the changes in the busco_rebase remote branch:

Enumerating objects: 464, done.
Counting objects: 100% (464/464), done.
Delta compression using up to 4 threads
Compressing objects: 100% (382/382), done.
Writing objects: 100% (437/437), 45.75 KiB | 1.48 MiB/s, done.
Total 437 (delta 204), reused 1 (delta 0)
remote: Resolving deltas: 100% (204/204), completed with 10 local objects.
To https://github.com/sanger-tol/blobtoolkit.git
   ccccdd6..de9facd  busco_rebase -> busco_rebase
muffato commented 1 year ago

At the first glance, the rebase looks mostly OK. But why did you then merge busco_ar34 in it ? The whole point is to have a single, clean, branch, with all the commits and no merge in it.

But more importantly, the busco_rebase branch doesn't pass the nextflow run CI. Even locally, the test profile doesn't start with its default parameters. Can you fix it ?

Note: you rebased on top of fix_dev, which doesn't pass the linting CI. I could perhaps fix the latter and merge #35 first ? Then you'd have to rebase on top of the new dev.

alxndrdiaz commented 1 year ago

At the first glance, the rebase looks mostly OK. But why did you then merge busco_ar34 in it ? The whole point is to have a single, clean, branch, with all the commits and no merge in it.

But more importantly, the busco_rebase branch doesn't pass the nextflow run CI. Even locally, the test profile doesn't start with its default parameters. Can you fix it ?

Note: you rebased on top of fix_dev, which doesn't pass the linting CI. I could perhaps fix the latter and merge #35 first ? Then you'd have to rebase on top of the new dev.

muffato commented 1 year ago

Like Priyanka said, ignore Zaynab's for now, and just rebase your own changes onto dev. The key is that nextflow run . -profile test,singularity without any other command-line arguments must still run without any problem on your branch. That'll be the first condition for merging the new branch.

alxndrdiaz commented 1 year ago

Like Priyanka said, ignore Zaynab's for now, and just rebase your own changes onto dev. The key is that nextflow run . -profile test,singularity without any other command-line arguments must still run without any problem on your branch. That'll be the first condition for merging the new branch.

I will close this PR and will open a new one with the rebased branch.