sanger-tol / blobtoolkit

Nextflow pipeline for BlobToolKit for Sanger ToL production suite
https://pipelines.tol.sanger.ac.uk/blobtoolkit
MIT License
10 stars 0 forks source link

Coverage tsv #38

Closed zb32 closed 1 year ago

zb32 commented 1 year ago

Draft PR for the count_buscogenes and coverage_tsv modules and their integration into the pipeline commenting out anything after BUSCO in busco_diamond_blastp. BUSCO seems to break the CI testing.

PR checklist

priyanka-surana commented 1 year ago

I would recommend splitting the PR for the 2 modules. The count_busco_genes may fit better with busco subworkflow and coverage_to_tsv in the coverage_stats subworkflow. All your other modules - fasta_windows, mosdepth and create_bed should fit in the coverage_stats subworkflow. If you are going to diverge from the original structure then explain why? Execution advantage, efficiency, logical flow, etc.

BUSCO does not work for GitHub CI testing because it needs a database which is not on S3. For this pipeline, we have to move exclusively to Sanger CI testing that Guoying wrote. But this can be a separate PR.

Not sure why the coverage_stats subworkflow is called within the busco subworkflow. I would have expected them separate but integrated via the workflow script. Consider more about the structure and how to keep it simple for future changes.

zb32 commented 1 year ago

I would recommend splitting the PR for the 2 modules. The count_busco_genes may fit better with busco subworkflow and coverage_to_tsv in the coverage_stats subworkflow. All your other modules - fasta_windows, mosdepth and create_bed should fit in the coverage_stats subworkflow. If you are going to diverge from the original structure then explain why? Execution advantage, efficiency, logical flow, etc.

BUSCO does not work for GitHub CI testing because it needs a database which is not on S3. For this pipeline, we have to move exclusively to Sanger CI testing that Guoying wrote. But this can be a separate PR.

Not sure why the coverage_stats subworkflow is called within the busco subworkflow. I would have expected them separate but integrated via the workflow script. Consider more about the structure and how to keep it simple for future changes.

The coverage_stats subworkflow is called within busco because originally I tried to have the output for the output BED from coverage_stats subworkflow by calling it within the other subworkflow instead of passing the output file it via a parameter for the subworkflow but this didn't work and I must have forgot to remove the include statement. I can separate the two modules and create two different PRs for them. I can also move fasta_windows, mosdepth and create_bed back into the coverage_stats subworkflow. Originally I moved it out because both subworkflows require the BED file so I thought I could move it into the main workflow but this doesn't really having any execution or efficacy advantage to my knowledge.

priyanka-surana commented 1 year ago

Is this being built on busco_dev branch? Why the comment out? It might cause issues when we go to merge. Your code should still work as the error with busco subworkflow is more logic than syntax.

zb32 commented 1 year ago

Is this being built on busco_dev branch? Why the comment out? It might cause issues when we go to merge. Your code should still work as the error with busco subworkflow is more logic than syntax.

It's being built on busco_clean as I think I created it before busco_dev was created and whilst Alexander was still making some changes to the busco_diamond_blastp subworkflow. For the two new PRs for count_buscogenes and coverage_tsv I'll base them off the busco_dev branch.