sanjaynagi / AmpSeeker

A snakemake workflow for amplicon sequencing
https://sanjaynagi.github.io/AmpSeeker/

igv notebook and renaming #14

Closed sanjaynagi closed 1 year ago

sanjaynagi commented 1 year ago

In this PR, I added a new notebook IGV-explore.ipynb which loads BAM files into IGV viewer, allowing the user to visually explore reads in more detail. I had never done this before.

To pass parameters to the notebook, I have used the program papermill (you cannot pass arguments to jupyter notebooks normally). This is done by adding the cell tag 'parameters' to the first cell of the notebook, and then passing those parameters to papermill.
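
A minimal sketch of that mechanism, using only the standard library. The parameter names (`bam_path`, `region`) and file names are hypothetical and not taken from IGV-explore.ipynb. It builds a notebook whose first cell carries the `parameters` tag, and the matching `papermill` command line with its `-p name value` option:

```python
import json


def make_parameterised_notebook(defaults):
    """Build a minimal nbformat-4 notebook whose first cell is tagged 'parameters'.

    `defaults` maps (hypothetical) parameter names to their default values.
    """
    source = "\n".join(f"{name} = {value!r}" for name, value in defaults.items())
    first_cell = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": ["parameters"]},  # the tag papermill looks for
        "outputs": [],
        "source": source,
    }
    return {"cells": [first_cell], "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


def papermill_command(notebook_in, notebook_out, params):
    """Return the papermill CLI call as an argument list, one '-p' per parameter."""
    cmd = ["papermill", notebook_in, notebook_out]
    for name, value in params.items():
        cmd += ["-p", name, str(value)]
    return cmd


if __name__ == "__main__":
    nb = make_parameterised_notebook({"bam_path": "sample.bam", "region": "2L"})
    print(json.dumps(nb, indent=1))
    print(" ".join(papermill_command("in.ipynb", "out.ipynb", {"bam_path": "other.bam"})))
```

The values in the tagged cell act as defaults; papermill adds a new cell after it holding whatever was passed on the command line.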

@ChabbyTMD , @eddUG I think when we develop analytical scripts, we should use this papermill/notebook method. This is cool because we can write some accompanying text in the notebooks that makes sense of it, and can convert those notebooks directly to HTML to share results with others.

I've added a few tools to the Conda env (AmpSeq.yaml), and had to change a few other small things in the workflow. The user now provides a single reference file, rather than both the amplicon-only and whole-genome references.

will close #13

sanjaynagi commented 1 year ago

I've also renamed a few of the rule files (.smk) and added a new one (analysis.smk)

sanjaynagi commented 1 year ago

@eddUG tbh, the conda stuff was driving me mad and is very confusing: an environment will build perfectly locally but fail in GitHub Actions. So I ended up splitting the conda env into two, one for command line tools and one for python-based analyses, which works well for now.

Papermill should hopefully be v2 now, but it shouldn't matter, and I have removed the snakemake dependency. I also don't quite understand what defaults/nodefaults will do...

Will merge this PR once @ChabbyTMD has had a look.

ChabbyTMD commented 1 year ago

Hi @sanjaynagi, I tried testing this branch on the UVRI cluster and I'm hitting an issue with missing input files. I think it has to do with the amended config files, which reference locations on your local device. See below:

reference_fasta: /home/sanj/projects/AgamDao/AmpSeq2023/resources/reference/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa
reference_gff3: /home/sanj/projects/AgamDao/AmpSeq2023/resources/reference/Anopheles-gambiae-PEST_BASEFEATURES_AgamP4.12.gff3

MissingInputException in line 3 of /home/tmugoya/PIPELINE/Ampseek2/AmpSeeker/workflow/rules/alignment_variantcalling.smk:
Missing input files for rule reference_index:
/home/sanj/projects/AgamDao/AmpSeq2023/resources/reference/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa

As a result the pipeline is failing on the UVRI cluster.
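
A quick way to spot this before launching the workflow is to check that each configured path exists on the current machine. This is only a sketch: the key names come from the config lines quoted above, while the helper `missing_inputs` and the example paths are hypothetical.

```python
import os


def missing_inputs(config, keys=("reference_fasta", "reference_gff3")):
    """Return the subset of configured paths that do not exist on this machine."""
    return {
        key: config[key]
        for key in keys
        if key in config and not os.path.exists(config[key])
    }


if __name__ == "__main__":
    # Hypothetical config values, for illustration only.
    config = {
        "reference_fasta": "/no/such/dir/genome.fa",
        "reference_gff3": os.getcwd(),  # stands in for a path that does exist
    }
    for key, path in missing_inputs(config).items():
        print(f"{key}: {path} not found")
```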

sanjaynagi commented 1 year ago

Indeed, due to IGV we now need to supply a gff3 file to the pipeline.

The AgamP4 gff can be downloaded here: https://vectorbase.org/common/downloads/Current_Release/AgambiaePEST/gff/data/VectorBase-61_AgambiaePEST.gff

I've supplied a gff3 myself, though, so hopefully either format will work.

ChabbyTMD commented 1 year ago

Yes I do see the gff3 file. What I'm not sure about is the reference being used. Are we still using the amplicon reference as before?

eddUG commented 1 year ago

@ChabbyTMD It seems that in this pull request, we are utilizing the whole genome reference. You can check the reference_type value on this page: https://github.com/sanjaynagi/AmpSeeker/blob/igv-notebook-and-smk-renaming-02-03-23/config/config.yaml

@sanjaynagi, could you modify the reference_fasta and reference_gff3 paths from local to repo?

sanjaynagi commented 1 year ago

We still have the option to use either reference. Previously I had used both references (modifying the name of the results folder so I could re-run the pipeline). I'm pretty sure the IGV notebook will work with either; only the coordinate system will change.

Just modified the paths in the config to make them relative.
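
As an illustration of that kind of change (not the actual commit), a small sketch that rewrites absolute config paths so they are relative to the project root. The helper name `relativise_paths` and the example paths are hypothetical, and the example output assumes POSIX-style paths.

```python
import os


def relativise_paths(config, project_root, keys=("reference_fasta", "reference_gff3")):
    """Return a copy of config with absolute paths rewritten relative to project_root.

    Paths outside project_root come back with leading '..' components,
    which is a hint that the file should be moved or downloaded into the repo.
    """
    out = dict(config)
    for key in keys:
        path = out.get(key)
        if path and os.path.isabs(path):
            out[key] = os.path.relpath(path, start=project_root)
    return out


if __name__ == "__main__":
    # Hypothetical locations, for illustration only.
    config = {"reference_fasta": "/home/user/project/resources/reference/genome.fa"}
    print(relativise_paths(config, "/home/user/project"))
```

Relative paths like these resolve against the directory the workflow is run from, so they only work when the referenced files sit at the same place inside each checkout.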

sanjaynagi commented 1 year ago

Have just removed a bunch of old scripts and unused folders (report, schemas) in the workflow/ folder to simplify what's in the repo.