Execution workflow: QAPA execution workflow inconsistent with standards (input annotations, README, Dockerfile)

SamBryce-Smith commented 2 years ago

The QAPA execution workflow was merged during the hackathon, but our standards for execution workflows have been consolidated since then and the QAPA workflow is now out of date. Implementing the changes below will bring QAPA in line with the rest of the workflows.

Workflow should generate custom annotations from provided reference GTF and BED file(s) of polyA sites

Annotation pre-processing is quite clearly documented in the QAPA README, so this shouldn't be too tricky to implement. It seems like all annotation tables can be generated from a reference GTF so I think we should start with that. PolyA site annotations need to be provided as BED files but passing different sources to qapa build is a bit convoluted, so I think we should have separate keys in the config file.

To my mind the following steps are required:

Generate 'gene metadata' table from input GTF. This is an uncompressed, 5-column, tab-separated file whose columns can all be obtained from a reference GTF. The QAPA authors suggest downloading it from Biomart, but I think it's straightforward for us to generate it from the GTF the workflow already takes as input, so I suggest we do that.

[Screenshot: provided human example file]
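A minimal sketch of how the metadata table could be derived from the GTF. The column order used here (gene ID, transcript ID, gene type, transcript type, gene name) is an assumption that should be checked against the QAPA README and example file, as is the choice not to write a header line. The function names are hypothetical, and both GENCODE-style (`gene_type`) and Ensembl-style (`gene_biotype`) attribute keys are accepted.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn a GTF attribute string into a dict (first occurrence of a key wins)."""
    attrs = {}
    for key, value in ATTR_RE.findall(field):
        attrs.setdefault(key, value)
    return attrs


def gene_metadata_rows(gtf_lines):
    """Yield one 5-tuple per transcript record, in the ASSUMED column order."""
    seen = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        a = parse_attributes(fields[8])
        transcript_id = a.get("transcript_id", "")
        if transcript_id in seen:
            continue
        seen.add(transcript_id)
        yield (
            a.get("gene_id", ""),
            transcript_id,
            a.get("gene_type") or a.get("gene_biotype", ""),
            a.get("transcript_type") or a.get("transcript_biotype", ""),
            a.get("gene_name", ""),
        )


def write_gene_metadata(gtf_path, out_path):
    """Write the tab-separated table (no header line; an assumption)."""
    with open(gtf_path) as gtf, open(out_path, "w") as out:
        for row in gene_metadata_rows(gtf):
            out.write("\t".join(row) + "\n")
```

Whether QAPA expects versioned or unversioned Ensembl IDs in this table is not covered by the sketch and would need checking.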

Generate 'gene prediction' (genePred) table

GenePred files can be generated from an input GTF file, as described in the instructions, using the UCSC tool gtfToGenePred:

gtfToGenePred -genePredExt <input.gtf> <output.genePred>
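If this step is driven from a script rather than directly from the workflow, it could look like the sketch below. It only wraps the command shown above; the function names are hypothetical, and it assumes `gtfToGenePred` is available on `PATH` (for example inside the container).

```python
import shutil
import subprocess


def gtf_to_genepred_cmd(gtf_path, genepred_path):
    """Build the argument list for the gtfToGenePred call shown above."""
    return ["gtfToGenePred", "-genePredExt", str(gtf_path), str(genepred_path)]


def run_gtf_to_genepred(gtf_path, genepred_path):
    """Run the conversion, failing early if the UCSC binary is missing."""
    if shutil.which("gtfToGenePred") is None:
        raise FileNotFoundError("gtfToGenePred not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status
    subprocess.run(gtf_to_genepred_cmd(gtf_path, genepred_path), check=True)
```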

Run QAPA build to generate annotations

Key decision here is what combination of polyA site annotations to provide to QAPA build, each of which requires a different set of command line flags. It seems the following are possible:

Standard workflow - provide PolyASite BED & GENCODE poly(A) track BED file (it seems both have to be provided, can't have one or the other)
Provide a single custom BED file of polyA sites
Don't provide any polyA sites, just use reference gene/transcript models to produce alternative 3'UTR annotations.

I think we should make the standard workflow the default option. Ideally the Ewf can be flexible to all polyA site annotation combos, but in the interests of time I think we should prioritise the standard workflow first.
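The three combinations above could be mapped to a `qapa build` command roughly as follows. This is a sketch under assumptions: the flags (`--db`, `-g`, `-p`, `-o`, `-N`) are as I recall them from the QAPA README and should be verified against `qapa build -h`, and the function and parameter names are hypothetical.

```python
def qapa_build_cmd(db, genepred, polyasite_bed=None, gencode_polya_bed=None,
                   custom_bed=None):
    """Return the argument list for `qapa build` (flags are ASSUMED, see lead-in)."""
    cmd = ["qapa", "build", "--db", db]
    if custom_bed:
        # Option 2: a single custom BED file of polyA sites
        if polyasite_bed or gencode_polya_bed:
            raise ValueError("custom BED cannot be combined with other polyA BEDs")
        cmd += ["-o", custom_bed]
    elif polyasite_bed and gencode_polya_bed:
        # Option 1 (proposed default): PolyASite + GENCODE poly(A) track
        cmd += ["-g", gencode_polya_bed, "-p", polyasite_bed]
    elif polyasite_bed or gencode_polya_bed:
        raise ValueError("PolyASite and GENCODE BED files must be given together")
    else:
        # Option 3: no polyA sites, use gene/transcript models only
        cmd.append("-N")
    cmd.append(genepred)
    return cmd
```

As far as I recall, `qapa build` writes the 3'UTR BED to standard output, so the caller would redirect stdout to the output file.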

Pass extracted 3'UTRs (output of QAPA build) into the rest of the workflow as already implemented.

README & sample-sheet updates

README needs a general update, including but not limited to details on prerequisites, qualifying APAeval challenges, pipeline parameter descriptions, running instructions & citation. @faricazjj's Nextflow workflows (e.g. Dapars) are excellent examples of what we're looking for. References to manually installing QAPA should also be removed (installation will be covered by the Dockerfile; see the section below).

Sample sheet should only contain the sample names and paths to FASTQ files. BAM & BAI files are never consumed by QAPA so should be removed entirely. Reference files such as GTFs and polyA site BEDs are shared across all samples in the sample sheet, so they should be specified once as universal options in the pipeline's general config to minimise duplication.
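To illustrate the intended shape, here is a small sketch that reads a sample sheet and rejects anything beyond sample names and FASTQ paths. The column names (`sample`, `fastq1`, `fastq2`) are hypothetical and not taken from the existing workflow.

```python
import csv
import io

REQUIRED = ("sample", "fastq1")            # hypothetical column names
ALLOWED = {"sample", "fastq1", "fastq2"}   # fastq2 optional for paired-end data


def read_sample_sheet(handle):
    """Parse a CSV sample sheet, rejecting missing or unexpected columns."""
    reader = csv.DictReader(handle)
    columns = set(reader.fieldnames or [])
    missing = [c for c in REQUIRED if c not in columns]
    extra = sorted(columns - ALLOWED)
    if missing or extra:
        raise ValueError(f"missing columns: {missing}; unexpected columns: {extra}")
    return list(reader)


# Usage: only per-sample information lives in the sheet
sheet = io.StringIO("sample,fastq1,fastq2\ns1,s1_R1.fastq.gz,s1_R2.fastq.gz\n")
samples = read_sample_sheet(sheet)
```

Shared inputs (GTF, polyA site BEDs) would then be read once from the pipeline config rather than repeated on every row.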

Missing Dockerfile in repository

Already reported by @dominikburri in #194. A Dockerfile has been added in https://github.com/iRNA-COSI/APAeval/pull/196, which relates to updating the Nextflow template. If this PR cannot be merged quickly, it would be better to split the branch into a commit that adds only the Dockerfile and the commits for the template (from memory, this is a little tricky to do). That way the Dockerfile can be added more cleanly and #194 closed.

dominikburri commented 2 years ago

Will be fixed in #196, right @yuukiiwa?

SamBryce-Smith commented 2 years ago

This has been addressed in #196 so closing the issue
