Community effort to evaluate computational methods for the detection and quantification of poly(A) sites and estimating their differential usage across RNA-seq samples
The QAPA execution workflow was merged during the hackathon, but our standards for execution workflows have been consolidated since then and the QAPA workflow is now out of date. Implementing the changes below will bring QAPA in line with the rest of the workflows.
Workflow should generate custom annotations from provided reference GTF and BED file(s) of polyA sites
Annotation pre-processing is quite clearly documented in the QAPA README, so this shouldn't be too tricky to implement. It seems that all annotation tables can be generated from a reference GTF, so I think we should start with that. PolyA site annotations need to be provided as BED files, but passing different sources to qapa build is a bit convoluted, so I think we should have a separate key in the config file for each source.
To my mind the following steps are required:
Generate 'gene metadata' table from input GTF.
This is an uncompressed, 5-column, tab-separated file, and all of its columns can be obtained from a reference GTF. The QAPA README suggests downloading it from Biomart, but I think it's straightforward for us to generate it from the GTF that the workflow already needs as input, so I suggest we do that. The human example file provided with QAPA shows the expected layout.
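As a rough illustration of this step, here is a minimal Python sketch that pulls five attributes from the transcript records of a GTF and writes them as a tab-separated table. The column names and their order (gene ID, transcript ID, gene type, transcript type, gene name) are my assumption and should be checked against the QAPA example file. The attribute keys follow GENCODE naming (Ensembl GTFs use gene_biotype / transcript_biotype instead), the function names are hypothetical, and whether QAPA expects a header row is not covered here.

```python
import csv
import re

# ASSUMPTION: column order to be verified against the QAPA example file.
# Keys follow GENCODE GTF attribute names.
COLUMNS = ("gene_id", "transcript_id", "gene_type", "transcript_type", "gene_name")

# Matches quoted GTF attributes such as: gene_id "ENSG00000000001";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Return the quoted key/value pairs of a GTF attribute column as a dict."""
    return dict(ATTR_RE.findall(field))


def gene_metadata_rows(gtf_lines):
    """Yield one 5-tuple per 'transcript' feature found in the GTF lines."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = parse_attributes(fields[8])
        # Missing attributes are written as empty strings.
        yield tuple(attrs.get(key, "") for key in COLUMNS)


def write_gene_metadata(gtf_path, out_path):
    """Write the uncompressed, tab-separated gene metadata table."""
    with open(gtf_path) as gtf, open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerows(gene_metadata_rows(gtf))
```

In the workflow this would run once per reference GTF, not once per sample.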
Generate 'gene prediction' (genePred) table
GenePred files can be generated from an input GTF file with the UCSC tool gtfToGenePred, as described in the QAPA instructions.
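A small sketch of how the workflow might wrap this call is shown below. The -genePredExt flag is what I recall the QAPA README using, so treat it as an assumption and confirm it there; the function names are hypothetical.

```python
import subprocess


def gtf_to_genepred_cmd(gtf_path, genepred_path):
    """Build the argument list for UCSC gtfToGenePred.

    ASSUMPTION: -genePredExt (extended genePred output) is the flag used in
    the QAPA instructions; verify before relying on it.
    """
    return ["gtfToGenePred", "-genePredExt", gtf_path, genepred_path]


def run_gtf_to_genepred(gtf_path, genepred_path):
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    subprocess.run(gtf_to_genepred_cmd(gtf_path, genepred_path), check=True)
```

Keeping the command construction separate from execution makes it easy to check the command without having the UCSC tools installed.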
The key decision here is which combination of polyA site annotations to provide to qapa build, since each combination requires a different set of command-line flags. The following seem possible:
Standard workflow: provide both the PolyASite BED file and the GENCODE poly(A) track BED file (it seems both have to be provided; one cannot be given without the other)
Provide a single custom BED file of polyA sites
Provide no polyA sites and use only the reference gene/transcript models to produce alternative 3'UTR annotations.
I think we should make the standard workflow the default option. Ideally the execution workflow can be flexible to all polyA site annotation combos, but in the interests of time I think we should prioritise the standard workflow first.
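To make the three options concrete, here is a hedged sketch of how separate config keys could be mapped to a qapa build command. The flags (--db, -g, -p, -o, -N) are as I remember them from the QAPA README and should be verified against qapa build --help; the function and parameter names are hypothetical.

```python
def qapa_build_cmd(gene_metadata, genepred, polyasite_bed=None,
                   gencode_polya_bed=None, custom_bed=None):
    """Build a 'qapa build' argument list from the polyA site inputs.

    ASSUMPTION: flag names are recalled from the QAPA README, not verified.
    As far as I recall, qapa build writes the 3'UTR BED to stdout, so the
    caller is expected to redirect the output to a file.
    """
    cmd = ["qapa", "build", "--db", gene_metadata]
    if custom_bed is not None:
        # Option 2: a single custom BED replaces the two standard sources.
        if polyasite_bed is not None or gencode_polya_bed is not None:
            raise ValueError("custom_bed cannot be combined with the standard BED files")
        cmd += ["-o", custom_bed]
    elif polyasite_bed is not None and gencode_polya_bed is not None:
        # Option 1 (default): standard workflow with both annotation sources.
        cmd += ["-g", gencode_polya_bed, "-p", polyasite_bed]
    elif polyasite_bed is None and gencode_polya_bed is None:
        # Option 3: no polyA sites, rely on gene/transcript models only.
        cmd.append("-N")
    else:
        raise ValueError("PolyASite and GENCODE BED files must be provided together")
    cmd.append(genepred)
    return cmd
```

Rejecting a half-specified standard workflow up front gives a clearer error than letting qapa build fail later.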
Pass extracted 3'UTRs (output of QAPA build) into the rest of the workflow as already implemented.
README & sample-sheet updates
README needs a general update, including but not limited to details on prerequisites, qualifying APAeval challenges, pipeline parameter descriptions, running instructions & citation. @faricazjj's Nextflow workflows (e.g. DaPars) are excellent examples of what we're looking for. References to manually installing QAPA should also be removed (installation will be covered by Dockerfiles, see the section below).
Sample sheet should only contain the sample names and paths to FASTQ files. BAM & BAI files are never consumed by QAPA, so they should be removed entirely. Reference files such as GTFs and polyA site BEDs are shared across all samples in the sample sheet, so they should be specified once, as universal options, in the pipeline's general config to minimise duplication.
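A minimal sketch of a check that enforces this split is given below. The column names (sample, fastq1, fastq2) are hypothetical placeholders, not the workflow's actual sample-sheet schema.

```python
import csv

# HYPOTHETICAL column names; the real sample sheet may name these differently.
ALLOWED_COLUMNS = {"sample", "fastq1", "fastq2"}


def read_sample_sheet(handle):
    """Read a CSV sample sheet and reject per-sample columns that do not belong.

    BAM/BAI paths and reference files (GTF, polyA site BEDs) are expected in
    the general config, so any such column here raises a ValueError.
    """
    reader = csv.DictReader(handle)
    unexpected = set(reader.fieldnames or []) - ALLOWED_COLUMNS
    if unexpected:
        raise ValueError(f"unexpected sample sheet columns: {sorted(unexpected)}")
    return list(reader)
```

The same check could live in the workflow's input validation step so that stale sample sheets with BAM columns fail early.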
Missing Dockerfile in repository
Already reported by @dominikburri in #194, and a Dockerfile has been added in https://github.com/iRNA-COSI/APAeval/pull/196, which relates to updating the Nextflow template. If this PR cannot be merged quickly, it would be better to split the branch into a commit that adds only the Dockerfile and the commits for the template (this is a little tricky to do from memory). That way the Dockerfile can be added more cleanly and #194 closed.