iRNA-COSI / APAeval

Community effort to evaluate computational methods for the detection and quantification of poly(A) sites and estimating their differential usage across RNA-seq samples
MIT License
13 stars 14 forks source link

feat(Execution workflows / Utils): making RefSeq annotations work for QAPA and PAQR. #457

Open mrgazzara opened 1 year ago

mrgazzara commented 1 year ago

In trying to run the more conservative RefSeq annotation on certain tools, I've run in to an issue where the tools assume specific attributes (e.g. Ensembl type attributes for QAPA or Ensembl/Gencode for PAQR) to be present in order to build their annotation.

I need to check a bit deeper to make sure I gather all the requirements. Not sure if it is best to make a Util script to transform a RefSeq gtf so it contains theses required Ensembl/Gencode attributes or if we should add an option within each EWF to handle this.

dominikburri commented 1 year ago

Hi @mrgazzara, I try to look into PAQR, but it's not quite clear yet what the adaptation of refseq to gencode needs to look like. The thing I find in the config is this: https://github.com/iRNA-COSI/APAeval/blob/50fb980e76b326ae0daae0bf9f1783886b26fc66/execution_workflows/PAQR/config/config.PAQR.yaml#L43 which is for the gencode annotation and could be replaced by gbkey (=mRNA) or gene_biotype (but this is only on type, col 3 in gff). Those seem to be the replacement keys.

But you mentioned that you would need to change code within. Can you specify what exactly? What is assumed in the gtf that is not present in the refseq annotation?

What I see in tandem pas, is that it looks for gene_id: https://github.com/zavolanlab/tandem-pas/blob/98478da7f0f2cb15e8df777a405840261ed055f0/scripts/mz-select-pas-subset.pl#L130 which is not available in refseq. It could be replaced by ID=gene- though - assuming gff format. Or maybe by checking column 3 (gene, exon, CDS, mRNA/transcript [refseq,gencode]). transcript_id is available in both gencode and refseq.

ninsch3000 commented 1 year ago

Hey @mrgazzara, For PAQR I can confirm what @dominikburri writes above, the config.PAQR.yaml has to be adapted ("transcript_type" for GENCODE, "transcript_biotype" for ensembl). I couldn't find the refseq annotation file you're using in APAeval, but just searching for one myself I came across a gtf here. At a first glance, the fields "gene_id" and "transcript_biotype" seem to be present, so we should theoretically be able to adapt PAQR.

Could you maybe specify in more detail what seems to be the problem? Or provide your refseq annotation somewhere?

mrgazzara commented 1 year ago

Sorry for the delays. Getting back to APAeval now.

The RefSeq annotations Farica and I used for running/re-running the other tools were from UCSC genome browser. Taking another look they were actually the "refGene" annotations, which seem to be RefSeq but just for annotated protein coding and non-coding genes (accessions with NM* and NR*, see documentation here. These files can be downloaded through UCSC ftp for human hg38.refGene.gtf.gz or mouse mm10.refGene.gtf.gz.

This file format seemed to be to most closely match the ones DaPars used as their recommended input. I can try the file CJ linked to to see if it works with PAQR / QAPA (I think it might) and we can either use this as is, subset this file, or adapt the refGene gtf to have the necessary fields.