Open SamBryce-Smith opened 1 year ago
Hi Sam,
Thanks for bringing this up. It looks like you have identified something similar to a related issue: https://github.com/morrislab/qapa/issues/13#issuecomment-1366899707. The difference is that I only provided an interim solution there. A PR would be welcome!
For choice of Gencode set, I don't really have a great answer other than it was easier to deal with a simpler annotation set at the time.
Hi,
I was following the workflow to create the 'standard' reference library, but starting instead from an updated Gencode reference GTF (human v40) to define the initial 3'UTR regions. I followed the suggested workflow (very nicely detailed btw!) as described, and the build and salmon index/quant steps worked flawlessly.
However, when running `qapa quant`, I came across the following error:
Digging into my outputs a little further, I realised this comes from rare gene IDs in the 'Name' field of `quant.sf` files that contain a '_PAR_Y' suffix, e.g.:
The extra underscores in `_PAR_Y` lead to a few extra columns being produced from the `strsplit('')` call, which causes the error.
Adding a step before the split that, for example, replaces the underscores with full stops fixes the bug.
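To make the failure mode and the workaround concrete, here is a rough Python sketch (not the actual R code in create_merged_data.R). The identifier layout, the made-up IDs and the helper names `sanitize_par_suffix` and `split_fields` are all assumptions for illustration; the real 'Name' field in `quant.sf` may have different fields.

```python
# Hypothetical underscore-delimited names; the real quant.sf layout may differ.
NORMAL = "ENST00000000001.1_ENSG00000000001.1_hg38_chrX_100_200_+"
PAR_Y = "ENST00000000001.1_PAR_Y_ENSG00000000001.1_PAR_Y_hg38_chrY_100_200_+"


def sanitize_par_suffix(name: str) -> str:
    """Replace the underscores in every '_PAR_Y' suffix with full stops."""
    return name.replace("_PAR_Y", ".PAR.Y")


def split_fields(name: str) -> list[str]:
    """Sanitize the suffix first, then split on underscores."""
    return sanitize_par_suffix(name).split("_")


if __name__ == "__main__":
    # A naive split yields extra fields for the suffixed name...
    print(len(NORMAL.split("_")), len(PAR_Y.split("_")))
    # ...while sanitizing first gives both names the same number of fields.
    print(len(split_fields(NORMAL)), len(split_fields(PAR_Y)))
```

Sanitizing rather than stripping the suffix keeps the suffixed and non-suffixed IDs distinguishable after the split, at the cost of a slightly altered ID string.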
Output of create_merged_data.R looks good for these events e.g. SLC25A6:
Although the gene_id suffix is lost in the output table, providing the table as input to compute_pau.R still produces sensible PAU estimates for these genes (i.e. PAU is computed separately for the non-suffixed and suffixed genes respectively). The NumEvents should maybe be adjusted for the different chromosomes but I didn't get around to trying to fix that.
Hope this is clear. Happy to share more tables, information or submit a PR if you wish :) Just a few more general comments:
Thanks, Sam