morrislab / qapa

RNA-seq Quantification of Alternative Polyadenylation
GNU General Public License v3.0
41 stars 10 forks source link

qapa quant / create_merged_data.R - Error splitting name field when gene IDs have _PAR_Y suffix #49

Open SamBryce-Smith opened 1 year ago

SamBryce-Smith commented 1 year ago

Hi,

I was following the workflow to create the 'standard' reference library, but instead starting from an updated gencode reference GTF (human v40) to define initial 3'UTR regions. I followed the suggested workflow (very nicely detailed btw!) as described and the build & salmon index/quant steps worked flawlessly.

However, when running qapa quant, I came across the following error:

Separating Ensembl IDs
Error in `[.data.table`(df, , `:=`(c("Transcript", "Gene", "Species",  :
  Supplied 9 columns to be assigned 11 items. Please see NEWS for v1.12.2.
Calls: separate_ensembl_field -> [ -> [.data.table
Execution halted

Digging into my outputs a little further, I realised this comes from rare gene IDs in the 'Name' field of quant.sf files that contain a '_PAR_Y' suffix. e.g.

ENST00000355432_ENSG00000198223.18_PAR_Y,ENST00000381529_ENSG00000198223.18_PAR_Y,ENST00000494969_ENSG00000198223.18_PAR_Y_hsa_chrY_13094011309921+_utr_1309868_1309921::chrY:1309401-1309921(+)

The extra underscores in PARY lead to a few extra columns being produced from the `strsplit('')` call which causes the error.

Adding a step to e.g. replace underscores with full stops fixes the bug.

# Replace _PAR_Y suffix of gene IDs if applicable - since split by '_', need to replace first to avoid a parsing error (but still retain distinct ID)
    # Format Ensembl Transcript and Ensembl Gene IDs if there are multiple
    # Remove utr tag
    df[, Transcript := str_replace_all(Transcript , "_PAR_Y", ".PARY") %>%
           str_extract(tx_pattern) %>%
           format_multi_ensembl_ids() %>%
           #str_replace("_(hg\\d+|mm\\d+|unk)", "") %>%
           str_replace("_utr", "")]

Output of create_merged_data.R looks good for these events e.g. SLC25A6:

Transcript Gene Gene_Name Chr LastExon.Start LastExon.End Strand UTR3.Start UTR3.End Length ctl_1
ENST00000381401 ENSG00000169100 SLC25A6 chrX 1386139 1386759 - 1386139 1386601 462 159.140895
ENST00000381401 ENSG00000169100 SLC25A6 chrX 1386251 1386759 - 1386251 1386601 350 0
ENST00000381401 ENSG00000169100 SLC25A6 chrX 1386293 1386759 - 1386293 1386601 308 450.639916
ENST00000381401 ENSG00000169100 SLC25A6 chrY 1386151 1386759 - 1386151 1386601 450 41.407935

Although the gene_id suffix is lost in the output table, providing the table as input to compute_pau.R still produces sensible PAU estimates for these genes (i.e. PAU is computed separately for the non-suffixed and suffixed genes respectively). The NumEvents should maybe be adjusted for the different chromosomes but I didn't get around to trying to fix that.

APA_ID Gene_Name Num_Events Transcript Gene Chr LastExon.Start LastExon.End Strand UTR3.Start UTR3.End Length ctl_1.PAU
ENSG00000169100_1_D SLC25A6 4 ENST00000381401 ENSG00000169100 chrX 1386139 1386759 - 1386139 1386601 462 26.098
ENSG00000169100_2_D SLC25A6 4 ENST00000381401 ENSG00000169100 chrX 1386251 1386759 - 1386251 1386601 350 0
ENSG00000169100_3_P SLC25A6 4 ENST00000381401 ENSG00000169100 chrX 1386293 1386759 - 1386293 1386601 308 73.902
ENSG00000169100_4_S SLC25A6 4 ENST00000381401 ENSG00000169100 chrY 1386151 1386759 - 1386151 1386601 450 100

Hope this is clear. Happy to share more tables, information or submit a PR if you wish :) Just a few more general comments:

Thanks, Sam

kcha commented 1 year ago

Hi Sam,

Thanks for bringing this up. It looks like you identified something similar to a related issue: https://github.com/morrislab/qapa/issues/13#issuecomment-1366899707. The difference being I only provided an interim solution. A PR would be welcome!

For choice of Gencode set, I don't really have a great answer other than it was easier to deal with a simpler annotation set at the time.