qapa quant / create_merged_data.R - Error splitting name field when gene IDs have _PAR_Y suffix

Hi,

I was following the workflow to create the 'standard' reference library, but instead starting from an updated gencode reference GTF (human v40) to define initial 3'UTR regions. I followed the suggested workflow (very nicely detailed btw!) as described and the build & salmon index/quant steps worked flawlessly.

However, when running qapa quant, I came across the following error:

Separating Ensembl IDs
Error in `[.data.table`(df, , `:=`(c("Transcript", "Gene", "Species",  :
  Supplied 9 columns to be assigned 11 items. Please see NEWS for v1.12.2.
Calls: separate_ensembl_field -> [ -> [.data.table
Execution halted

Digging into my outputs a little further, I realised this comes from rare gene IDs in the 'Name' field of quant.sf files that contain a '_PAR_Y' suffix. e.g.

ENST00000355432_ENSG00000198223.18_PAR_Y,ENST00000381529_ENSG00000198223.18_PAR_Y,ENST00000494969_ENSG00000198223.18_PAR_Y_hsa_chrY_13094011309921+_utr_1309868_1309921::chrY:1309401-1309921(+)

The extra underscores in PARY lead to a few extra columns being produced from the `strsplit('')` call which causes the error.

Adding a step to e.g. replace underscores with full stops fixes the bug.

# Replace _PAR_Y suffix of gene IDs if applicable - since split by '_', need to replace first to avoid a parsing error (but still retain distinct ID)
    # Format Ensembl Transcript and Ensembl Gene IDs if there are multiple
    # Remove utr tag
    df[, Transcript := str_replace_all(Transcript , "_PAR_Y", ".PARY") %>%
           str_extract(tx_pattern) %>%
           format_multi_ensembl_ids() %>%
           #str_replace("_(hg\\d+|mm\\d+|unk)", "") %>%
           str_replace("_utr", "")]

Output of create_merged_data.R looks good for these events e.g. SLC25A6:

Transcript	Gene	Gene_Name	Chr	LastExon.Start	LastExon.End	Strand	UTR3.Start	UTR3.End	Length	ctl_1
ENST00000381401	ENSG00000169100	SLC25A6	chrX	1386139	1386759	-	1386139	1386601	462	159.140895
ENST00000381401	ENSG00000169100	SLC25A6	chrX	1386251	1386759	-	1386251	1386601	350	0
ENST00000381401	ENSG00000169100	SLC25A6	chrX	1386293	1386759	-	1386293	1386601	308	450.639916
ENST00000381401	ENSG00000169100	SLC25A6	chrY	1386151	1386759	-	1386151	1386601	450	41.407935

Although the gene_id suffix is lost in the output table, providing the table as input to compute_pau.R still produces sensible PAU estimates for these genes (i.e. PAU is computed separately for the non-suffixed and suffixed genes respectively). The NumEvents should maybe be adjusted for the different chromosomes but I didn't get around to trying to fix that.

APA_ID	Gene_Name	Num_Events	Transcript	Gene	Chr	LastExon.Start	LastExon.End	Strand	UTR3.Start	UTR3.End	Length	ctl_1.PAU
ENSG00000169100_1_D	SLC25A6	4	ENST00000381401	ENSG00000169100	chrX	1386139	1386759	-	1386139	1386601	462	26.098
ENSG00000169100_2_D	SLC25A6	4	ENST00000381401	ENSG00000169100	chrX	1386251	1386759	-	1386251	1386601	350	0
ENSG00000169100_3_P	SLC25A6	4	ENST00000381401	ENSG00000169100	chrX	1386293	1386759	-	1386293	1386601	308	73.902
ENSG00000169100_4_S	SLC25A6	4	ENST00000381401	ENSG00000169100	chrY	1386151	1386759	-	1386151	1386601	450	100

Hope this is clear. Happy to share more tables, information or submit a PR if you wish :) Just a few more general comments:

In the README ('GENCODE gene prediction annotation table' section) you pull from Gencode 'basic' annotation only, whereas I've used the 'comprehensive gene annotation' GTF (reference chromosomes only). Do you advise against using the comprehensive annotation for any reason(s)?
- SLC25A6 does have the 'basic' tag even for _PAR_Y (e.g. ENST00000381401.11_PAR_Y), so this wouldn't necessarily avoid the problem in this case

Thanks, Sam

morrislab / qapa

qapa quant / create_merged_data.R - Error splitting name field when gene IDs have _PAR_Y suffix #49