simonlabcode / bam2bakR

2 stars 0 forks source link

Isoform stability workflow? #18

Closed aleighbrown closed 2 months ago

aleighbrown commented 3 months ago

We're interested in quantifying isoform specific stability with a custom set of 3'UTRs

If we were to tweak the bam2bakR pipeline, supplementing a custom set of isoforms as separate ‘genes’ to a standard GTF – do you think the bam2bakR + bakR pipeline could handle this data?

thank you

SamBryce-Smith commented 3 months ago

Butting into the thread with a specific question of defining the custom annotation.

Some of the custom set of 3'UTRs/polyA sites were defined using 3'end-sequencing, so we're relying on transcript annotations to generate the 3'UTR/last exon regions. A common issue from annotation is having slightly different 5'end coordinates for annotated last exons e.g. image

It seems like the quantification is done using featureCounts, so I'd anticipate slightly different annotations for the same PAS causing double-counting issues. Do you have any recommendations on how to define the regions in our case where we're just interested in the PAS?

Thanks for your help! Hopefully I've not misinterpreted or missed something from my run through the docs and code...

isaacvock commented 3 months ago

Hi all,

Am I correct to assume that in the example @SamBryce-Smith posted, you would want reads aligning to the portion unique to the top isoform (and potentially also overlapping with the region contained in both UTRs) to be assigned to that PAS, and reads aligning to the region contained in both UTRS to be assigned to the lower PAS (since these should represent 3'-end reads from that PAS)?

If this is correct, then the solution could be:

Step 1: Create a custom annotation as @aleighbrown suggested where each 3'UTR is a separate "gene" with one exon.

Step 2: Run bam2bakR providing this annotation. You will have to tune the featureCounts parameters to allow assignment of reads to multiple features. Currently the featureCounts parameters are hardcoded within the pipeline (kinda dumb on my part), so I will push a change to support tweaking these. Alternatively, this is already implemented on the future successor to bam2bakR, fastq2EZbakR, which despite what the README currently says is stable and nearing an official release, so you could also use that pipeline instead. The config parameter in fastq2EZbakR that I will add to bam2bakR soon should look like:

fc_exons_extra: "--nonOverlap 0 -O"

--nonOverlap 0 won't assign a read to a feature for which any of the read nucleotides don't overlap with the feature. -O allows for multiple feature assignments. One thing to note is that soft-clipped nucleotides are counted as non-overlapping, so it's advisable to turn off soft-clipping in the aligner you are using, or loosen the --nonOverlap cutoff a bit (e.g., to 5) to allow a bit of soft-clipping, with the downside of potential false positive multi-feature assignments.

Step 3: Parse the XF column of the output cB file. You would have to write a custom function to do this which would take as input your custom gtf and your cB. This might look something like:

### Load dependencies
library(dplyr)
library(rtracklayer)
library(readr)

### Load data
cB <- read_csv("path/to/cB.csv.gz")
gtf <- as_tibble(rtracklayer::import("path/to/custom/annotation.gtf"))

### Function to parse feature assignment
parse_PAS <- function(XF, gtf){

  if(grepl("\\+", XF)){

    PASs <- strsplit(XF, "+")

    ### Logic:
    # If read is contained within multiple UTRs, assign it to the UTR
    # that ends before all other UTRS, as it is the most likely origin
    # of that read in a 3'-end sequencing dataset.
    # This is either the UTR with the maximum start coordinate if strand == "-"
    # or the minimum end coordinate if strand == "+".
    XF <- gtf %>%
      dplyr::filter(gene_id %in% PASs &
                      (strand == "-" & start == max(start)) |
                      (strand == "+" & end == min(end))) %>%
      dplyr::select(gene_id) %>%
      dplyr::arrange(gene_id) %>%
      unlist() %>%
      unname()

    # if multiple UTRs with same PAS, arbitrarily take first alphabetically;
    # that's what the arrange() step above was for.
    XF <- XF[1]

  }

  return(XF)

}

### Cute trick to speed things up:
###   only perform parsing on set of XF entries that require parsing.
###   Otherwise you will have to go row by row through the entire cB
###   which could take hours
PAS_dict <- cB %>%
  dplyr::select(XF) %>%
  na.omit() %>%
  distinct() %>%
  rowwise() %>%
  mutate(PAS = parse_PAS(XF))

### Add PAS assignment to cB
cB <- cB %>%
  inner_join(PAS_dict,
             by = "XF")

### bakR only accepts XF feature assignment column, so rename
cB <- cB %>%
  mutate(XF = PAS) %>%
  group_by(sample, XF, TC, nT) %>%
  summarise(n = sum(n))

Let me know if you have any questions or if this doesn't sound like a viable solution for your use case, Isaac

aleighbrown commented 3 months ago

Sweet that's a very comprehension answer, and looks very do-able.

From my understanding of bakR we want to still include the 'normal' transcriptome in the quantification, as well as our custom 'gene/UTRs' rather than just quantifying on the 'novel bits'?

Admittedly my next q is being asked without deep reading of docs or code, but do the Gencode GTFs work ok? They have a slightly definitely terminology for protein coding / nc gene tags

isaacvock commented 3 months ago

To your first question, you should be able to only assign reads to the custom UTRs, as long as you expect there to be decent read coverage at these UTRs to allow for accurate estimation of s4U incorporation rates. If you include a full standard annotation, you will have to do a bit of extra work deconvolving which gene IDs correspond to the PASs you want to assign to and which correspond to regular genes from the original full annotation, so it will actually be easier if you only assign reads to the UTRs of interest.

To your second question, I think we have collaborators that use the Genecode GTFs, so it should be fine. bam2bakR only requires the following things in your gtf:

  1. A column named gene_id
  2. A column named "type" with at least values of "gene" and "exon" (other values can exist, they just won't be used).
  3. A column named seqnames, whose values should match the chromosome names as they appear in your bam files. That is to say, you can't align to a genome from ensembl where the chromosome names don't have the string "chr" at the start, and then provide bam2bakR with a Gencode annotation that does have "chr" at the start of the chromosome names.
  4. Standard start, end, width, and strand columns that should exist in all annotations.

Differences in gene IDs for different types of genes shouldn't cause any problems.