simonlabcode / bam2bakR

2 stars 0 forks source link

How to handle pro-seq Timelapse data? #6

Closed WangJingwen21 closed 1 year ago

WangJingwen21 commented 1 year ago

Dear Isaac, I'm sorry to bother you again. I don't know much about transcription pausing. Now we have employed pro-seq-Timelapse-seq in DLD1 cell line, we want to calculate dynamic parameter as you did in your paper. We understand that TSScall methods used in your pipeline are not fitful for pro-seq data, so, we should clean our bam file to get TSS specific reads before running ban2bakR. We want to define PolII pausing zone for each gene. However, we have no clue how to establish the process. Could you gave us some advice.

isaacvock commented 1 year ago

As you all are (to my knowledge) the first to develop PRO-seq + TimeLapse-seq, it's challenging for me to make detailed recommendations as to how to process and analyze your unique data. The suggestions that follow are mostly generic suggestions drawn from literature on analyzing PRO-seq data.

To determine the pausing zone, there seem to be two strategies that people have used:

  1. De novo identification of pause sites using your PRO-seq data and a tool like HOMER.
  2. Define it as within 100-200 nucleotides of the furthest upstream annotated start site. Annotations could come from resources like Ensembl or RefSeq, or could be assembled using aligned fastqs as input to tools like Scallop, StringTie, SQUANTI3, etc.

The Guertin lab also has some ideas on their website about how to call TSSs with a tool they developed, described here.

No matter how you define the pause site, bam2bakR can be used to count mutations in your sequencing reads. For example you could provide a GTF file that has information about the location of pause sites and gene bodies for each gene. The tricky thing is that bam2bakR is currently hard-coded to expect the GTF type column to have "exon" and "transcript". In the cB file, the GF column is the associated gene_id for all reads that mapped to any part of a transcript, the EF column is the gene_id for all reads that mapped to any part of an exon, and the XF column is the gene_id for all reads that mapped exclusively to exonic regions. You could thus modify your pause-site and gene-body GTF such that the annotated pause sites are labeled as type "transcript" and the gene bodies are labeled as type "exon".

If you did this, the cB file from bam2bakR would thus have:

  1. A GF value of "__no_feature" and an EF value of "" for any set of reads that mapped exclusively to the gene body
  2. A GF value of "" and an EF value of "__no_feature" for any set of reads that mapped exclusively to the pause region
  3. A GF and EF value of "" for any set of reads that mapped to both the pause region and the gene body.
WangJingwen21 commented 1 year ago

Thank you so much for your advice! I'll talk to my team.