t-neumann / slamdunk

Streamlining SLAM-seq analysis with ultra-high sensitivity
GNU Affero General Public License v3.0
38 stars 23 forks source link

stranded library alignment #149

Closed ZiggeyQi closed 6 months ago

ZiggeyQi commented 6 months ago

Hi Tobias:

In my experiment(1mM 4sU pulse for 30 min), before library construction, all the steps followed the standard protocol of SLAMseq, after total RNA extraction the rRNAs were removed from the 4sU lebleled total RNA, then the RNA library was constructed by NEBNext® UltraTM RNA Library Prep Kit for Illumina®(a dUTP drived stranded-specific library kit), sequencing were pair-end. Due to the single end reads alignment and the splicing unawared alignment of NGM, i have to use hisat3N to align the reads to genome. Now, the problem is that based on my library construction stratage what base conversion should i set(T2C or A2G), because only the 1th synthesised cDNA strand will sequenceing(R2 reads were the forward orientation). I use T2C base conversion to align my data and get about 10% nascent reads(contain the matched conversion, the Yf-tag >0), but A2G base conversion aligment will get about one third nascent reads and the total mapping percentage will increasing compared to T2C conversion alignment then more aligned reads were got, well, in general how many nascent reads we can get without any experimental treatment. The final nascent reads were got by in house script which will filter the VarScan2 found snp(T2C or A2G) site as a conversion event. Because one of my group treated with the transcription inhibitor, so the nascent reads should be less than control group, the final BAM file size is in line with our anticipation, but when i use those BAM to profile the reads onto all the protein coding genes(TSS-3K~TES-5K), the reads ditribution of T2C conversion alignment somewhat match the expectation but the A2G conversion aligment even totally reversed but the BAM file size was matched to the expectation. Due to the absent of 4sU unlabeled sample were prepared for sequencing, I don't know the results now i get is correct or not. I don‘t konw should i put the question here or not, those results frustrated me a lot, could you give my some advice about how should i handle my SLAM sequencing data to get the final nascent reads and reads' transcribing orientation, or there is might some mistakes in my experimental steps in SLAMseq resultting the above weird results. Looking forward to your reply.

Best, Ziggey

isaacvock commented 6 months ago

Hi Ziggey,

I personally struggle to understand several of the details you list regarding your comparison of the two alignment strategies, but I can confidently say that if the hisat-3n documentation is to be trusted, you should be setting --base-change T,C in your call to the index building utility, not A,G.

Some of your observations may be explained by an increased misalignment rate when mapping to an A,G converted genome, since it is a lower complexity genome without the advantage of removing penalization for expected T-to-C mismatches. I recognize that you may be confused by the fact that you expect both A-to-G and T-to-C mutations in your reads due to one read representing the original RNA sequence and the other its reverse complement, but my understanding is that this is accounted for by hisat-3n when setting base-change to T,C.

Best, Isaac

ZiggeyQi commented 6 months ago

Hi Isaac,

Thanks for your kindly reply, it do have the posibility that the increased mismatch of A2G conversion mapping to the lower complexity genome, so the A2G mapping rating increased. When i eyes on the reads mapping quality, all the reads mapping quality were 1 or 60, no any other values appeared, i don't know it's normal or not. Filtering the reads whose mapping quality lower than 30, both T2C and A2G conversion alignment get about 45% reads from the total mapped reads, there is no significant difference between the global mapping quality of T2C and A2G conversion alignment.

Hisat-3N takes the reverse strand A2G converion into account as the T2C conversion event, but my query is that based on the principle of library construction and sequencing, if the library is non-stranded and sequencing is paired end, the R1 reads should be the forword orientation, the R2 reads are the reverse orientation, but in my library the 2th synthsised cDNA marked by the dUTP was digested, only the 1th synthesised cDNA strand will sequenceing, so the R2 reads of my data are the forward orientation and the R1 reads are the reverse orientation. Based on the illustration provided in the SLAMseq method devolopment paper, the 1th synthesised cDNA strand contain the A2G conversion, the 2th synthesised cDNA strand contain the T2C conversion, after PCR the sequencing process will get the T2C conversion, but the 2th synthesised cDNA strand was removed in my library, then the 1th synthesised cDNA strand was sent to PCR and sequencing. image Dose those discrepancy will affect the mapping strategy selection, and i don't known the 3'-sequencing is stranded or not in the standarded SLAMseq protocol, if the 3'-sequencing is also stranded library, and the 1th synthsised cDNA strand will sequencing, then i should choose the T2C conversion alignment. I read the instruction of the 3'-sequencing library kit, but i'm still not sure it's stranded or not.

Best, Ziggey

isaacvock commented 6 months ago

Hi Ziggey,

Thank you for your clarification, I am starting to better understand your queries, though I still am a bit confused.

if the library is non-stranded and sequencing is paired end, the R1 reads should be the forword orientation, the R2 reads are the reverse orientation

This is incorrect. If the library is unstranded, then about 50% of the time the R1 reads will represent the "forward orientation" (original sequence of the RNA" and 50% of the time they will represent the "reverse orientation" (the reverse complement sequence of the RNA). In either case, hisat-3n does not care and is completely unimpacted by the strandedness of your library. Thus the answer to your question of if your library prep (which is highly standard and equivalent to what our lab has been using for many years) "will affect the mapping strategy selection" is no.

Maybe I am misunderstanding what you mean when you mention "mapping strategy" though. While alignment is completely unimpacted by the strandedness of your library, one thing that could be influenced by the points you note is the mutation counting performed by SLAMDUNK. Strandedness will impact whether a mutation is called a T-to-C mutation or an A-to-G mutation, as the strandedness will be used to decide whether or not an observed mismatch in the aligned read needs to be complemented to represent the original RNA sequence or not. The 3'-end Quant-seq library preparation is stranded, but the inverse of what your strandedness is. That is, the first read (though usually it is single-end so you could equivalently say the only read) is in the forward orientation, as mentioned in #14. The suggestion given in that issue is to reverse complement your reads before using SLAMDUNK. In your case, you could reverse complement them prior to alignment by hisat-3n and then provide the resulting bam files to SLAMDUNK.

Best, Isaac

ZiggeyQi commented 6 months ago

Hi Isaac,

Your interpretation is really helped me, and correct my misunderstanding about non-stranded library. So, because my library is a reverse strand cDNA, if using single end sequencing to got the transcripts and SLAMDUNK to quantify them, i should reverse complement the raw reads to use SLAMDUNK to get the accurate quantity of T>C conversion reads. But my sequencing is paired end, so the R1 reads were the reverse complement of the R2 reads(the outputting reads of the single end sequencing), therefore, i don't need reverse complement my reads, just using Hisat-3N to map my data by paired-end model with the T>C conversion index, then we can get the raw nascent reads by selecting the reads who Yf-tag>0. My “Mind map” depict as the below illustration, please help me to check its right or not. image Well, the polyA tail enrichment step was omitted in my library preparation, because i wanna focus on the global transcription state of all genes but not just the output of the protein coding genes' transcription process. And, i also need to separate the sense and antisense transcripts, so i can not use SLAMDUNK to quantify my nascent reads, could you give me some advicese to get my research goals. Thanks again for your kindly and helpful interpretation and suggestion, i learn a lot.

Best, Ziggey

isaacvock commented 6 months ago

Hi Ziggey,

Your depiction of the library prep seems accurate, but the details are irrelevant to the task at hand. Your options for analyzing paired-end SLAM-seq data with reverse strandedness are as follows:

  1. Reverse complement your R1 reads, align the R1 and R2 reads separately with HISAT-3N treating them as single-end, and pass them individually to SLAMDUNK for mutation calling. See #57 and #147 for corroboration of this strategy.
    • Advantages: SLAMDUNK is widely used, user-friendly software that you seem to already be familiar with and thus will probably be the easiest strategy for your to implement quickly
    • Disadvantages: 1) This strategy does not make full effective use of the paired-end nature of your data. 2) SLAMDUNK forces you to resort to a strategy of calling all reads with a certain number of reads "nascent", which is statistically sub-optimal. Reads from old RNA can have mutations due to sequencing, alignment, and RT/PCR errors. Reads from nascent RNA can have no mutations due to s4U having to compete with regular uridine for incorporation into RNA. Thus, it is better to use a strategy called mixture modeling originally introduced in the methods section of this paper and discussed in more detail here and here.
  2. Use a separate pipeline that I developed called bam2bakR which was originally designed to process the same kind of libraries (paired-end, reverse stranded) that you are working with. bam2bakR filters out multimapping and unmapped reads, assigns reads to genes, counts mutations in reads, and generates a summarized table that can be passed to an R package that I also developed called bakR to analyze the data.
    • Advantages: 1) bakR uses the more statistically optimal mixture modeling strategy discussed in the disadvantages to strategy 1. 2) bam2bakR provides the data in a form that can either be analyzed with existing tools like bakR, or that can be analyzed with novel in-house strategies as you see fit. That is because unlike options 1) and 3), it provides convenient access to the read-by-read mutational data in a compressed table that follows the principles of tidy data.
    • Disadvantages: 1) bam2bakR is not as computationally efficient as option 3), 2) bam2bakR is a Snakemake workflow as opposed to a convenient command line tool like SLAMDUNK or the software discussed in option 3), making it a bit clunkier to work with.
  3. Use GRAND-SLAM to assign reads to genes and estimate the fraction of reads that are from RNA synthesized during the s4U labeling (i.e., nascent reads).
    • Advantages: 1) GRAND-SLAM like bakR uses the more statistically optimal mixture modeling strategy. 2) GRAND-SLAM is highly performative and is able to rapidly process many individual bam files. 2) GRAND-SLAM has a companion R package called grandR which makes it easy to analyze its output.
    • Disadvantages: 1) GRAND-SLAM, in my opinion, overprocesses the data providing only highly processed statistical estimates (i.e., estimates of the fraction of sequencing reads from each gene that came from nascent RNA), which can make debugging analyses difficult. 2) The documentation on the GRAND-SLAM website and in their paper is a bit scant in terms of discussion of the inner-workings of GRAND-SLAM and how it implements the statistical models that it uses.

And, i also need to separate the sense and antisense transcripts

All three of these strategies will only analyze reads from sense transcripts, throwing out reads (or in the case of option 2) labeling antisense reads as "unassigned") that come from antisense transcription.

Let me know if you have any questions, Isaac

ZiggeyQi commented 6 months ago

Hi Isaac,

You are super nice! Thanks a lot for your exhaustive interpretation. I‘ll take your recommondation for using SLAMDUNK to get the count, but i still curious about Reverse complement my R1 reads, but not R2 reads. Because my library is reverse strand cDNA, so the R2 reads represent the reverse strand and the R1 is the forward, the SLAMDUNK processing the forward single end reads, so i should reverse complement my R2 reads and keep R1 unchanged for alignment, which could matched the count step of SLAMDUNK properly.

The developing version of bam2bakR adopt Featurecounts to counting the interest reads, so can i just using the SLAMDUNK filter and snp to get the final BAM(filtering the snp and met the threshold of a difined nascent reads by SLAMDUNK, even this is not the best way to recognize nasent reads, due to the reasons you mentioned above), and based on the GTF annotation to get all the reads located in the genes' upsteam 3k to down stream 5k and defining the sense and antisense reads to get the separate BAM files, finally inputting them to the Featurecounts to get the count, then i can get the sense and antisense gene count.

Best, Ziggey

isaacvock commented 6 months ago

Hi Ziggey,

Reverse strandedness means read R2 represents the RNA's sequence, not its reverse complement. Read R1 represents the reverse complement of the RNA's sequence. SLAMDUNK assumes the read it analyzes represents the original sequence of the RNA. Thus, you must reverse complement read R1 but not read R2.

Your proposed procedure for delineating sense and antisense reads sounds reasonable.

Best, Isaac

ZiggeyQi commented 6 months ago

Hi Isaac,

I was confused about some details, but now i got it. With your comprehensive explanation, i finally figured out all the details. You saved me from frustration, thank you again for your kind and helpful interpretation.

Best wishes to you, Ziggey