simonlabcode / bam2bakR

2 stars 0 forks source link

Strandness data speficification for --directional-mapping-reverse option on HISAT3N #17

Closed paulinarosales closed 3 months ago

paulinarosales commented 4 months ago

Hello,

I’m using part of the pipeline for the mutation calling of mRNA PE data, since my data is reverse stranded I implemented my own mapping step to add some options to HISAT3N like the --directional-mapping-reverse argument instead of the regular --directional-mapping nevertheless when using the mut_call.py script and specifying the strandness as R I get unexpected results like no T>C conversions and high A>G conversions on my ERCCs spike-ins.

Due to this I’m wondering if using the strandedness = R argument for the mutation calling on top of the --directional-mapping-reverse in the alignment is adding an unnecessary complement-making step or if there is any recommendation to use this script/tool with the options for reverse stranded data.

Thanks!

isaacvock commented 4 months ago

Hello,

I will spend some time today looking into this and get back to you.

What I can say right now is that the first person I alpha-tested bam2bakR with 2 years ago was using hisat-3n aligned bams with the same directional mapping setting you mentioned, and ran into the same problem of their library seemingly being converted from reverse to forward stranded. In their case there might have been some other things they were doing that were the actual root cause of this quirk, but I can't remember as it was so long ago. That's just to say, there is evidence that the hisat-3n settings you are using do in fact flip the reads or reverse complement them so as to make it effectively a forward stranded library.

Best, Isaac

paulinarosales commented 4 months ago

Thanks for the quick answer! Also I forgot to add that I also use the reverse strandness parameter for featureCounts (for read attribution ) as -s 2 and the percentage of attributed counts is much higher than when using the option for forward stranded, so my guess us that the HISAT3N option for reverse mapping is not "flipping" the reads, but would be nice to now where the difference is coming from. I hope this is helful :)

isaacvock commented 4 months ago

I was going to ask you how featureCounts read assignment went, that is good to know but thickens the plot and definitely confuses me more.

I ran some tests but unfortunately was not able to reproduce the mutation counting anomaly. I aligned the same reverse stranded, paired-end fastq file using hisat3n with 5 different settings:

# Directional mapping reverse
hisat-3n <basic args> --directional-mapping-reverse

# Directional mapping default
hisat-3n <basic args> --directional-mapping

# Directional mapping reverse and rna-strandness set
hisat-3n <basic args> --directional-mapping-reverse --rna-strandness RF

# Only set rna-strandness
hisat-3n <basic args> --rna-strandness RF

# Default settings for everything
hisat-3n <basic args>

Where <basic args> for me was:

-p 4 -x indices/chr21_test -q -1 WT1_test_R1.fastq.gz -2 WT1_test_R2.fastq.gz --base-change T,C

I built indices with the following code (nothing fancy like splice-site and exon specification for the sake of rapid testing):

hisat-3n-build -p 4 genome.fasta chr21_test

In all cases, using bam2bakR, I had the expected high T-to-C mutation rate. Default directional mapping had a lower T-to-C rate, slightly lower alignment rate, and slightly higher A-to-G mutation rate than the other settings, but the T-to-C rate was an order of magnitude higher than A-to-G mutation rates in all cases.

For these tests, I used version 2.2.1-3n-0.0.3 of HISAT-3n and the current version of bam2bakR on the main branch of this repo.

Some thoughts:

  1. You could send me a downsampled version of one your bam files and I could see if setting strandedness to R in bam2bakR leads to the same behavior for me.
  2. While working on this, I was able to dig up some old emails from the person who was having similar troubles previously. In there case it seems to have turned out that they just had the strandedness of their library incorrect. So it may not have been a hisat-3n problem in that case, though I still remember us going back and forth about hisat-3n for some reason that may not have made it to any emails. I guess I would suggest double checking the strandedness of your libraries, but I suspect this isn't the problem in your case. NOTE: I typed this up before I saw your featureCounts message, which pretty much rules out any strandedness misunderstandings.
  3. From your message, am I right to assume that you are just calling the mut_call.py script directly rather than running your bam file through the full pipeline? If so, that's the only difference between what you and I are doing, and thus it may be worth trying to run bam2bakR in its entirety (starting from your bam files, of course). That shouldn't make a difference, but it would be a good sanity check. [EDIT]: thinking about it a bit more, if you are in fact calling the raw mut_call.py script, then the most parsimonious explanation of the results you are seeing is that you are failing to overwrite the default setting of the --strandedness parameter, which is 'F'. So maybe its something small like spelling of the parameter or how you are setting it if/when you are calling the script directly.

Best, Isaac

paulinarosales commented 3 months ago

Thank you very much for going through the task of testing the different settings. Regarding the --strandedness I did check that the parameter is switched to R, so that also shouldn't be the problem. My first guess was the switch on strandness but after our discussion and also testing it on my own I can eliminate the possibility of the issue being related to the alignment and mutation calling. I hope the issue can help as reference for users working with reverse stranded data. Thanks again for the quick and helpful answer.

Best, Paulina

isaacvock commented 3 months ago

Sorry to bother you with a follow-up, but can you clarify whether or not you were able to solve the problem, and if so, what the solution was?

Best, Isaac