Single-End vs Paired-End behaviour for split1/2, discordant and coverage counts

ahwanpandey commented 4 months ago

Hello,

Firstly, thanks for sharing this awesome tool!

I have some targeted Qiagen Fusion XP data that is paired end (R1=230bp x R2=70bp) with about 50bp of R2 that is usable after trimming UMIs and other sequences. The data has been mapped using STAR and deduplicated using UMI-tools. The library was made to enrich some known fusion genes with hopes of also finding some novel fusions.

The example below is for a novel fusion. ABCB1 is the known fusion gene which was part of the panel design, whereas TMEM243 is the novel partner gene. To do some comparisons, I first mapped the full cleaned and trimmed raw data and then extracted just the reads mapping at the novel TMEM243 fusion loci. Then remapped the paired end reads separately again, and just R1 and just R2 separately. Not sure if this is relevant information for my question but just wanted to summarise everything I had done.

The STAR version I am using is 2.7.11b and the Arriba version is 2.4.0

The STAR mapping parameters are:

$STAR_PATH \
--runMode alignReads \
--runThreadN 16 \
--genomeDir $STAR_GENOME_DIR \
--readFilesIn $CLEANED_FASTQ_R1_FILES_wUMI $CLEANED_FASTQ_R2_FILES_wUMI \
--readFilesCommand zcat \
--limitBAMsortRAM 16000000000 \
--outFileNamePrefix $OUTPUT_DIR/Alignments/$SAMPLE_NAME. \
--outReadsUnmapped Fastx \
--outSAMtype BAM Unsorted \
--outSAMmode Full \
--outSAMattributes All \
--outSAMunmapped Within \
--outSAMprimaryFlag AllBestScore \
--outFilterScoreMinOverLread 0.1 \
--outFilterMatchNminOverLread 0.1 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--alignSJoverhangMin 9 \
--alignSJDBoverhangMin 3 \
--alignSJstitchMismatchNmax 5 -1 5 5 \
--alignSplicedMateMapLmin 12 \
--alignSplicedMateMapLminOverLmate 0.5 \
--outFilterMultimapNmax 50 \
--chimSegmentMin 10 \
--chimJunctionOverhangMin 10 \
--chimOutType WithinBAM SoftClip \
--chimSegmentReadGapMax 3 \
--chimScoreDropMax 30 \
--chimScoreSeparation 1 \
--chimScoreJunctionNonGTAG 0 \
--chimMultimapNmax 50 \
--peOverlapNbasesMin 10 \
--peOverlapMMp 0.01 \
--twopassMode Basic

The Arriba parameters are:

$ARRIBA_PATH \
-x deduplicated.bam \
-g $GTF_FILE \
-a $GENOME_FASTA \
-b blacklist_hg38_GRCh38_v2.4.0.tsv.gz \
-k known_fusions_hg38_GRCh38_v2.4.0.tsv.gz \
-p protein_domains_hg38_GRCh38_v2.4.0.gff3 \
-o fusions.tsv \
-O fusions.discarded.tsv \
-s no \
-i 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X \
-E 0.3 \
-S 2 \
-R 10000 \
-A 23 \
-M 4 \
-U 1000 \
-u

Mapping using both R1 and R2 together as paired end shows the following in IGV at the breakpoints:

Mapping using R1 only shows the following in IGV at the breakpoints:

Mapping using R2 only shows the following in IGV at the breakpoints:

Running Arriba for the above 3 BAMs gives the following result: I have highlighted in red the relevant part of my question

Is this the correct behavior or am I doing something wrong? I'm not sure why "split_reads2" and "coverage2" are so different when using paired end and just R1 by itself? I understand there might not be any discordant reads in this example but there are many in other fusions so I believe paired end is the ideal way to go for our data, and also the recommended way to find fusions.

Ultimately I just wanted to understand if this makes sense and how i could go about rescuing or dividing those those "split_reads2" and "coverage2" counts appropriately in the paired end data.

Thanks for your time! Ahwan

suhrig commented 4 months ago

This is the expected behavior.

The assignment of a read to split_reads1 or split_reads2 is made on the basis of where the most nucleotides map. For example, given a read length of 100nt, if 51nt map to gene1 and 49nt map to gene2, then the read will be attributed to split_reads1. This is how Arriba does it for single-end reads.

For pairend-end reads, the decision is made according to where the anchor read maps, which results in pretty much the same behavior. With paired-end reads, you have one read mapping fully to a gene (the anchor), part of the mate mapping to the same gene (the actual split read), and a supplementary alignment mapping to the other gene.

So in one sentence: the counts differ between single-end and paired-end, because the latter takes both mates into account.

And to answer your question: paired-end is definitely better for fusion detection than single-end!

Let me know if anything is still unclear.

ahwanpandey commented 4 months ago

@suhrig thanks for your explanation. I see...

Definitely paired end is better! But in this case seems treating as single end and using just the longest of the pair I.e. R1 is giving better results? I mean using just R1 the coverage and split reads at both the genes seems to be balanced and the confidence is high instead of medium as in the paired end result.

suhrig commented 4 months ago

The confidence is high for the single-end method only because the reads are balanced across split_reads1 and 2. Either way the fusion would be detected for sure, given the good read support.

I agree it's a bit odd that the paired-end method yields fewer supporting reads. Have you tried to figure out where they went? Did STAR fail to align some of the reads?

I guess you cannot share the sample, because it's patient data. Otherwise, I could have taken a look myself.

ahwanpandey commented 4 months ago

Hi @suhrig I've shared the FASTQ files. As mentioned in my fist post (and re-iterated below) these are just reads that initially mapped to the TMEM243 loci.

"I first mapped the full cleaned and trimmed raw data and then extracted just the reads mapping at the novel TMEM243 fusion loci"

https://www.dropbox.com/scl/fo/kss6xp5lg9yfw76vmt4k7/h?rlkey=hb9uieje2n9onud5m9bqavysf&dl=0

suhrig commented 4 months ago

Thank you for the FastQ files. I believe I have figured out what is going wrong here. It's an alignment issue. STAR fails to align many of the reads, when run in paired-end mode. I suspect it has to do with the imbalanced length of read 1 vs. read 2. As read 1 is so much longer than read 2, STAR is likely to find more seeds for read 1, making it a more promising starting point to commence full alignment. It then aligns read 1 as much as it can until the fusion junction. The problem is that it needs to align the rest of read 1 AND all of read 2 to the the other fusion gene. It would have to create a supplementary alignment for both the remainder of read 1 as well as for the entire read 2. As far as I am aware STAR does not support this kind of alignment. Chimeric alignments always have to be structured in a way that there must be a normal alignment for both read 1 and read 2, and a supplementary alignment for a fraction of only one of the two reads.

Here is a workaround, which should be very effective: You can first try to align all reads in paired-end mode as usual. Let's call this output file paired-end.bam. Next, extract all unaligned reads from the paired-end BAM file:

samtools view -f4 paired-end.bam |
awk '{print "@"$1"_realign_se_"NR"\n"$10"\n+\n"$11}' |
gzip > realign_se.fastq.gz

It is important that the read names be modified such that they are not the same as the original ones (hence, the above awk command appends _realign_se_...). Otherwise Arriba may get confused to find the same read multiple times.

Next, align the file realign_se.fastq.gz in single-end mode as you have done in your tests. Let's say, the output file of this alignment is called realign_se.bam.

Finally, concatenate the paired-end alignments and the realigned single-end alignments in order to pass both of them as input to Arriba:

samtools cat paired-end.bam realigned_se.bam |
arriba -x /dev/stdin ...

You should now find substantially more supporting reads and some reads with realign_se in their name in the read_identifiers column.

ahwanpandey commented 4 months ago

@suhrig Thanks so much for the advice. BTW, the FASTQ files I shared actually originate from the mapped reads at the TMEM243 loci as I mentioned in the first post. I.e. they were all originally successfully mapped by STAR.

I went ahead and did your workflow on my test FASTQ files but didn't see much improvement. And still nothing in split_reads2 and very low coverage numbers for coverage2 :/ The unaligned reads from the paired-end.bam file were all the short R2 reads and they seem to just add a bit more coverage to the first gene.

The paired-end.bam actually has a high mapping percentage and the 61 singletons you see are actually the unmapped R2 mates that get extracted when I run samtools view -f4 paired-end.bam

858 + 0 in total (QC-passed reads + QC-failed reads)
7 + 0 secondary
221 + 0 supplementary
0 + 0 duplicates
797 + 0 mapped (92.89% : N/A)
630 + 0 paired in sequencing
315 + 0 read1
315 + 0 read2
492 + 0 properly paired (78.10% : N/A)
508 + 0 with itself and mate mapped
61 + 0 singletons (9.68% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

On a side note, I also wanted to ask you another question if you could help me understand the following parameters in STAR and how I can optimise them for my use case:

--alignIntronMax 
--alignMatesGapMax

The distance between two fusion genes that are well know in our cohort is about 275kb. I guess I'm trying to understand how the following parameters actually work and what it means. Sure we know of a fusion event that is 275kb apart, but what about the ones we don't know about how close or how far in the same chromosome? How does this affect them?

Thanks so much. Ahwan

suhrig commented 4 months ago

didn't see much improvement. And still nothing in split_reads2

Odd. In my tests, there was a notable difference. Maybe my alignments are different. Would it be possible to send me your alignments (from the regions around the breakpoints) from the single-end R1 alignment and from the PE+SEunmapped alignment? I would like to find out which alignments are different.

--alignIntronMax --alignMatesGapMax

These parameters determine how STAR aligns reads with long gaps. If the gap is longer than the given distance, STAR will (1) flag paired-end mates as discordant if the gap is between the mates or (2) make a supplementary alignment if the gap is inside a read. If the gap is shorter than the given distance, then paired-end mates are flagged as properly paired and split reads are aligned with an N in the cigar string.

Whatever you put as a value here (within reason, of course) should have little impact on Artiba's results. It is perfectly fine to set it to a greater value than the 275kb you mention. Arriba extracts all reads which cross gene boundaries regardless of how they are aligned.

FWIW: I run STAR with the parameters set to 1.1Mbps, because this is the size of the biggest intron according to GENCODE.

ahwanpandey commented 4 months ago

@suhrig I've uploaded the following to the Dropbox link: https://www.dropbox.com/scl/fo/kss6xp5lg9yfw76vmt4k7/h?rlkey=hb9uieje2n9onud5m9bqavysf&dl=0

test_r1 -> using just "test_r1.fastq"
test_PE_plus_SEunmapped -> following your workflow as you suggested

Thanks so much much again for your input and help!

suhrig commented 4 months ago

The file test_PE_plus_SEunmapped.bam only contains realigned reads with length 50nt. There should be many realigned reads with a length of 230nt. Maybe I didn't explain myself well before. In the first step, you should align read1 and read2 in paired-end mode. Next, you should realign all unmapped reads in single-end mode. Is this what you did?

ahwanpandey commented 4 months ago

@suhrig I did do that and the paired-end.bam has the following stats:

]$ samtools flagstat paired-end.bam

858 + 0 in total (QC-passed reads + QC-failed reads)
7 + 0 secondary
221 + 0 supplementary
0 + 0 duplicates
797 + 0 mapped (92.89% : N/A)
630 + 0 paired in sequencing
315 + 0 read1
315 + 0 read2
492 + 0 properly paired (78.10% : N/A)
508 + 0 with itself and mate mapped
61 + 0 singletons (9.68% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

when I extract the unmapped reads as you noted, it only extracts R2 reads which are the 61 singletons from the above paired-end.bam

]$ samtools view -bh -f4 paired-end.bam | samtools flagstat -

61 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00% : N/A)
61 + 0 paired in sequencing
0 + 0 read1
61 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

suhrig commented 4 months ago

I have found the reason for our discrepant observations: I forgot to copy some of the alignment parameters from your first post in my tests. Specifically, the following two parameters lead to different behavior: --outFilterScoreMinOverLread 0.1 --outFilterMatchNminOverLread 0.1. With these parameters in place, STAR aligns reads even if only a short segment can be mapped. What will happen frequently is that part of the longer read1 is aligned, even when the mate read2 cannot be mapped. In such situations STAR does not make an attempt to map the clipped segment of read1. Previously, I thought that both paired-end mates would be unmapped, but not if the above parameters are set to low values.

The code snippets I sent you earlier only extract entirely unmapped reads. They do not extract reads where only one of the mates is unmapped. Here is an improved version, which extracts unmapped reads and clipped reads without supplementary alignment (SA:Z:) and and unmapped mate.

samtools view paired-end.bam |
awk -F '\t' 'and(4,$2) || !and(2304,$2) && and(8,$2) && !/\tSA:Z:/ && $6~/[0-9][0-9]S/ {print "@"$1"_realign_se_"NR"\n"$10"\n+\n"$11}' |
gzip > realign_se.fastq.gz

As previously, you should realign this file in single-end mode. Afterwards, you pass both the paired-end.bam and the realigned_se.bam files as input to Arriba. However, this time you need to make sure not to pass the previously partially aligned reads twice. The awk command takes care of this:

(
samtools view -h realigned_se.bam
samtools view paired-end.bam |
awk -F'\t' '!(!and(2304,$2) && and(8,$2) && !/\tSA:Z:/ && $6~/[0-9][0-9]S/)'
) |
arriba -x /dev/stdin ...

Hopefully, this works for you as well as for me. It's a bit of a hack. But STAR seems to struggle with the imbalanced read lengths, and that can't be changed easily.

ahwanpandey commented 4 months ago

Hi @suhrig thanks so much for your knowledge and detailed instructions and helping me get the best out of this data. I followed your advice and there is quite a bit of improvement!

I'll leave this issue open for a little bit while I run the analyses properly on the full data in case I have any more queries. But I think you've helped me a lot already.

Thanks again!

suhrig commented 3 months ago

Did you encounter any more problems?

ahwanpandey commented 3 months ago

Hi @suhrig . Thanks for checking in. All good so far. Thanks again. I'll close the issue for now.

suhrig / arriba

Single-End vs Paired-End behaviour for split1/2, discordant and coverage counts #230