suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data
Other
226 stars 49 forks source link

what human reference genome should i use? #149

Closed yoongmin closed 2 years ago

yoongmin commented 2 years ago

Hi, I'm running RNAseq for CML patients. However, the fusion.tsv and visualization pdf did not show the main fusion gene for CML (eg: BCR, ABL1 gene). I'm using GRCh38+GENCODE28 for annotation. Is this the correct human reference genome for CML patients?

Or which human reference genome should I use for CML patients? I wonder which reference genome should I refer to as so many human reference genomes available...

suhrig commented 2 years ago

Hi,

In order to detect BCR-ABL1 fusions, any of the available human reference genomes should work. The detection rate of this fusion does not depend on the reference genome. Much rather it is a matter of alignment parameters or sample quality or the fusion was missed, because it is expressed at a low level. To narrow down the potential root causes, can you answer the following questions, please?

If you want/can share the data with me, I can have a look at the alignments and try to figure out what is going wrong. It should be sufficient to send the reads mapping to BCR and ABL1.

Regards, Sebastian

yoongmin commented 2 years ago

Hi Sebastian, The read length is 150bp and is paired end data. The bam details as below: Number of input reads | 238188256 Average input read length | 199 UNIQUE READS: Uniquely mapped reads number | 207060218 Uniquely mapped reads % | 86.93% Average mapped length | 197.68 Number of splices: Total | 90671304 Number of splices: Annotated (sjdb) | 88770559 Number of splices: GT/AG | 89602719 Number of splices: GC/AG | 662091 Number of splices: AT/AC | 62066 Number of splices: Non-canonical | 344428 Mismatch rate per base, % | 0.33% Deletion rate per base | 0.02% Deletion average length | 1.96 Insertion rate per base | 0.02% Insertion average length | 1.71 MULTI-MAPPING READS: Number of reads mapped to multiple loci | 0 % of reads mapped to multiple loci | 0.00% Number of reads mapped to too many loci | 25108427 % of reads mapped to too many loci | 10.54% UNMAPPED READS: Number of reads unmapped: too many mismatches | 0 % of reads unmapped: too many mismatches | 0.00% Number of reads unmapped: too short | 4696516 % of reads unmapped: too short | 1.97% Number of reads unmapped: other | 117053 % of reads unmapped: other | 0.05% CHIMERIC READS: Number of chimeric reads | 1616008 % of chimeric reads | 0.68%

I've attached the pdf, tsv and discarded.tsv file in the wetransfer link for your reference. Do let me know if you need any other info. MR21_2353.pdf https://we.tl/t-jDrg6TbUAH

Your advice is much appreciated.

suhrig commented 2 years ago

Thanks for the details. There is no evidence of the fusion in the discarded file. And overall the number of split reads is low for all of the fusions. I suspect the following to be the underlying issue: The insert size ("Average input read length") is rather small compared to the read length. This will lead to most of the paired-end reads overlapping. Older versions of STAR had trouble aligning split reads when paired-end mates overlap. There is a special parameter to circumvent this issue: --peOverlapNbasesMin. However, if you want to use this feature with Arriba, you need STAR >=2.7.6a. Moreover, in the most recent version of STAR (2.7.10a), there was another enhancement to improve the chimeric alignment of overlapping paired-end reads. Which version of STAR do you use? Do you use the recommended alignment parameters for fusion detection?

To confirm my suspicion, would it be possible to send me the output of this command? It extracts the alignments from BCR and ABL1 (hg38 coordinates)?

samtools view -b Aligned.out.sortedByCoord.bam 9:130711881-130889675 22:23178509-23320037 > output.bam
yoongmin commented 2 years ago

I'm currently using STAR 2.7.3a. I use aloft.sh to generate the data. The read length is 100bp paired end reads. Sorry for the wrong info just now.. I've also attached my fastp info for your reference. General fastp version: 0.21.0 (https://github.com/OpenGene/fastp) sequencing: paired end (100 cycles + 100 cycles) mean length before filtering: 100bp, 100bp mean length after filtering: 99bp, 99bp duplication rate: 51.330759% Insert size peak: 164 Before filtering total reads: 483.096820 M total bases: 48.309682 G Q20 bases: 46.705321 G (96.679007%) Q30 bases: 43.189458 G (89.401247%) GC content: 50.208883% After filtering total reads: 476.376512 M total bases: 47.585784 G Q20 bases: 46.195388 G (97.078128%) Q30 bases: 42.825584 G (89.996594%) GC content: 50.061079% Filtering result reads passed filters: 476.376512 M (98.608911%) reads with low quality: 6.719082 M (1.390835%) reads with too many N: 1.194000 K (0.000247%) reads too short: 32 (0.000007%)

I've attached the output.bam for your reference. https://we.tl/t-7xfsiEiRFS

Let me try to upgrade my STAR version and re-run again..

suhrig commented 2 years ago

I had a quick glance at your alignments. Despite the reads being only 100nt long, they do overlap frequently. Recent versions of STAR can deal with this. Make sure to realign using the optimized alignments parametes! Otherwise, the recent versions of STAR are not much better in dealing with this than old ones.

yoongmin commented 2 years ago

Is it means that by running run_arriba.sh script can solve the problem?

suhrig commented 2 years ago

Yes! The script uses the optimized parameters.

suhrig commented 2 years ago

You will need to install a recent version of STAR to use these parameters. Maybe it is the most convenient for you to use Docker or Singularity. Check out the quickstart guide for all supported installation procedures: https://arriba.readthedocs.io/en/latest/quickstart/

yoongmin commented 2 years ago

Do I have to upgrade my arriba also? My current version is arriba v2.0.0..

suhrig commented 2 years ago

Newer is always better, but for this particular fusion it does not matter.

suhrig commented 2 years ago

I had a close look at the alignments you sent me. The overlapping reads could explain a false negative. But the coverage is really high in BCR and ABL1. Despite many of the mates overlapping, I would expect to see at least some chimeric reads in one of the two genes. But I could not find any. How certain are you that the fusion should be found in this sample? Is it only missing in this sample or is it not detected in a whole bunch of samples all of which are CML? If only this particular sample is affected, then maybe it is a cryptic rearrangement that does not have breakpoints at the common sites in BCR/ABL1 and cannot be detected using RNA-Seq. Or maybe the fusion really is not there. One simple yet powerful trick that I have used in the past to check for evidence of a missed fusion is to "grep" the FastQ files for the junction sequences. It only takes a few minutes to run this check. You can use the following command to search the FastQs for the most common BCR-ABL1 junction sequences:

zgrep 'CAGAGTTCAAAAGCCCTTCA\|TGAAGGGCTTTTGAACTCTG\|AATAAGGAAGAAGCCCTTCA\|TGAAGGGCTTCTTCCTTATT\|CAGAGTTCAAGTGAAAAGCT\|AGCTTTTCACTTGAACTCTG\|GGAGACGCAGAAGCCCTTCA\|TGAAGGGCTTCTGCGTCTCC' read1.fastq.gz read2.fastq.gz

If this command does not return any reads, then the fusion is very likely not there or just very hard to detect for whatever reason.

yoongmin commented 2 years ago

I've sequenced 8 samples from CML patients but none of samples have BCR and ABL1 fusion gene. That's why I wonder is there something wrong with my analysis. I tried to "grep" but none of the sequence matched in all the fastq file. Is it common for CML sample that BCR and ABL1 gene not found in the sample? In addition, for the library preparation part, I used RNA directional library preparation. Will this affected the sequencing reads which caused no sequence for BCR and ABL1 in the run..?

yoongmin commented 2 years ago

Hi Sebastian, do you mind to check one of my fastq data..? Just wanna make sure what I've done is correct.. The link for raw data as below: https://drive.google.com/drive/folders/1FvzHi7e3MXn6-nNJabw87q6OhypzT-Iu?usp=sharing

Your help is much appreciated.. Thank you..

suhrig commented 2 years ago

Is it common for CML sample that BCR and ABL1 gene not found in the sample?

No. In my experience they are detected robustly. I have analyzed a number of CML samples and can't remember a case where it was not found. And the grep method is highly likely to detect it if it's there.

In addition, for the library preparation part, I used RNA directional library preparation. Will this affected the sequencing reads which caused no sequence for BCR and ABL1 in the run..?

Stranded RNA libraries are a good thing. The strand information helps map reads to the correct originating gene. So if anything, this should improve the results. Overall, your RNA library looks fine, apart from the overlapping mates, but as I said the overlapping mates are not severe in your case and can hardly explain why the fusions are missed.

do you mind to check one of my fastq data..?

I have processed the sample using my own scripts. I can confirm that the fusion is not detected - neither by Arriba nor by grep. So it seems you have done everything correctly.

It's a desperate guess, but: How certain are you that the fusion should be there? I mean, for CML the fusion is a hallmark aberration and should definitely be present, but did you actually confirm it in some of the samples using FISH or other methods? CD34 is a common marker for CML, but I noticed it is expressed at a low level in the sample you sent me. No CD34 and no BCR-ABL1 fusion are two arguments against the sample really being a CML. Were the samples enriched for the right cell type if there was an enrichment step? The expression profile of the sample you sent me clearly looks like leukocytes, but maybe not CML ... Perhaps they are benign cells?

yoongmin commented 2 years ago

The lab confirmed the result using FISH but I'm not sure do they using the same samples for sequencing run since the data do not have these fusion genes. FYI, we only used Dynabeads mRNA purification kit followed by RNA directional library prep. So, I don't think it will affect our sequencing data.. Ya.. perhaps they are benign cells.. need to confirm with the lab side again.. May I know based on your analysis results, do the sample data given has any other important fusion genes that can related to CML patient?

suhrig commented 2 years ago

FYI, we only used Dynabeads mRNA purification kit followed by RNA directional library prep. So, I don't think it will affect our sequencing data..

I agree, this should not affect the fusion detection in a negative way. But maybe it could be influenced positively by enriching for the relevant cells before sequencing. Depends on how prominent the leukemic clone is in the blood.

do the sample data given has any other important fusion genes that can related to CML patient?

Most of the fusion calls look like potentially benign transcripts, rare germline variants or artifacts. There is one fusion involving PIM3, a gene that has been implicated in AML and shown to be involved in fusions. The detected breakpoints are out-of-frame, however, and the read support is not great, so I am not 100% convinced this is relevant or even a correct call. If anything, it may be a subclonal variant.

I wish you good luck figuring out with your lab colleagues how these unexpected results can be explained.

yoongmin commented 2 years ago

Dear Sebastian,

Sure. Will figure out with my lab colleagues. Thanks a lot for your help! =)

Regards, Yoong Min