suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data
Other
214 stars 50 forks source link

FusionCatcher Dataset #47

Closed osmanwa closed 4 years ago

osmanwa commented 4 years ago

HI,

I am trying to catchem' all with the FusionCatcher Dataset + Arriba: https://github.com/ndaniel/fusioncatcher/tree/master/test

I tried with the following command line arguments:

STAR \
    '--runMode alignReads' \
    --alignIntronMax \
    1000000 \
    --alignIntronMin \
    20 \
    --alignMatesGapMax \
    1000000 \
    --alignSJDBoverhangMin \
    1 \
    --alignSJoverhangMin \
    8 \
    --chimJunctionOverhangMin \
    15 \
    --chimOutType \
    WithinBAM \
    SoftClip \
    --chimSegmentMin \
    15 \
    --genomeDir \
    /var/lib/cwl/stg12a41017-ebd1-4396-9fdb-ef251725790f/star \
    --genomeLoad \
    NoSharedMemory \
    --limitBAMsortRAM \
    60000000000 \
    --limitOutSAMoneReadBytes \
    90000000 \
    --outFilterIntronMotifs \
    RemoveNoncanonical \
    --outFilterMismatchNmax \
    10 \
    --outFilterMismatchNoverLmax \
    0.1 \
    --outFilterMultimapNmax \
    10 \
    --outFilterType \
    BySJout \
    --outReadsUnmapped \
    Fastx \
    --outSAMmapqUnique \
    255 \
    --outSAMstrandField \
    intronMotif \
    --outSAMtype \
    BAM \
    SortedByCoordinate \
    --outSAMunmapped \
    Within \
    --outSAMmode \
    Full \
    --readFilesCommand \
    zcat \
    --runThreadN \
    8 \
    --seedSearchStartLmax \
    30 \
    --readFilesIn \
    /var/lib/cwl/stgda69f516-ac0d-4ab1-a638-7b3621f260fe/joinedfiles.dat \
    /var/lib/cwl/stg0b23d764-be65-4426-85b3-c6a7ac523b44/joinedfiles.dat

    arriba \
    -g \
    /var/lib/cwl/stgc586d5c5-6f47-4507-b3c8-32bc67801d2b/gencode.v32.annotation.gtf \
    -a \
    /var/lib/cwl/stg807396cf-c9a5-41d9-a791-8b4420370377/GRCh38.primary_assembly.genome.fa \
    -b \
    /var/lib/cwl/stga879b704-1ea1-4158-bc0a-761a72af1f02/blacklist_hg38_GRCh38_2018-11-04.tsv.gz \
    -x \
    /var/lib/cwl/stgb4d5992c-53d9-4c45-adee-e4de4f54027c/Aligned.sortedByCoord.out.bam \
    -O \
    fusions.discarded.tsv \
    -o \
    fusions.tsv \
    -P \
    -P

and got ten:

#gene1  gene2   strand1(gene/fusion)    strand2(gene/fusion)    breakpoint1     breakpo$
FGFR3   TACC3   +/+     +/+     4:1806934   4:1727977   splice-site     CDS    $
FIP1L1  PDGFRA  +/+     +/+     4:53425965  4:54274925  splice-site     CDS    $
HOOK3   RET     +/+     +/+     8:42968214  10:43116584     splice-site     splice-$
AKAP9   BRAF    +/+     -/-     7:92003235  7:140787584     splice-site     splice-$
EWSR1   ATF1    +/+     +/+     22:29287134     12:50814280     splice-site     splice-$
ETV6    NTRK3   +/+     -/-     12:11869969     15:87940753     splice-site     splice-$
EML4    ALK     +/+     -/-     2:42301394  2:29223584  splice-site     5'UTR  $
BRD4    NUTM1   -/-     +/+     19:15254152     15:34347969     splice-site     splice-$
GOPC    ROS1    -/-     -/-     6:117566854     6:117321394     splice-site     splice-$
TMPRSS2 ETV1    -/-     -/-     21:41494375     7:13935838  CDS     CDS     translo$

Then I tried changing MaxReads and MaxEValue which didnt increase the number of fusions, then tried disabling a few filters (min_support\many_spliced):

    STAR \
    '--runMode alignReads' \
    --alignIntronMax \
    1000000 \
    --alignIntronMin \
    20 \
    --alignMatesGapMax \
    1000000 \
    --alignSJDBoverhangMin \
    1 \
    --alignSJoverhangMin \
    8 \
    --chimJunctionOverhangMin \
    15 \
    --chimOutType \
    WithinBAM \
    SoftClip \
    --chimSegmentMin \
    15 \
    --genomeDir \
    /var/lib/cwl/stgf7406dee-faff-40b1-9b06-303b961e77c7/star \
    --genomeLoad \
    NoSharedMemory \
    --limitBAMsortRAM \
    60000000000 \
    --limitOutSAMoneReadBytes \
    90000000 \
    --outFilterIntronMotifs \
    RemoveNoncanonical \
    --outFilterMismatchNmax \
    10 \
    --outFilterMismatchNoverLmax \
    0.1 \
    --outFilterMultimapNmax \
    10 \
    --outFilterType \
    BySJout \
    --outReadsUnmapped \
    Fastx \
    --outSAMmapqUnique \
    255 \
    --outSAMstrandField \
    intronMotif \
    --outSAMtype \
    BAM \
    SortedByCoordinate \
    --outSAMunmapped \
    Within \
    --outSAMmode \
    Full \
    --readFilesCommand \
    zcat \
    --runThreadN \
    8 \
    --seedSearchStartLmax \
    30 \
    --readFilesIn \
    /var/lib/cwl/stgdb0d3990-f97d-40cc-9f55-c06d95ba9968/joinedfiles.dat \
    /var/lib/cwl/stg89bb78b7-93ff-43f0-a1e7-214f45ec2bce/joinedfiles.dat

arriba \
    -g \
    /var/lib/cwl/stg4b736520-cb0e-4049-a556-69e167158000/gencode.v32.annotation.gtf \
    -a \
    /var/lib/cwl/stg28d39c0b-6b89-4638-b092-bbcd871a1bc7/GRCh38.primary_assembly.genome.fa \
    -b \
    /var/lib/cwl/stg82c7408b-af40-42df-971d-809ef1387304/blacklist_hg38_GRCh38_2018-11-04.tsv.gz \
    -x \
    /var/lib/cwl/stg6832edbb-f979-4f0d-9128-843c9e69cf6c/Aligned.sortedByCoord.out.bam \
    -f \
    min_support \
    many_spliced \
    -O \
    fusions.discarded.tsv \
    -o \
    fusions.tsv \
    -E \
    1 \
    -U \
    50 \
    -P \
    -P

which gave me 15 fusions, but not the ones i was looking for and i dont think disabling the filters is the right way to expand the set of fusions:

FGFR3   TACC3   +/+     +/+     4:1806934   4:1727977   splice-site     CDS     dupl$
FIP1L1  PDGFRA  +/+     +/+     4:53425965  4:54274925  splice-site     CDS     dele$
HOOK3   RET     +/+     +/+     8:42968214  10:43116584     splice-site     splice-site $
AKAP9   BRAF    +/+     -/-     7:92003235  7:140787584     splice-site     splice-site $
EWSR1   ATF1    +/+     +/+     22:29287134     12:50814280     splice-site     splice-site $
ETV6    NTRK3   +/+     -/-     12:11869969     15:87940753     splice-site     splice-site $
EML4    ALK     +/+     -/-     2:42301394  2:29223584  splice-site     5'UTR   inve$
BRD4    NUTM1   -/-     +/+     19:15254152     15:34347969     splice-site     splice-site $
GOPC    ROS1    -/-     -/-     6:117566854     6:117321394     splice-site     splice-site $
TMPRSS2 ETV1    -/-     -/-     21:41494375     7:13935838  CDS     CDS     translocatio$
CD74    ROS1    -/-     -/-     5:150404680     6:117324415     splice-site     splice-site $
GNAS    BRAF    +/+     -/-     20:58909359     7:140783038     CDS     CDS     translocatio$
SEPTIN9 BRAF    +/+     -/-     17:77499723     7:140753339     3'UTR   CDS     translocatio$
EWSR1   FLI1    +/+     +/+     22:29287134     11:128807180    splice-site     splice-site $
NTRK3   ATP2B1  -/-     -/-     15:87880322     12:89630643     CDS     CDS     translocatio$

Any idea how I can catchem all?

FULL SET OF FUSIONS:

- FGFR3-TACC3    (short reads from [2]),
-   FIP1L1-PDGFRA  (short reads from [3]),
-   GOPC-ROS1  (short reads from [4]),
-   EWS-ATF1  (short reads from [1]),
-   TMPRSS2-ETV1  (short reads from [1]),
-   EWS-FLI1  (short reads from [1]),
-   NTRK3-ETV6  (short reads from [1]),
-   CD74-ROS1  (short reads from [1]),
-   HOOK3-RET  (short reads from [1]),
-   EML4-ALK  (short reads from [1]),
-   AKAP9-BRAF  (short reads from [1]),
-   BRD4-NUT  (short reads from [1]),
-   MALT1-IGH  (short reads from [5]),
-   IGH-CRLF2  (short reads from [6]),
-   DUX4-IGH  (short reads from [7]),
-   NPM1-ALK  (short reads from [8]), and
-   CIC-DUX4  (short reads from [9]).

Thanks, -WaO

osmanwa commented 4 years ago

Also tried with --peOverlapNbasesMin 10 and --alignSplicedMateMapLminOverLmate 0.5:

    STAR \
    '--runMode alignReads' \
    --alignIntronMax \
    1000000 \
    --alignIntronMin \
    20 \
    --alignMatesGapMax \
    1000000 \
    --alignSJDBoverhangMin \
    1 \
    --alignSJoverhangMin \
    8 \
    --alignSplicedMateMapLminOverLmate \
    0.5 \
    --chimJunctionOverhangMin \
    15 \
    --chimOutType \
    WithinBAM \
    SoftClip \
    --chimSegmentMin \
    15 \
    --genomeDir \
    /var/lib/cwl/stg01a47d27-17f3-4b73-aa32-6d61d55e50d2/star \
    --genomeLoad \
    NoSharedMemory \
    --limitBAMsortRAM \
    60000000000 \
    --limitOutSAMoneReadBytes \
    90000000 \
    --outFilterIntronMotifs \
    RemoveNoncanonical \
    --outFilterMismatchNmax \
    10 \
    --outFilterMismatchNoverLmax \
    0.1 \
    --outFilterMultimapNmax \
    10 \
    --outFilterType \
    BySJout \
    --outReadsUnmapped \
    Fastx \
    --outSAMmapqUnique \
    255 \
    --outSAMstrandField \
    intronMotif \
    --outSAMtype \
    BAM \
    SortedByCoordinate \
    --outSAMunmapped \
    Within \
    --outSAMmode \
    Full \
    --readFilesCommand \
    zcat \
    --runThreadN \
    8 \
    --seedSearchStartLmax \
    30 \
    --peOverlapNbasesMin \
    10 \
    --readFilesIn \
    /var/lib/cwl/stg38911255-d123-4855-93ae-cc79b69f2eff/joinedfiles.dat \
    /var/lib/cwl/stgb9d8d177-c3e5-4b90-99ec-81f5ec2437da/joinedfiles.dat

Got same 10 reads...

suhrig commented 4 years ago

Hi @osmanwa,

There are no Master Balls in the world of fusion detection. Every tool misses a fusion every once in a while. Your best options are to choose the one that misses the fewest or combine several tools. Although Arriba may miss a few fusions in the dataset you use, overall it has really good sensitivity compared to other methods (see the final results of DREAM challenge for an independent and unbiased confirmation of Arriba's performance, specifically the entry titled uhrigs-smc-rna-fusion-challenge-sensitive-all-2017-05-12-13-15, which exhibits the highest sensitivity of all contestants).

Regarding the FusionCatcher dataset, there are several things you can do to improve the detection rate:

  1. You should use the recommended STAR parameters for detection of chimeric alignments. They are stated in the demo script. I see you are using a value of 15 for the parameters --chimSegmentMin and --chimJunctionOverhangMin, which is higher than the recommended value of 10. Other parameters are missing. Please have a look at the demo script and use the --chim* parameters given there. Using the latest version of STAR and the recommended parameters, Arriba rediscovers 12 fusions in my tests.

  2. Some fusions are missed, because they are supported predominantly by multi-mapping chimeric reads, notably the DUX4 fusions. It is a known limitation of STAR that it does not report multi-mapping chimeric reads. I submitted a pull request to the developer of STAR a while ago, which will enable STAR to report such reads, but the developer still needs to review my request and merge it into the code of STAR. Please give it some time. When I use my patched version of STAR and a (not yet publicly available) patched version of Arriba, I can rescue all but one event.

  3. Some fusions aren't fusions in the classical sense (where one gene is fused to another yielding a hybrid protein), but rather cases of enhancer hijacking (where a gene is relocated to the promoter region of another gene thus upregulating the latter, but not giving rise to a hybrid protein). There are some filters in Arriba which penalize such rearrangements and you might want to disable those if you are primarily interested in this type of rearrangement: -f end_to_end,intergenic. Generally speaking, this is not necessary, though. Arriba has pretty good detection rate of enhancer hijacking rearrangements, even if these filters are enabled.

  4. Lastly, some fusions are missed, because the FusionCatcher dataset is not representative of a real-world scenario. For example, the reads supporting the MALT1-IGH fusion were generated from genomic breakpoints (i.e., DNA). Arriba was designed for RNA, however. The transcriptomic breakpoints are likely different ones and would probably be recognized by Arriba. Also, because this fusion was not derived from a real RNA-Seq experiment, the genes involved in this synthetic fusion lack background reads and are therefore considered an artifact by Arriba. The filter no_coverage needs to be disabled to report the event. But again, this is not recommended in general practice. In another example, the DUX4-IGH fusion, the FusionCatcher dataset lacks reads which are present in the original sample. When you download the original FastQ files and process them with Arriba, the fusion is actually found. The FusionCatcher dataset was generated to test FusionCatcher. It contains reads that FusionCatcher needs to discover the fusions. Other tools, such as Arriba, may need other reads to discover the fusions. To get a reliable impression of the sensitivity, the tools must be run on the original data.

Regards, Sebastian

osmanwa commented 4 years ago

Thank you for the thorough and insightful response!

ndaniel commented 3 years ago

Hi @suhrig

There are no Master Balls in the world of fusion detection. Every tool misses a fusion every once in a while. Your best options are to choose the one that misses the fewest or combine several tools.

That is non factual. Out there are several types of fusion genes and some fusion callers are good at finding some types of fusions but not other. For example, IGH fusions can be called with high specificity and sensistivity only and only by CICERO and FusionCatcher.

  1. Lastly, some fusions are missed, because the FusionCatcher dataset is not representative of a real-world scenario.

Actually, this is not true. That dataset contains real reads from several types of real fusions genes (except of one fusion gene that is MALAT1-IGH), which are: 1) fusion genes where fusion junction is formed by exon-exon, 2) fusion genes where fusion junction is in the intron of a gene, 3) fusions where one of the genes involved in fusion has lots of pseudogenes (here are the multimappers), 4) fusions which involve IGH/IGK/IGL loci (loci is not the same as gene).

Most of the fusion callers out there handle only type 1. Few of fusion callers can handle types 1 and 2. Even fewer fusion callers can handle types 1, 2 and 3 (eg. STAR-Fusion). But FusionCatcher and CICERO are the only ones that can handle all four types of fusions.

For example, the reads supporting the MALT1-IGH fusion were generated from genomic breakpoints (i.e., DNA). Arriba was designed for RNA, however.

Generating a fusion from DNA should make it easier to detect it and not more difficult (because splicing makes the things more challenging). For example, the fusion IGH-CRLF2 has reads from a real RNA-seq experiment and still Aribba does not find it.

It looks looking above that Arriba is missing majority of IGH fusions. This is not a surprise because IGH fusions are a different category of fusion genes and the only fusion callers developed to handle them are CICERO and FusionCatcher. One of the characteristics of why IGH fusions are so challenging to detect are as following:

In another example, the DUX4-IGH fusion, the FusionCatcher dataset lacks reads which are present in the original sample. When you download the original FastQ files and process them with Arriba, the fusion is actually found. The FusionCatcher dataset was generated to test FusionCatcher. It contains reads that FusionCatcher needs to discover the fusions. Other tools, such as Arriba, may need other reads to discover the fusions.

That means that FusionCatcher is more sensitive than Arriba in discovering DUX4-IGH. FusionCatcher needs a lot less reads than Arriba to detect DUX4-IGH fusion. Right?

The FusionCatcher dataset was generated to test FusionCatcher. It contains reads that FusionCatcher needs to discover the fusions.

That is not true! That test is manually constructed and it contains enough reads (but not to many) to detect a given fusion.

suhrig commented 3 years ago

That is non factual.

It is factual. No tool discovers all fusions. Some are better than others, but every tool misses fusions.

IGH fusions can be called with high specificity and sensistivity only and only by CICERO and FusionCatcher.

Arriba and STAR-Fusion (among others) can detect IGH rearrangements with good accuracy, too, and according to our benchmarks even better than FusionCatcher. See Figure 2B from our paper.

  1. fusion genes where fusion junction is formed by exon-exon,
  2. fusion genes where fusion junction is in the intron of a gene,
  3. fusions where one of the genes involved in fusion has lots of pseudogenes (here are the multimappers),
  4. fusions which involve IGH/IGK/IGL loci (loci is not the same as gene).

Most of the fusion callers out there handle only type 1. But FusionCatcher and CICERO are the only ones that can handle all four types of fusions.

Arriba also handles types 1-4 (even better since version 2.0.0), and I would like to add

  1. partial tandem duplications
  2. internal tandem duplications
  3. viral integration sites

Generating a fusion from DNA should make it easier to detect it and not more difficult (because splicing makes the things more challenging).

True, splicing makes alignment more difficult. But what I meant is: Having breakpoints at splice sites is an indication that a fusion candidate is not a library preparation artifact and therefore it is helpful in calling fusions from RNA-Seq.

the fusion IGH-CRLF2 has reads from a real RNA-seq experiment and still Aribba does not find it.

Arriba does find the fusion if you take the original RNA-Seq sample instead of the artificially crafted FusionCatcher dataset, supporting the notion that the FusionCatcher dataset is not reflective of real-world performance.

FusionCatcher needs a lot less reads than Arriba to detect DUX4-IGH fusion. Right?

No, it means FusionCatcher needs other reads than Arriba and the ones that were included in the FusionCatcher dataset are those that FusionCatcher needs and not those that Arriba needs. I could craft an "Arriba dataset" that works well with Arriba or a "CICERO dataset" that works well with CICERO. Other tools would almost certainly perform worse with these datasets, because they search for other reads than the ones picked up by Arriba or CICERO.

That test is manually constructed and it contains enough reads (but not to many) to detect a given fusion.

The fact that the dataset is manually constructed is precisely the reason why it is biased towards FusionCatcher. The file only contains reads that were found to map to fusion genes. There could be a whole bunch of unmapped reads that also support the fusion, but that are not part of the dataset, because the designer of the dataset was not even aware of them, since they were not mapped at all. It could very well be that other tools pick up on reads that remained unmapped by whatever method was used to create the dataset. These tools would thus have a disadvantage.

If you want a realistic measurement of performance you should:

  1. run the tools on many (unmodified!) samples and see which one picks up the most fusions
  2. (optionally, for even more fine-grained measurements) perform a down-sampling experiment where the tools are run on all of these samples again, but with fewer and fewer (randomly!) selected reads

This approach will give you the most informative results. You should not select reads with any type of bias. And you should not use just a single sample for benchmarking, because sample sizes of n=1 are rarely representative. The FusionCatcher dataset is afflicted with both of these shortcomings and is therefore not reflective of real-world performance. In contrast, we applied method 1 to TMPRSS2-ERG fusions in prostate cancer, IGH rearrangements in DLBCL, and driver fusions in pancreatic cancer, and all of the tests concluded with Arriba having favorable sensitivity compared to other tools (see our paper).

ndaniel commented 3 years ago

@suhrig

Sorry, but the latest version of Arriba v2.1.0 is missing lots and lots of IGH fusions! FusionCatcher and CICERO perform much better on IGH fusions.

For example, Aribba is mising IGH-CRLF2 fusion in MUTZ5 cancer cell line! Here is the original raw RNA-seq data for MUTZ5 cell line from CCLE database https://portals.broadinstitute.org/ccle :

and please free to run Arriba on it and see if Arriba finds IGH-CRLF2 fusions or not! In my hands it does not find it! This raw RNA-seq dataset is not made up or crafted or biased for/towards FusionCatcher or CICERO. FusionCatcher and CICERO are able to find the IGH-CRLF2 fusion in this RNA-seq dataset!

This is just one example. It looks to me that Arriba misses all IGH fusions that are like IGH-CLRF2 (please @suhrig, feel to explore and see what type of IGH fusion I am talking here ;-) ). There are several sub-types of IGH fusions and it looks to me that Arriba finds the ones that are the simplest, which are quite rare actually.

Regarding Figure 2B, there is no ground truth for IGH fusions in TCGA-DLBC and therefore a raw number of IGH fusions does not say anything! Maybe some fusion callers are finding IGH fusions that are false-positive (and the fusion caller with the highest false-positive rate wins)? How many IGH fusions are for real in TCGA-DLBC? Nobody knows for now.

In MUTZ5 cancer cell line one has the ground truth, which is that it harbors IGH-CRLF2 fusion, as shown here:

On this one can do serious comparison. See: https://github.com/suhrig/arriba/issues/106 .

If you want a realistic measurement of performance you should:

  1. run the tools on many (unmodified!) samples and see which one picks up the most fusions

This the wrong way of measuring performance. This just measures sensitivity and no specificity. This is what has been done in Figure 2B from the paper. Definitively one should not do this!

So here it is an updated list of types of fusions: 1) fusion genes where fusion junction is formed by exon-exon (these are easy to find), 2) fusion genes where fusion junction is in the intron of a gene, 3) fusions where one of the genes involved in fusion has lots of pseudogenes (here are the multimappers, like CIC-DUX4), 4) fusions which involve IGH/IGK/IGL loci (loci is not the same as gene, here fusion junction in IGH locus even that is interegenic is acceptable for IGH fusion), 5) fusions of human gene with viral gene.

suhrig commented 3 years ago

For example, Aribba is mising IGH-CRLF2 fusion in MUTZ5 cancer cell line!

When I run Arriba with the instructions given in the user manual the fusion is found (hg19 breakpoints 14:106329465 and X:1347560). Something must be different in your environment.

On this one can do serious comparison.

A sample size of n=1 is not a serious comparison.

This is just one example. It looks to me that Arriba misses all IGH fusions that are like IGH-CLRF2

If you can share these samples with me, I will gladly have a look at them and see if I can improve the detection rate of this particular rearrangement. Without additional data, there is not much I can do.

This the wrong way of measuring performance. This just measures sensitivity and no specificity.

You wrote that "Arriba v2.1.0 is missing lots and lots of IGH fusions". This is a question of sensitivity, and the data presented in Figure 2B is precisely the kind of analysis that answers this question. I agree that a holistic benchmark must also look at specificity, but the claim that Arriba misses fusions which FusionCatcher detects is clearly not supported by Figure 2B. Arriba finds all IGH fusions that FusionCatcher finds, plus some more. Due to lack of ground truth, it is not proven that all of these fusion calls are correct, but it is quite likely that they are, because these rearrangements are common in DLBCL. Moreover, Arriba does not make such calls in any of the other 800+ samples that we discuss in the paper, whereas FusionCatcher reports several pancreatic tumors to harbor IGH-BCL translocations, which tells you something about the specificity of the tools.

So here it is an updated list of types of fusions:

  1. fusion genes where fusion junction is formed by exon-exon (these are easy to find),
  2. fusion genes where fusion junction is in the intron of a gene,
  3. fusions where one of the genes involved in fusion has lots of pseudogenes (here are the multimappers, like CIC-DUX4),
  4. fusions which involve IGH/IGK/IGL loci (loci is not the same as gene, here fusion junction in IGH locus even that is interegenic is acceptable for IGH fusion),
  5. fusions of human gene with viral gene.

Arriba detects all of these, plus partial tandem duplications, circular RNAs, internal tandem duplications and fusions with intergenic breakpoints. So this is more of a selling point for Arriba. Sure, there are some IGH rearrangements that Arriba misses, just like FusionCatcher misses some that Arriba finds. This is exactly what I meant when I wrote "every tool misses a fusion every once in a while" and that it's good to "combine several tools".

This issue was originally about the FusionCatcher dataset and not IGH rearrangements specifically. I have made it clear why I believe the FusionCatcher dataset is not ideal for benchmarking. Let's continue the discussion about IGH rearrangements in the dedicated issue you created.