pmelsted / pizzly

Fast fusion detection using kallisto
BSD 2-Clause "Simplified" License
80 stars 10 forks

pizzly missing a fusion from SRR1659964 #19

Open roryk opened 6 years ago

roryk commented 6 years ago

Hi Pall,

Thanks for making pizzly. I've been working on adding support for it and some other fast fusion callers to bcbio, and have been seeing if I can figure out some way of filtering the fusions to get a cleaner set. Just running the defaults, pizzly misses the TMPRSS2:ETV1 fusion in the SRR1659964 dataset from the preprint. It shows up in the unfiltered JSON file with some support, but not in the filtered one. Do you have any suggestions for settings that would get this fusion identified? How are the transcripts filtered? Thanks! I have run pizzly on the entire fusion dilution series in that dataset if that would be helpful for you all.

        fusion      tpm known
         <chr>    <dbl> <lgl>
1     EML4:ALK 1.762218  TRUE
2   ETV6:NTRK3 5.117153  TRUE
3   BRD4:NUTM1 1.345507  TRUE
4    HOOK3:RET 3.747249  TRUE
5   EWSR1:ATF1 2.092241  TRUE
6 TMPRSS2:ETV1 0.000000  TRUE
7   AKAP9:BRAF 3.046220  TRUE
8    CD74:ROS1 5.370045  TRUE
9   EWSR1:FLI1 1.452139  TRUE
roryk commented 6 years ago

I've been working a little bit on filtering the calls from pizzly -- if you use the blacklist genes from FusionMap (http://www.omicsoft.com/downloads/fusion/filtering/GeneList_v1.txt) to filter the calls, you can drop maybe 10% of them. I created a blacklist of overlapping genes here (https://github.com/roryk/fusion-gene-blacklist); it doesn't filter any of the calls on this dataset, but is probably useful to check against anyway. The family filter from FusionMap (http://www.omicsoft.com/downloads/fusion/filtering/GeneFamilyList_v1.txt) doesn't do much on this dataset; it only filters out a couple of the pairs.
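A minimal sketch of this kind of blacklist filtering, assuming pizzly's JSON has a top-level `genes` array whose entries carry `geneA`/`geneB` objects with a `name` field, and that the FusionMap list is one gene symbol per line (both assumptions worth checking against the actual files):

```python
import json

def load_blacklist(path):
    """Read one gene symbol per line (assumed format of GeneList_v1.txt;
    adjust if the file has a header or extra columns)."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}

def filter_calls(calls, blacklist):
    """Drop any fusion call where either partner gene is blacklisted."""
    return [call for call in calls
            if call["geneA"]["name"] not in blacklist
            and call["geneB"]["name"] not in blacklist]

# usage sketch:
# with open("fusions.json") as handle:
#     calls = json.load(handle)["genes"]
# kept = filter_calls(calls, load_blacklist("GeneList_v1.txt"))
```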

roryk commented 6 years ago

pizzly works great. I ran the entire dilution series for the spike-in dataset, barring one file I didn't download, and put together some plots demonstrating that pizzly does a great job even at low dilutions. I also looked at filtering the fusion calls -- it turns out > 99% of the fusions have support from either a paired read or a split read, but not both. If you filter fusions and require at least one read with each type of evidence, this removes thousands of fusion calls but retains the spike-in calls. Filtering by the FusionMap blacklists drops more genes. It looks like there is a set of fusions that are called repeatedly and are likely false: they are called between genes in the same family and in genes known to be scattered all over the genome.
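The both-evidence-types filter could look something like this sketch (the `paircount`/`splitcount` field names are assumptions about pizzly's JSON output; verify against your files):

```python
def has_both_evidence(call):
    """True if a call has at least one spanning read pair AND at least one
    split read (field names assumed from pizzly's JSON output)."""
    return call.get("paircount", 0) > 0 and call.get("splitcount", 0) > 0

def filter_by_evidence(calls):
    """Keep only calls supported by both evidence types."""
    return [call for call in calls if has_both_evidence(call)]
```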

Below is a plot of the known fusions vs the novel fusions for the dilution dataset, post filtering from the report.

[image: plot of known vs. novel fusion calls across the dilution series]

This is a table of how many times we saw each fusion call post-filtering, so "total" is the number of samples from the dilution series in which each fusion was called. I got into it in the report a little bit, but the NPEPPS:TBC1D3 family is likely to be false positives; TBC1D3 is scattered all over the genome and may end up overlapping NPEPPS without being annotated that way. It could be a legitimate fusion call as well, though. The CSAG* fusions should likely be blacklisted for being in the same family.

fusion known total
AKAP9:BRAF TRUE 19
NPEPPS:TBC1D3 FALSE 19
NPEPPS:TBC1D3C FALSE 19
NPEPPS:TBC1D3D FALSE 19
NPEPPS:TBC1D3E FALSE 19
NPEPPS:TBC1D3G FALSE 19
NPEPPS:TBC1D3H FALSE 19
NPEPPS:TBC1D3I FALSE 19
NPEPPS:TBC1D3K FALSE 19
NPEPPS:TBC1D3L FALSE 19
EWSR1:ATF1 TRUE 18
EWSR1:FLI1 TRUE 18
BRD4:NUTM1 TRUE 17
CSAG2:CSAG1 FALSE 15
CSAG3:CSAG1 FALSE 15
ETV6:NTRK3 TRUE 15
HOOK3:RET TRUE 15
CD74:ROS1 TRUE 14
CTBS:GNG5 FALSE 11
MTG1:SCART1 FALSE 9
AL360181.3:SCART1 FALSE 8
TMPRSS2:ETV1 TRUE 7
CD74:AL132671.2 FALSE 6
EML4:ALK TRUE 6
ZNF600:ZNF578 FALSE 5

https://www.dropbox.com/s/q3b9getlg7nirqe/pizzly-dilution-series.tar.gz?dl=0 is a link to the report-- it has everything needed to reproduce the results starting from the filtered pizzly calls.

ndaniel commented 6 years ago

@roryk

I would be very careful about filtering out fusion genes based on the blacklist of overlapping genes from https://github.com/roryk/fusion-gene-blacklist . That blacklist states, for example, that the genes DHX8 and ETV4 overlap, but looking at the scientific literature, those genes are actually known to form the known fusion gene DHX8-ETV4 (see: http://atlasgeneticsoncology.org/Genes/GC_ETV4.html and http://journals.plos.org/plosone/article/file?type=supplementary&id=info:doi/10.1371/journal.pone.0051203.s004 and so on).

Most probably these are largely errors in gene annotation databases, which have a tendency to annotate known fusion genes as overlapping genes. For example, until release 80 (if I remember right), Ensembl listed the genes CD74, ROS1, and GOPC as overlapping, even though it is very well established in the medical literature that the fusion genes CD74-ROS1 and GOPC-ROS1 occur only in tumors (even today, if I remember right, one of these fusions has its partner genes annotated as overlapping in GENCODE).

Regarding the blacklist genes from FusionMap (http://www.omicsoft.com/downloads/fusion/filtering/GeneList_v1.txt), I would not use that list for filtering the calls, because it contains genes which are very well known to form fusion genes. For example, it contains hundreds of genes whose names start with IGH or IGK (such as IGHEP1, IGHEP2, IGHJ1P, IGHJ2P, IGHJ3P, IGHV1-12, IGHV1-14, IGHV1-17, IGHV1-67, etc.), and all of those genes are very well known to form fusion genes and translocations. In the scientific literature, any gene whose symbol starts with IGH is always referred to simply as IGH; for example, one sees fusions like IGH-DUX4, IGH-FGFR3, etc. For more info, see: https://en.wikipedia.org/wiki/IGH@ and http://www.bloodjournal.org/cgi/pmidlookup?view=long&pmid=12805059 and http://atlasgeneticsoncology.org/Genes/GC_IGH.html

roryk commented 6 years ago

Thanks @ndaniel, that is super useful information. It sounds like flagging the fusions as possibly due to overlapping annotations etc. is a better strategy than outright blacklisting them. I am making a pizzly-specific list by simulating reads from every transcript in GRCh38.89 and calling fusions with pizzly, to see where it makes calls based on the annotation alone, so we could flag fusion calls with that information as well.
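A flagging pass, as opposed to a hard filter, might look like this sketch (the `flags` field and the set of overlapping pairs are hypothetical, not part of pizzly's output):

```python
def flag_calls(calls, overlap_pairs):
    """Annotate rather than drop: attach a (hypothetical) `flags` list to
    each call, marking gene pairs that appear in a set of
    known-overlapping annotations given as frozensets of gene names."""
    for call in calls:
        pair = frozenset([call["geneA"]["name"], call["geneB"]["name"]])
        call["flags"] = ["overlapping_annotation"] if pair in overlap_pairs else []
    return calls
```

Downstream consumers can then decide whether to drop, down-weight, or merely display the flagged calls.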

ndaniel commented 6 years ago

@roryk

It sounds like flagging the fusions as possibly due to overlapping annotations etc. is a better strategy than outright blacklisting them.

I think that flagging is much better.

I am making a pizzly-specific list by simulating reads from every transcript in GRCh38.89 and calling fusions with pizzly, to see where it makes calls based on the annotation alone, so we could flag fusion calls with that information as well.

I guess you mean simulating reads from every transcript and calling fusions with pizzly on the simulated reads. If yes, then this is just another way of computing the similarity between genes at the RNA level (for a given input read length, e.g. 100bp).

roryk commented 6 years ago

Thanks so much for the help @ndaniel.

schelhorn commented 6 years ago

My background in this matter is rather shallow, but using starfusion we (also) often see artifactual fusion calls involving human paralogues, which seem to originate from ambiguous read mappings due to very high local sequence similarity and/or an incomplete genomic reference. As a consequence, the paralogues and (optionally) regions adjacent to them may be falsely implicated in gene fusions.

I guess excluding genomic repeat regions and filtering regions involving (or adjacent to) human self-chained alignments may help here, wouldn't it? Pre-prepared BED files for both kinds of regions are provided by bcbio in the genome-specific problem_regions folders, for instance.
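A minimal interval-overlap check against such a BED-derived region set might look like this (pure-Python sketch; in practice a tool like bedtools would do this faster, and breakpoint coordinates would come from the caller's output):

```python
def in_problem_region(chrom, start, end, regions):
    """regions: dict mapping chromosome -> list of (start, end) intervals
    taken from a BED file (0-based, half-open). True if [start, end)
    overlaps any interval on that chromosome."""
    return any(s < end and start < e for s, e in regions.get(chrom, []))
```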

Further following up on the filtering aspect, would you all and @pmelsted think it desirable to present the pizzly calls in their functional context, for instance by generating an output format compatible with Oncofuse and then using that for postprocessing (since it is already included in bcbio)?

ndaniel commented 6 years ago

@schelhorn

... using starfusion we (also) often see artifactual fusion calls involving human paralogues, which seem to originate from ambiguous read mappings due to very high local sequence similarity and/or an incomplete genomic reference. As a consequence, the paralogues and (optionally) regions adjacent to them may be falsely implicated in gene fusions.

This is true, BUT paralogues (or genes with high sequence similarity, like for example pseudogenes) are also an obstacle to finding fusion genes, as shown here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0099439 . Many fusion finders will miss these kinds of fusion genes (e.g. JAFFA and STAR-Fusion never find IGH-DUX4 fusions, and pizzly never finds IGH-DUX4 fusions either, as shown here: https://github.com/pmelsted/pizzly/issues/2 ).

I guess excluding genomic repeat regions and filtering regions involving (or adjacent to) human self-chained alignments may help here, wouldn't it?

I am not so sure about that, because there are fusion genes whose partner genes contain very repetitive regions (like for example DUX4, CIC, etc.), and if one removed those repetitive regions then one would no longer be able to detect those fusions. See again: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0099439

Further following up on the filtering aspect, would you guys and @pmelsted think it desirable to present the pizzly calls in their functional context, for instance by generating an output format compatible with Oncofuse and then use that as postprocessing (since it already is included in bcbio)?

I think it is not such a good idea to use a format based on Oncofuse, because it would impede finding novel fusion genes. Oncofuse uses knowledge extracted from currently known fusion genes and currently known oncogenes, and therefore it will best predict the fusions which already match, fully or partially, the current knowledge of fusion genes and oncogenes. Any novel fusion gene which does not match that knowledge will most likely be predicted wrongly by Oncofuse, for the simple reason that it has never seen anything like it. There are many fusion genes out there still waiting to be discovered.

schelhorn commented 6 years ago

Both IGH and DUX4 are very problematic genes, to say the least. From the point of view of short-read sequencing they are just horrible, and drawing inferences from them in clinical timelines is very hard. In my view it would make sense to always exclude predicted fusions similar to these monsters in terms of repeats/self-chaining; you lose sensitivity, it's true, but the game is in specificity anyway. If we had to follow up in the wet lab all predicted fusions that are as noisy as IGH-DUX4, we could stop doing anything else, I guess ;)

But I respect of course that people have different goals, so perhaps a permissive as well as a restrictive mode makes sense.

ndaniel commented 6 years ago

@schelhorn

Both IGH and DUX4 are very problematic genes, to say the least.

These genes are very well known and have been used as biomarkers in the clinic for many years. They are not as problematic as they may look at first glance.

From the point of view of short-read sequencing they are just horrible

That is not correct. There are fusion finders out there which are able to find fusions like IGH-DUX4 just fine, for example deFuse, CICERO, FusionCatcher, etc.

and drawing inferences from them in clinical timelines is very hard.

Actually this is not true. See this.

In my view it would make sense to always exclude predicted fusions similar to these monsters in terms of repeats/self-chaining; you lose sensitivity, it's true, but the game is in specificity anyway.

As shown above, there are very good fusion callers out there which are able to find fusion genes like IGH-DUX4 while having the best, or at least very good, specificity.

schelhorn commented 6 years ago

Alright, as I said, my understanding of the matter is quite shallow to begin with. If there is a good way to identify new fusion gene pairs that have characteristics similar to these two without sacrificing much specificity, then I'm all for it.

I'm just worried that concentrating on the most difficult validated fusion gene pairs may either overfit whatever filtering procedure is employed, or lead to false positives if the filters are loosened.

My remark concerning clinical timelines was referring not to this specific pair but to a hypothetical long list of repeat-rich/paralogous gene pairs that are called given loose filtering criteria and that will likely be hard to interpret or feel confident about. I'd like to avoid this by any means convenient.

ndaniel commented 6 years ago

@schelhorn

I'm just worried that concentrating on the most difficult validated fusion gene pairs

There is no such thing as a difficult fusion gene; there are just tools/programs which are not able to find certain fusion genes.

... may either overfit whatever filtering procedure is employed, or lead to false positives if the filters are loosened.

That depends mainly on the authors of the fusion caller, who tune (or do not tune) it to strike a balance between specificity and sensitivity.

... long list of repeat-rich/paralogous gene pairs that are called given loose filtering criteria and that will likely be hard to interpret or feel confident about.

Those should not be hard to interpret. Any lab technician who knows how to perform a PCR experiment in the wet lab is able to figure this out (see: how to design PCR primers for a target gene when that gene has pseudogenes or paralogous genes with some sequence similarity). In my opinion, the sequence similarity between different genes should be computed by the fusion caller for each input RNA-seq dataset, because when looking for fusion genes the sequence similarity depends on the read length of the input FASTQ files. Some genes might look highly similar when the input reads are 50bp long but not so similar when the reads are 150bp long (in the extreme case, all genes would look highly similar if the reads were 1bp long, and very dissimilar if the reads were 10,000bp long). Again, this is not difficult!
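The read-length dependence can be illustrated with a toy k-mer-sharing measure, where k plays the role of the read length (a sketch for intuition, not any caller's actual method):

```python
def kmers(seq, k):
    """All k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def confusability(seq_a, seq_b, read_len):
    """Fraction of read-length-sized substrings of seq_a that also occur
    in seq_b -- a crude proxy for how easily reads of that length from
    seq_a could be attributed to seq_b. Shrinks as read_len grows."""
    ka = kmers(seq_a, read_len)
    return len(ka & kmers(seq_b, read_len)) / len(ka) if ka else 0.0
```

At read_len=1 any two sequences over the same alphabet look identical; at larger read lengths the measure separates them, which is exactly the effect described above.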

arthuryxt commented 5 years ago

@ndaniel Great comments!

The desired balance between FPR and FNR varies between tools. I would go for higher sensitivity, as one can nearly always prove a candidate false in the wet lab, but can do nothing about false negatives.

Do you think it would be reasonable to compile a list of wet-lab-validated fusion gene pairs and directly test the compatibility of RNA-seq reads with these "mini-chromosomes"? And then go on with the de novo prediction mode?

ndaniel commented 5 years ago

Do you think it would be reasonable to compile a list of wet-lab-validated fusion gene pairs and directly test the compatibility of RNA-seq reads with these "mini-chromosomes"?

I think that is a supervised approach, and of course it would work better than the unsupervised approach used by (all?) current gene fusion finders (but the two methods answer different questions). As far as I know, there is no tool that does supervised finding of fusion genes with a very large set of a priori known fusion genes.
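Such a supervised step might start from junction "mini-chromosomes" built from validated pairs, along these lines (the gene names, sequences, and `flank` parameter are all hypothetical inputs; real junctions would come from curated breakpoints, not transcript ends):

```python
def junction_contigs(known_fusions, transcripts, flank=500):
    """Build synthetic junction sequences ('mini-chromosomes') for a list
    of validated (5' gene, 3' gene) pairs: the last `flank` bases of the
    5' partner joined to the first `flank` bases of the 3' partner. Reads
    crossing the junction could then be aligned against these contigs."""
    contigs = {}
    for gene5, gene3 in known_fusions:
        seq = transcripts[gene5][-flank:] + transcripts[gene3][:flank]
        contigs["%s-%s" % (gene5, gene3)] = seq
    return contigs
```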