pmelsted / pizzly

Fast fusion detection using kallisto
BSD 2-Clause "Simplified" License

False positive fusions and missed fusions #2

Closed ndaniel closed 7 years ago

ndaniel commented 7 years ago

Hi!

When running Pizzly on this RNA-seq data set (using Ensembl annotation release 81), which contains only the following 17 biologically real fusion genes:

Pizzly finds the following fusion genes (see: test.json.gz):

Therefore, Pizzly reports the following false fusion genes, even though they do not exist in the RNA-seq data set above:

Pizzly is also missing the following fusion genes from the RNA-seq data set above:

Is this correct?

pmelsted commented 7 years ago

Thanks for pointing this out. I'll take a look at this soon

ndaniel commented 7 years ago

I have run Pizzly on several other RNA-seq data sets and it finds a lot of false positives. What are the expected sensitivity and specificity of Pizzly?
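For readers who want to put numbers on this question, here is a minimal sketch of how a call set could be scored against a known truth set. Everything in it is an assumption for illustration: the function names are hypothetical, the gene names are placeholders rather than the fusions from this data set, and fusions are compared without regard to partner order. It reports precision rather than specificity, because the number of true negatives is not well defined for fusion calling.

```python
def normalize(pair):
    """Treat (A, B) and (B, A) as the same fusion (an assumption of this sketch)."""
    return frozenset(gene.upper() for gene in pair)


def score_calls(truth, calls):
    """Compare predicted fusions against a truth set of gene pairs."""
    truth_set = {normalize(p) for p in truth}
    call_set = {normalize(p) for p in calls}
    true_pos = truth_set & call_set
    false_pos = call_set - truth_set
    false_neg = truth_set - call_set
    return {
        "tp": len(true_pos),
        "fp": len(false_pos),
        "fn": len(false_neg),
        # Fraction of real fusions that were recovered.
        "sensitivity": len(true_pos) / len(truth_set) if truth_set else float("nan"),
        # Fraction of reported fusions that are real.
        "precision": len(true_pos) / len(call_set) if call_set else float("nan"),
    }


# Placeholder gene names, not real data.
truth = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"), ("GENE_E", "GENE_F")]
calls = [("GENE_B", "GENE_A"), ("GENE_C", "GENE_D"), ("GENE_G", "GENE_H")]
result = score_calls(truth, calls)
print(result)
```

If fusion orientation matters for the comparison, `normalize` could return an ordered tuple instead of a `frozenset`.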

lakigigar commented 7 years ago

Those are not false positives. We are reporting an unfiltered set of fusions for users who want to filter by their own criteria. We have our own set of basic filters and will be providing filtered predictions as well; along with those will be the manuscript of the paper that will report our accuracy on simulations and biological datasets. I expect that the manuscript will be posted in a week or two; we have already run many benchmarks and are just double-checking our results.

ndaniel commented 7 years ago

OK. I am confused about describing unfiltered fusions as not being false positives. Is there any published article that supports this hypothesis? What about the missed fusions?

ndaniel commented 7 years ago

Here is an idea for checking Pizzly's sensitivity and specificity on real biological fusion genes that have been validated in the wet lab (and not on fake/junk chimeric RNAs):

Pick a fusion sequence of your choice from NCBI database:

and simulate reads from it (using, for example, https://github.com/lh3/wgsim ).
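The idea can be illustrated with a small, self-contained sketch. It is not wgsim and does not reproduce wgsim's model: it samples error-free, single-end reads uniformly from a fusion sequence and records which reads cross the junction. The function names, breakpoints, and toy sequences are all hypothetical.

```python
import random


def make_fusion(seq5, seq3, break5, break3):
    """Join the first break5 bases of the 5' partner to seq3 from break3 onward.

    Returns the fusion sequence and the junction position within it.
    """
    return seq5[:break5] + seq3[break3:], break5


def simulate_reads(fusion, junction, read_len, n_reads, seed=0):
    """Sample error-free reads uniformly; flag reads that span the junction."""
    if read_len > len(fusion):
        raise ValueError("read length exceeds fusion sequence length")
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(fusion) - read_len + 1)
        # A read spans the junction if it holds bases from both partners.
        spans = start < junction < start + read_len
        reads.append((fusion[start:start + read_len], spans))
    return reads


# Toy sequences standing in for two transcripts (not real data).
fusion_seq, junction_pos = make_fusion("A" * 50, "C" * 50, 30, 20)
sim_reads = simulate_reads(fusion_seq, junction_pos, read_len=20, n_reads=100, seed=1)
n_spanning = sum(1 for _, spans in sim_reads if spans)
print(f"{n_spanning} of {len(sim_reads)} reads span the junction")
```

A real benchmark would use a read simulator such as wgsim to obtain paired-end reads with sequencing errors, and would mix the fusion reads into a background of normal transcripts.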

lakigigar commented 7 years ago

I am not sure what you mean by "hypothesis".

The goal of (RNA-Seq based) fusion finding programs is to produce for users a set of fusion genes that are supported by the data. With kallisto pseudoalignment followed by pizzly we have a very fast procedure for producing candidate fusions, but these candidate fusions have not yet been filtered with respect to a variety of standard ad-hoc criteria, e.g. coverage requirements (requiring at least a certain number of reads to support the fusion). We can impose such filters ourselves (and have), but we believe the list of candidate fusions is useful in its own right, as some users may wish not only to apply coverage filters as we do (that is easy enough to allow via a parameter) but also to filter by criteria that are unique to their application. The lists you've looked at are our candidate fusions, not yet filtered by us for standard end users. As I've said, you'll be able to look at those lists together with our sensitivity/specificity results shortly. Regarding your question about false negatives, if a fusion is not on our candidate list then it will also be missing from our filtered list. We believe that at this point we have a pretty good idea of why and when we may have false negatives. Thanks for your patience.
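As an illustration of the kind of coverage filter described above, here is a minimal sketch that keeps only candidates supported by a minimum number of reads. It is not pizzly's own filter. The record layout (a top-level `genes` list with `paircount` and `splitcount` per candidate) is an assumption about the unfiltered JSON, the threshold is arbitrary, and the gene names are placeholders.

```python
import json


def filter_candidates(candidates, min_support=2):
    """Keep candidates whose supporting read count reaches min_support.

    Support is taken as paircount + splitcount; both field names are
    assumptions about the candidate JSON, not a documented contract.
    """
    kept = []
    for cand in candidates:
        support = cand.get("paircount", 0) + cand.get("splitcount", 0)
        if support >= min_support:
            kept.append(cand)
    return kept


# Inline stand-in for an unfiltered candidate file (placeholder gene names).
raw = """
{"genes": [
  {"geneA": {"name": "GENE_A"}, "geneB": {"name": "GENE_B"}, "paircount": 5, "splitcount": 3},
  {"geneA": {"name": "GENE_C"}, "geneB": {"name": "GENE_D"}, "paircount": 1, "splitcount": 0},
  {"geneA": {"name": "GENE_E"}, "geneB": {"name": "GENE_F"}, "paircount": 0, "splitcount": 2}
]}
"""
candidates = json.loads(raw)["genes"]
filtered = filter_candidates(candidates, min_support=2)
kept_names = [(c["geneA"]["name"], c["geneB"]["name"]) for c in filtered]
print(kept_names)
```

Application-specific criteria, such as excluding neighboring genes that may reflect readthrough transcripts, could be added as further predicates in the same loop.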

ndaniel commented 7 years ago

Fusion genes are very rare events in tumors, except for the TMPRSS2-ERG fusion in prostate tumors. Finding fusion genes in 3% of some type of cancer can be considered a very high number. That means that, in real life, one rarely finds fusions in samples from real tumor patients. My point is that fusion genes are not like SNPs, of which one finds tens of thousands in a sample and which therefore need to be post-filtered. This is why I think that finding fusion genes is very easy and the difficult part is being specific.

lakigigar commented 7 years ago

Exactly.

ndaniel commented 7 years ago

Just to make sure that I understand: those post-filters are not yet part of Pizzly (as it is on GitHub today). Will those post-filters be part of Pizzly, or will they be a separate/independent thing (e.g. Zizzly)?

pmelsted commented 7 years ago

The post filters will be integrated into pizzly directly (coming soon) and pizzly will output a filtered and an unfiltered call set.

ndaniel commented 7 years ago

Here is a very good, just-published article about the definition of fusion genes. According to it, fusion genes do not exist in 99% of samples from healthy people, and apparent fusions there are actually errors in gene annotations. The starting assumption for any fusion gene finder is therefore that if it finds fusion genes in healthy samples, those are false positive fusions:

It Is Imperative to Establish a Pellucid Definition of Chimeric RNA and to Clear Up a Lot of Confusion in the Relevant Research

http://www.mdpi.com/1422-0067/18/4/714/htm

There have been tens of thousands of RNAs deposited in different databases that contain sequences of two genes and are coined chimeric RNAs, or chimeras. However, “chimeric RNA” has never been lucidly defined, partly because “gene” itself is still ill-defined and because the means of production for many RNAs is unclear. Since the number of putative chimeras is soaring, it is imperative to establish a pellucid definition for it, in order to differentiate chimeras from regular RNAs. Otherwise, not only will chimeric RNA studies be misled but also characterization of fusion genes and unannotated genes will be hindered. We propose that only those RNAs that are formed by joining two RNA transcripts together without a fusion gene as a genomic basis should be regarded as authentic chimeras, whereas those RNAs transcribed as, and cis-spliced from, single transcripts should not be deemed as chimeras. Many RNAs containing sequences of two neighboring genes may be transcribed via a readthrough mechanism, and thus are actually RNAs of unannotated genes or RNA variants of known genes, but not chimeras. In today’s chimeric RNA research, there are still several key flaws, technical constraints and understudied tasks, which are also described in this perspective essay.

Some quotes from the article:

In very rare situations, fusion genes can occur in normal human individuals as well, as exemplified by the TFG-GPR128 [28], POTE-actin [29,30,31], and PIPSL [32,33] genes. However, this type of fusion gene may actually be regarded as evolutionarily new genes, but not fusion ones [3]. ...

  1. Most RNAs from Two Neighboring Genes Should Not Be Deemed as Chimeras .....
  2. RT or PCR Creates Many Artifacts that Fabricate “Trans-Splicing” .....
  3. There Are Other Artifacts with Unknown Mechanisms .......... Currently, cDNA Protection Assay Is the Best Approach for Verification of Chimeric RNAs ...
  4. There Are Other Artifacts with Unknown Mechanisms ... . Many of these chimeric RNAs in normal cells are thought to be derived from trans-splicing events [7,46,50,54,55], which, however, has hardly received unimpeachable experimental evidence, due largely to technical constraints, as described later. In our opinion, trans-splicing occurs only as rare events in cells of evolutionarily high animals, and in the physiological situation in humans, the events are probably as scarce as hen’s teeth.