Filtering criteria & downstream analysis

pmelsted / pizzly

Fast fusion detection using kallisto

BSD 2-Clause "Simplified" License


Filtering criteria & downstream analysis #9

Closed by leiendeckerlu 7 years ago

leiendeckerlu commented 7 years ago

Hi there,

Referring to https://github.com/pmelsted/pizzly/issues/2, I was wondering if you could explain some of the filtering criteria you use to create Sample.json from Sample.unfiltered.json?
Also, is it currently possible to set these filtering criteria individually?
Out of curiosity, would you mind sharing your downstream analysis pipeline for the pizzly output file? Since pizzly identifies a lot of false positives, I definitely have to do some additional filtering (most probably on paircounts?). I'm currently looking into using R for this, but I'm not sure that's the most elegant way. Thanks!

MattBashton commented 7 years ago

For what it's worth, I have implemented a simple downstream script in R here that flattens the JSON, annotates gene locations, and calculates the distance between the two genes when they are on the same chromosome. I don't explicitly filter the output, but you can of course sort the final tab-delimited output by the splitcount or paircount columns as you see fit.
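For anyone who would rather not use R, the flatten-and-sort step can be sketched in Python. This is only a rough illustration, not the R script mentioned above: it assumes the pizzly JSON has a top-level "genes" list whose entries carry "geneA", "geneB", "paircount" and "splitcount", and the gene names and IDs in the example are made up. It skips the location annotation and distance calculation entirely.

```python
import csv
import io

# Hypothetical data mirroring the assumed layout of a pizzly output file.
example = {
    "genes": [
        {"geneA": {"id": "ID_A1", "name": "GENE1"},
         "geneB": {"id": "ID_B1", "name": "GENE2"},
         "paircount": 3, "splitcount": 0},
        {"geneA": {"id": "ID_A2", "name": "GENE3"},
         "geneB": {"id": "ID_B2", "name": "GENE4"},
         "paircount": 1, "splitcount": 5},
        {"geneA": {"id": "ID_A3", "name": "GENE5"},
         "geneB": {"id": "ID_B3", "name": "GENE6"},
         "paircount": 0, "splitcount": 2},
    ]
}

FIELDS = ["geneA.name", "geneA.id", "geneB.name", "geneB.id",
          "paircount", "splitcount"]


def flatten(fusions):
    """Turn the nested fusion records into one flat dict per gene pair."""
    rows = []
    for gf in fusions.get("genes", []):
        rows.append({
            "geneA.name": gf["geneA"]["name"],
            "geneA.id": gf["geneA"]["id"],
            "geneB.name": gf["geneB"]["name"],
            "geneB.id": gf["geneB"]["id"],
            "paircount": gf["paircount"],
            "splitcount": gf["splitcount"],
        })
    return rows


def to_tsv(rows, sort_by="splitcount"):
    """Sort rows by the chosen count column (highest first) and render TSV."""
    ordered = sorted(rows, key=lambda r: (r[sort_by], r["paircount"]),
                     reverse=True)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(ordered)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_tsv(flatten(example)))
```

With a real file you would replace `example` with `json.load(open("Sample.json"))`, and pass `sort_by="paircount"` if you prefer to rank on pair support.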

leiendeckerlu commented 7 years ago

Very cool, will definitely check that out! Thanks Matt!

pmelsted commented 7 years ago

There are two approaches to filtering. One is to set a minimum read count: pizzly does this by requiring at least two read pairs supporting a fusion junction, or one split read. You can also run the flattening script in the latest version to get a TSV table that is easier to filter on.
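To make the read count approach concrete, here is a minimal sketch of a stricter post-hoc filter in Python. It is not pizzly's internal code: the `filter_fusions` helper and its `min_pairs` / `min_split` parameters are hypothetical, the records are invented, and the "paircount" / "splitcount" field names are assumed to match the JSON output. The defaults simply mirror the rule described above (two pairs or one split read).

```python
# Hypothetical fusion records; only the two count fields matter here.
fusions = [
    {"name": "GENE1--GENE2", "paircount": 3, "splitcount": 0},
    {"name": "GENE3--GENE4", "paircount": 1, "splitcount": 0},
    {"name": "GENE5--GENE6", "paircount": 0, "splitcount": 1},
    {"name": "GENE7--GENE8", "paircount": 6, "splitcount": 4},
]


def filter_fusions(records, min_pairs=2, min_split=1, require_both=False):
    """Keep records with enough supporting read pairs or split reads.

    With require_both=True a record must pass both thresholds, which is
    a stricter (assumed, not pizzly-defined) way to cut false positives.
    """
    kept = []
    for rec in records:
        enough_pairs = rec.get("paircount", 0) >= min_pairs
        enough_split = rec.get("splitcount", 0) >= min_split
        if require_both:
            passes = enough_pairs and enough_split
        else:
            passes = enough_pairs or enough_split
        if passes:
            kept.append(rec)
    return kept


if __name__ == "__main__":
    # Default thresholds, then a stricter setting.
    print([r["name"] for r in filter_fusions(fusions)])
    print([r["name"] for r in filter_fusions(fusions, min_pairs=5,
                                             min_split=2,
                                             require_both=True)])
```

Raising the thresholds or switching to `require_both=True` trades sensitivity for fewer false positives, so the right values depend on sequencing depth.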

The other approach is to run kallisto to quantify the fusion transcripts and select those with decent TPM support. An example pipeline is in the Snakefile in the test directory.
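The TPM-based selection could look roughly like the sketch below. It is not taken from the Snakefile: it assumes you have already quantified against an index that includes the fusion transcripts, and it reads kallisto's plain-text abundance table (columns target_id, length, eff_length, est_counts, tpm). The transcript IDs, the numbers, and the `min_tpm` cutoff of 1.0 are all placeholders, since what counts as "decent" support is a judgment call.

```python
import csv
import io

# Hypothetical excerpt of a kallisto abundance.tsv file.
ABUNDANCE_TSV = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "fusion_tx_1\t1200\t1021.5\t40\t12.5\n"
    "fusion_tx_2\t900\t721.5\t1\t0.4\n"
    "normal_tx_1\t2000\t1821.5\t500\t88.0\n"
    "fusion_tx_3\t1500\t1321.5\t0\t0\n"
)


def select_by_tpm(handle, candidate_ids, min_tpm=1.0):
    """Return {target_id: tpm} for candidate transcripts at or above min_tpm."""
    wanted = set(candidate_ids)
    supported = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["target_id"] not in wanted:
            continue  # ignore transcripts that are not fusion candidates
        tpm = float(row["tpm"])
        if tpm >= min_tpm:
            supported[row["target_id"]] = tpm
    return supported


if __name__ == "__main__":
    candidates = ["fusion_tx_1", "fusion_tx_2", "fusion_tx_3"]
    print(select_by_tpm(io.StringIO(ABUNDANCE_TSV), candidates))
```

For a real run, pass `open("abundance.tsv")` instead of the `io.StringIO` wrapper and build `candidate_ids` from the fusion transcript names in your fusion FASTA.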