ndaniel / fusioncatcher

Finder of Somatic Fusion Genes in RNA-seq data
GNU General Public License v3.0

Duplicated genes and masking #58

Closed mpschr closed 7 years ago

mpschr commented 7 years ago

Hi

I wanted to ask if there is an easy way to mask duplicated genes. There are a great many genes with identical or very similar sequences (e.g. http://dgd.genouest.org/list/ENSG00000117115/1/), which causes reads to align to multiple loci, so they are flagged as duplicate reads and discarded.

Is there an easy way for me to remove all but one copy of such a gene of interest and let FusionCatcher work with just that copy? I have samples where I BLASTed the unaligned reads by hand and confirmed that fusion reads are present in such cases, but I can't get FusionCatcher to call the fusions.

Thanks for any help, Michael

ndaniel commented 7 years ago

Actually, FusionCatcher has several mechanisms for dealing with genes of this kind, which have very similar sequences, for example a gene and its pseudogenes. For now the approach is semi-manual, which means that some genes are shielded.

In your specific case, it looks like FusionCatcher currently does not do anything with the PADI1-4 genes, so one would need to tweak FusionCatcher manually in order to shield one gene, for example PADI1, by replacing the sequences of the PADI2, PADI3, and PADI4 genes with As in the genome. This can be done like this:

  1. Add the genomic coordinates of the genes whose sequences should be replaced with As on the genome to shield_genes.py, at line 337.
  2. Re-build all the files needed by FusionCatcher by running fusioncatcher-build.py.
  3. Use the files built in step 2 when running FusionCatcher.
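
To illustrate what "replacing with As" means in step 1, here is a minimal sketch in Python. It is not the code in shield_genes.py. The function name `mask_regions`, the toy sequence, and the coordinates are all hypothetical, and it assumes 1-based, inclusive coordinates. Real coordinates for PADI2, PADI3, and PADI4 would have to come from the annotation used for the build.

```python
def mask_regions(sequence, regions, fill="A"):
    """Return a copy of `sequence` in which every region is overwritten with `fill`.

    `regions` is a list of (start, end) pairs, assumed to be 1-based and
    inclusive (an assumption of this sketch, not a FusionCatcher convention).
    """
    masked = list(sequence)
    for start, end in regions:
        if start < 1 or end > len(sequence) or start > end:
            raise ValueError(f"region ({start}, {end}) is outside the sequence")
        # Convert the 1-based inclusive range to a Python slice and overwrite it.
        masked[start - 1:end] = fill * (end - start + 1)
    return "".join(masked)


# Toy stand-in for a chromosome; the two regions play the role of the
# duplicated gene copies that should no longer attract read alignments.
toy_chromosome = "CGTTGCCGTAGGCTTACG"
regions_to_mask = [(3, 6), (12, 15)]

masked = mask_regions(toy_chromosome, regions_to_mask)
print(masked)  # CGAAAACGTAGAAAAACG
```

The sequence length stays the same, so the coordinates of everything else on the chromosome are unchanged. Only the masked copies stop matching reads, which leaves the shielded gene as the single alignment target.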