ndaniel / fusioncatcher

Finder of Somatic Fusion Genes in RNA-seq data
GNU General Public License v3.0

Coordinate lifted between hg38 and hg19 #145

Closed xiucz closed 4 years ago

xiucz commented 4 years ago

Hi, @ndaniel

I find that FusionCatcher's conversion from hg38 to hg19 differs slightly from the result given by https://genome.ucsc.edu/cgi-bin/hgLiftOver. Please see the example below.

The hg38 coordinate of DUX4 is 4:190175393, but the lifted result looks strange. Also, when I use the UCSC LiftOver tool to convert the hg38 coordinate of IGH@ (14:106719377) to hg19, it returns chr14:107174624-107174624. I do know, though, that IGH has numerous breakpoints.

How does FusionCatcher handle the conversion? Do I need to recheck the hg19.txt file?

 $ grep DUX4 final-list_candidate-fusion-genes.txt final-list_candidate-fusion-genes.hg19.txt
final-list_candidate-fusion-genes.txt:DUX4      IGH@    known,cancer    0       4       2       18      BOWTIE+STAR     4:190175393:+   14:106719377:+ ENSG00000260596 ENSG09000000018                 CGGGCAGAGCTCTCCTGCCTCTCCACCAGCCCACCCCGCCGCCTGACC*NNNNNNNNNNNNNNNNNNNNNGCCCCCCCCCCCCCCGCCGACCCCACCACCAAATCATTATAAAGCCCTG        intronic/---

final-list_candidate-fusion-genes.hg19.txt:DUX4 IGH@    known,cancer    0       4       2       18      BOWTIE+STAR     Un_gl000228:114224:+  not-converted+   ENSG00000260596 ENSG09000000018                 CGGGCAGAGCTCTCCTGCCTCTCCACCAGCCCACCCCGCCGCCTGACC*NNNNNNNNNNNNNNNNNNNNNGCCCCCCCCCCCCCCGCCGACCCCACCACCAAATCATTATAAAGCCCTG        intronic/---

Thank you very much.

ndaniel commented 4 years ago

Looking at the conversion results, LiftOver was not able to convert the hg38 coordinate 14:106719377 into an hg19 coordinate. My guess is that, for 4:190175393, LiftOver returns several hg19 coordinates and FusionCatcher simply picks one of them at random.
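To spot such rows in a `.hg19.txt` file quickly, the breakpoint fields can be classified. This is a minimal sketch and not part of FusionCatcher: the function name `classify_breakpoint` and the set of "primary" chromosome names are assumptions made for illustration. The only inputs it relies on are the field shapes visible in the output above (`chrom:pos:strand` and `not-converted+`).

```python
# Hypothetical helper, not FusionCatcher code.
# Assumption: primary chromosomes are named 1-22, X, Y, MT (optionally "chr"-prefixed).
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y", "MT", "M"}


def classify_breakpoint(field: str) -> str:
    """Classify a breakpoint field from a lifted candidate-fusion list."""
    field = field.strip()
    # Fields such as "not-converted+" mark coordinates that could not be lifted.
    if field.startswith("not-converted"):
        return "not-converted"
    chrom = field.split(":")[0]
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    # Anything else (e.g. "Un_gl000228") is an unplaced or alternate contig.
    return "primary" if chrom in PRIMARY else "non-primary"


if __name__ == "__main__":
    for f in ("Un_gl000228:114224:+", "not-converted+", "17:57721923:+"):
        print(f, "->", classify_breakpoint(f))
```

Rows where either breakpoint is `not-converted` or `non-primary` are the ones worth checking by hand against the UCSC web tool.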

xiucz commented 4 years ago

Hi @ndaniel, thank you for your quick response. The length of chr14 in hg19 is 107,349,540. I can accept your explanation for 4:190175393, but I cannot understand why LiftOver was not able to convert the hg38 coordinate 14:106719377 into an hg19 coordinate. Actually, the UCSC tool returns: 111

Did I miss something? Could you explain further?

Thank you very much.

xiucz

ndaniel commented 4 years ago

I think the reason is that FusionCatcher uses the LiftOver executable, which is a different version from the one behind the UCSC Genome Browser web tool.

Also, the conversion in FusionCatcher is done one coordinate at a time (and not using intervals), which means something like:

liftOver chr4:190175393
liftOver chr14:106719377

and NOT

liftOver chr4:190175393-106719377
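To reproduce a single-coordinate conversion by hand, each breakpoint can be written as its own one-base BED record and passed to the command-line `liftOver`. The sketch below is an illustration, not FusionCatcher's actual code: the function name `breakpoint_to_bed` is hypothetical, and it assumes the reported positions are 1-based, so the BED start is `pos - 1` (BED intervals are 0-based and half-open).

```python
# Hypothetical helper, not FusionCatcher code.
# Assumption: breakpoints look like "4:190175393:+" and positions are 1-based.


def breakpoint_to_bed(breakpoint: str) -> str:
    """Turn 'chrom:pos:strand' into a one-base BED6 line for liftOver."""
    chrom, pos, strand = breakpoint.split(":")
    end = int(pos)
    start = end - 1  # BED is 0-based, half-open
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom  # UCSC chain files use "chr"-prefixed names
    # BED6 columns: chrom, start, end, name, score, strand
    return "\t".join([chrom, str(start), str(end), breakpoint, "0", strand])


if __name__ == "__main__":
    # One record per breakpoint, never a single interval spanning both.
    for bp in ("4:190175393:+", "14:106719377:+"):
        print(breakpoint_to_bed(bp))
```

The printed lines can be saved to a file and lifted with the usual four-argument form, `liftOver input.bed hg38ToHg19.over.chain.gz lifted.bed unmapped.bed`. Records that fail to convert end up in the unmapped file with a comment line giving the reason.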

xiucz commented 4 years ago

@ndaniel Thank you, that may well be the explanation.

xiucz commented 4 years ago

Hi, @ndaniel

One more case,

The CLTC reference sequence "CTCTTCCTATGTTTTTGTTTTTTTTTGTTTTTTTTTTGTTTGTTTGTTTG" is consistent with the hg38 assembly (http://genome.ucsc.edu/cgi-bin/das/hg38/dna?segment=chr17:59644362,59644562), not with hg19 (http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=17:59643562,59644562).

So I think this may be a bug.

grep CLTC fusioncatcher/final-list_candidate-fusion-genes.txt fusioncatcher/final-list_candidate-fusion-genes.hg19.txt |grep ALK

fusioncatcher/final-list_candidate-fusion-genes.txt:CLTC        ALK     known,oncogene,cosmic,chimerdb2,cgp,ticdb,fragments,chimerdb3kb,chimerdb3pub,cancer,tumor      0       4       2       21      BOWTIE+STAR     17:59644562:+   2:29209551:-    ENSG00000141367 ENSG00000171094       CTCTTCCTATGTTTTTGTTTTTTTTTGTTTTTTTTTTGTTTGTTTGTTTG*TTTTTTTTTGAGACGGAGTTTCGCTCTTGTTGCCCAGGCTGGAGTGCCAT    intronic/intronic

fusioncatcher/final-list_candidate-fusion-genes.hg19.txt:CLTC     ALK     known,oncogene,cosmic,chimerdb2,cgp,ticdb,fragments,chimerdb3kb,chimerdb3pub,cancer,tumor      0       4       2       21      BOWTIE+STAR     17:57721923:+   2:29432417:-    ENSG00000141367 ENSG00000171094 CTCTTCCTATGTTTTTGTTTTTTTTTGTTTTTTTTTTGTTTGTTTGTTTG*TTTTTTTTTGAGACGGAGTTTCGCTCTTGTTGCCCAGGCTGGAGTGCCAT   intronic/intronic

ndaniel commented 4 years ago

Hi @xiucz

Maybe, but I think there is also another bug in FC v1.10 (and lower versions): reads from low-entropy regions are not detected very well. This has been fixed in v1.20.

To me, fusion junctions with very low-entropy sequences, such as TTTTTGTTTTTTTTTGTTTTTTTTTTGTTTGTTTGTTTG*TTTTTTTTT and *NNNNNNNNNNNNNNNNNNNNNGCCCCCCCCCCCCCCGCC, look very likely to be false-positive fusions. Sequences of this kind are also tricky to align to the genome.
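One rough way to put a number on "low entropy" is the Shannon entropy of the bases on each side of the junction. This is a generic sketch and an assumption on my part, not the filter FusionCatcher v1.20 uses. The function names are hypothetical, and any cutoff for calling a junction suspicious would have to be chosen empirically.

```python
import math
from collections import Counter

# Hypothetical helpers, not FusionCatcher code.


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per base (0.0 for a homopolymer, 2.0 for uniform ACGT)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def junction_entropies(junction: str) -> tuple:
    """Split a junction sequence at '*' and score the 5' and 3' sides separately."""
    left, _, right = junction.partition("*")
    return shannon_entropy(left), shannon_entropy(right)


if __name__ == "__main__":
    # Junction sequence taken from the CLTC/ALK row quoted in this thread.
    cltc_alk = ("CTCTTCCTATGTTTTTGTTTTTTTTTGTTTTTTTTTTGTTTGTTTGTTTG"
                "*TTTTTTTTTGAGACGGAGTTTCGCTCTTGTTGCCCAGGCTGGAGTGCCAT")
    print(junction_entropies(cltc_alk))
```

Scoring the two sides separately matters because a junction can have one ordinary side and one repetitive side, and a whole-sequence average would hide that.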

xiucz commented 4 years ago

Thank you for your advice. I will try the latest version.