hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0

Feature request: plot custom gene list in Linx #70

Closed lkhilton closed 4 years ago

lkhilton commented 4 years ago

We're really loving your tools and the visualizations they produce. We work in lymphoid cancers and have several driver genes that are affected by recurrent translocations that don't create fusions (e.g. MYC-IGH) and would really love a way to easily plot these on the LINX diagrams. We'd love a feature where we can input a list of driver genes of interest and have them plotted onto relevant LINX cluster diagrams if they're in proximity to breakpoints. I hope you'll consider adding it.

Thanks for an excellent tool.

rdmorin commented 4 years ago

I concur. The most straightforward way to implement this might be to treat the IGH, IGL and IGK genes as recurrent fusion partners and to allow a user-configurable maximum distance between the genes on either end of the breakpoint for them to be considered "fused".
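
The distance rule suggested above could be sketched roughly as follows. This only illustrates the idea and is not how Linx works: the `Gene` type, the `genes_near_breakpoint` function, the `max_distance` parameter, and all gene names and coordinates are hypothetical.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    chromosome: str
    start: int
    end: int


def genes_near_breakpoint(genes: List[Gene], chromosome: str,
                          position: int, max_distance: int) -> List[str]:
    """Return names of genes within max_distance bases of a breakpoint."""
    hits = []
    for gene in genes:
        if gene.chromosome != chromosome:
            continue
        if position < gene.start:
            distance = gene.start - position
        elif position > gene.end:
            distance = position - gene.end
        else:
            distance = 0  # breakpoint falls inside the gene
        if distance <= max_distance:
            hits.append(gene.name)
    return hits


# Placeholder genes and coordinates, for illustration only.
genes = [
    Gene("GENE_A", "8", 1_000_000, 1_010_000),
    Gene("GENE_B", "14", 5_000_000, 5_300_000),
    Gene("GENE_C", "8", 9_000_000, 9_050_000),
]
print(genes_near_breakpoint(genes, "8", 1_250_000, max_distance=500_000))
```

With a window of 500 kb, only GENE_A qualifies here, since the breakpoint lies 240 kb past its end. A sensible default window would depend on how far from the oncogene the relevant breakpoints typically fall.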

jonbaber commented 4 years ago

Hi, glad you are finding the tools useful.

The gene parameter allows you to add genes to a figure (see here for details). Is this sufficient?

lkhilton commented 4 years ago

Yes, this should definitely be sufficient. Sorry I missed this in your incredibly detailed documentation. Thanks again.

lkhilton commented 4 years ago

I downloaded the HMF gene panel for GRCh37 at the link you referenced above, and it looks like it's missing many IGH and IGL genes and all of IGK. Any idea why?

jonbaber commented 4 years ago

When I search for those genes in the file I can find them:

$ grep IGK ~/hmf/repos/hmftools/hmf-common/src/main/resources/genepanel/all_genes.37.tsv | wc
      11     187    1584
$ grep IGH ~/hmf/repos/hmftools/hmf-common/src/main/resources/genepanel/all_genes.37.tsv | wc
      33     533    4864
$ grep IGL ~/hmf/repos/hmftools/hmf-common/src/main/resources/genepanel/all_genes.37.tsv | wc
     142    2400   20781

Perhaps something went wrong with the download?

rdmorin commented 4 years ago

I'm sorry, we should have been more explicit. That file appears to contain a few IGH genes, but most of them are missing. I don't see any for IGK (I think that grep is matching the PIGK gene). It wouldn't surprise me if the genes we are looking for are absent from this list. Ensembl assigns them a different biotype because they don't represent complete genes until after the somatic recombination of this locus. The biotypes of the genes Laura is referring to are as follows:

IG_J_gene, IG_V_gene, IG_C_gene, IG_D_gene

Some arbitrarily chosen example gene names are IGHJ6, IGLV3-32, IGHM, IGHD2-15 etc.
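
The PIGK false positive comes from matching "IGK" anywhere in a line. A minimal sketch of the difference between a substring match and a match anchored at the start of the gene name is shown below. The helper names are hypothetical, and the list of names is a small hand-picked sample, not the contents of the gene panel file.

```python
import re

# Gene names beginning with IGH, IGK or IGL.
IG_LOCUS_PREFIX = re.compile(r"^IG[HKL]")


def substring_hits(names, locus):
    """Mimic a plain grep: match the locus string anywhere in the name."""
    return [name for name in names if locus in name]


def prefix_hits(names):
    """Match only names that start with an IG locus prefix."""
    return [name for name in names if IG_LOCUS_PREFIX.match(name)]


names = ["PIGK", "PIGH", "SIGLEC1", "IGHM", "IGHJ6", "IGLV3-32", "IGHD2-15"]

print(substring_hits(names, "IGK"))  # ['PIGK']
print(prefix_hits(names))            # ['IGHM', 'IGHJ6', 'IGLV3-32', 'IGHD2-15']
```

A name prefix is still only a heuristic (IGHMBP2, for instance, also starts with IGH), so filtering on the Ensembl biotypes listed above would be more reliable wherever that annotation is available.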

p-priestley commented 4 years ago

Ok. Understood now. We are not experts in lymphoid cancers so we had not looked into these biotypes in detail.

I will look into this and get back to you with a proposal for handling them after the holiday period.

rdmorin commented 4 years ago

Is there anything we can do to help with this? I expect the limitation here is that the hmf gene panel would need to be revised to include these genes.

DarioS commented 4 years ago

Is what you're describing basically a more general structural variant than enhancer hijacking?

rdmorin commented 4 years ago

The immunoglobulin rearrangements are certainly examples of enhancer hijacking. The locations of the enhancers are not very well annotated, though, so using the immunoglobulin genes is a reasonable proxy for the hijacking. In other words, these would appear as a fusion between one of the IG genes and an oncogene such as MYC, CCND1 or BCL2.

lkhilton commented 4 years ago

Any thoughts on how we can implement this functionality?

p-priestley commented 4 years ago

Sorry for the delay. This slipped off the radar. We have discussed and will make this available very shortly.

In the meantime, as a workaround, you can manually edit the "LNX_VIS_GENE_EXONS.tsv" file yourself and add the gene definitions. You just need to add the exon definitions for the gene in question and set the ClusterId to that of the cluster you are drawing.

Example below:

SampleId  ClusterId  Gene    Transcript       Chromosome  AnnotationType  ExonRank  ExonStart  ExonEnd
XXX       75         CDKN2A  ENST00000498124  9           DRIVER          4         21968055   21968241
XXX       75         CDKN2A  ENST00000498124  9           DRIVER          3         21968574   21968770
XXX       75         CDKN2A  ENST00000498124  9           DRIVER          2         21970901   21971207
XXX       75         CDKN2A  ENST00000498124  9           DRIVER          1         21974677   21974865
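
For anyone scripting this workaround, the rows could be generated along these lines. This is a sketch under assumptions: the file is taken to be tab-delimited (as the .tsv extension suggests), with the column order from the example above. The function names are hypothetical and not part of Linx.

```python
import csv
import io

# Column order as shown in the example above.
COLUMNS = ["SampleId", "ClusterId", "Gene", "Transcript", "Chromosome",
           "AnnotationType", "ExonRank", "ExonStart", "ExonEnd"]


def gene_exon_rows(sample_id, cluster_id, gene, transcript, chromosome,
                   exons, annotation_type="DRIVER"):
    """Build one row per exon; exons is a list of (rank, start, end)."""
    return [
        [sample_id, cluster_id, gene, transcript, chromosome,
         annotation_type, rank, start, end]
        for rank, start, end in exons
    ]


def to_tsv(rows, include_header=False):
    """Serialise rows as tab-separated text (assumed file format)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    if include_header:
        writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Exon coordinates copied from the CDKN2A example above.
cdkn2a_exons = [
    (4, 21968055, 21968241),
    (3, 21968574, 21968770),
    (2, 21970901, 21971207),
    (1, 21974677, 21974865),
]
rows = gene_exon_rows("XXX", 75, "CDKN2A", "ENST00000498124", "9", cdkn2a_exons)
print(to_tsv(rows), end="")
```

The resulting text would be appended to the existing file without a header, since the file Linx writes already has one.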

charlesshale commented 4 years ago

I've added the ability for the Linx Visualiser to load Ensembl data from the same Ensembl data cache which Linx loads. This has those genes which are missing from our internal gene list. The new config is optional but can be specified as: _gene_transcripts_dir /path_to_ensembl_datacache/

The change has been committed and will be released with 1.8 shortly.

charlesshale commented 4 years ago

See comment above.

lkhilton commented 4 years ago

Thanks! I look forward to testing it out.

DarioS commented 3 years ago

Is there a shortcut to specify any fusion between any pair of genes?

p-priestley commented 3 years ago

You can try manually configuring the LNX_VIS_FUSIONS.tsv file.

You just need to make sure the ClusterId and chromosomes match the cluster and chromosomes you are plotting:

SampleId ClusterId Reportable GeneNameUp TranscriptUp ChrUp PosUp StrandUp RegionTypeUp FusedExonUp GeneNameDown TranscriptDown ChrDown PosDown StrandDown RegionTypeDown FusedExonDown
ABCD ClusterId=89 row: ABCD 89 true KIF5B ENST00000302418 10 32315968 -1 Intronic 15 RET ENST00000355710 10 43610176 1 Exonic 12
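
A hand-edited row can be sanity-checked before plotting. The sketch below assumes a tab-delimited file with the 17 columns listed above, and it interprets "matches" as: the ClusterId equals the cluster being drawn and both fusion chromosomes are among those plotted. Both the interpretation and the function names are assumptions, not documented Linx behaviour.

```python
FUSION_COLUMNS = [
    "SampleId", "ClusterId", "Reportable",
    "GeneNameUp", "TranscriptUp", "ChrUp", "PosUp", "StrandUp",
    "RegionTypeUp", "FusedExonUp",
    "GeneNameDown", "TranscriptDown", "ChrDown", "PosDown", "StrandDown",
    "RegionTypeDown", "FusedExonDown",
]


def parse_fusion_row(line):
    """Split one tab-delimited row into a column-name -> value dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FUSION_COLUMNS):
        raise ValueError(
            f"expected {len(FUSION_COLUMNS)} columns, got {len(values)}")
    return dict(zip(FUSION_COLUMNS, values))


def matches_plot(row, cluster_id, chromosomes):
    """Check the row against the cluster and chromosomes being plotted."""
    return (row["ClusterId"] == str(cluster_id)
            and row["ChrUp"] in chromosomes
            and row["ChrDown"] in chromosomes)


# Values copied from the KIF5B-RET example above.
line = "\t".join([
    "ABCD", "89", "true",
    "KIF5B", "ENST00000302418", "10", "32315968", "-1", "Intronic", "15",
    "RET", "ENST00000355710", "10", "43610176", "1", "Exonic", "12",
])
row = parse_fusion_row(line)
print(matches_plot(row, cluster_id=89, chromosomes={"10"}))
```

The column-count check catches the most likely editing mistake, a dropped or extra tab.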

jamesdalg commented 2 years ago

What precisely does PosUp refer to? Is it the position of the actual break or the position of the exon that is broken? From which output file is it populated? I've looked in all of the linx (non-vis) output files and can't find a position to populate it with. I've populated every other field using linx.fusion.tsv, linx.breakend.tsv, and linx.svs.tsv. Also, what precisely is ExonRank in the vis_gene_exon file?

p-priestley commented 2 years ago

PosUp is the first breakend position in the genome. It can be sourced from the original SV vcf file.

ExonRank is the ranking of the exon in the transcript.
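
One way to read "ranking of the exon in the transcript" is that exons are numbered in the direction of transcription. That reading is an assumption, but it is consistent with the CDKN2A rows earlier in the thread, where rank 1 has the highest coordinates and CDKN2A lies on the reverse strand. A sketch with a hypothetical `rank_exons` helper, using the same 1 / -1 strand convention as the StrandUp and StrandDown columns:

```python
def rank_exons(exons, strand):
    """Number exons from 1 in transcription order.

    exons  -- list of (start, end) genomic coordinates
    strand -- 1 for the forward strand, -1 for the reverse strand
    """
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    # Forward strand: ascending coordinates; reverse strand: descending.
    ordered = sorted(exons, reverse=(strand == -1))
    return [(rank, start, end)
            for rank, (start, end) in enumerate(ordered, start=1)]


# Exon coordinates from the CDKN2A example earlier in the thread.
cdkn2a = [
    (21968055, 21968241),
    (21968574, 21968770),
    (21970901, 21971207),
    (21974677, 21974865),
]
for rank, start, end in rank_exons(cdkn2a, strand=-1):
    print(rank, start, end)
```

Under this assumption the exon at 21974677-21974865 gets rank 1 and the exon at 21968055-21968241 gets rank 4, which agrees with the example rows.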

jamesdalg commented 2 years ago

There are different kinds of identifiers in the linx.svs.tsv file (gridss1f_104915b, purple_0, and unbalanced_0). How does one find each of these in the respective VCFs? I can find the gridss1f_104915b one easily (it is the ID field in the purple SV VCF), and such records are easily extracted with vcftools, but purple_0 and unbalanced_0 are less clear. If you can let me know how to get these breakpoints, that would help greatly.