iobio / clin.iobio

Clin.iobio - Workflow and reporting for iobio variant analysis pipeline

Visualizations of gene results (genepanel connection) #401

Closed AlistairNWard closed 3 years ago

AlistairNWard commented 3 years ago

@adityaekawade We should come up with better visualizations of gene lists and a place to hold them. We should have some back and forth to come up with improvements.

One constraint is that we don't want to force the gene list itself off the main view. After searching, the gene list should always be visible, even if we have to scroll down to it.

Here is an opening thought - not everything is something we would do, or would do right away. Rework the cards after the gene list has been generated:

Asset 8

  1. Give separate cards to the inputs, GTR, Phenolyzer and HPO and use styling from the modals.
  2. I got rid of the gray bar next to individual inputs
  3. I added an Actions menu to e.g., the HPO card. This would allow you to select multiple terms, click actions and remove all of them at once

If we want to make sure the gene list is always on the main page, there isn't room for all the inputs information and visualizations unless they are moved below the gene list, which would mean nobody would ever see them. So we could have tabs or something similar to offer two options above the gene list. The first option is as above, showing the inputs. The second would be to replace all of that with a visualization panel:

Asset 9

What are your thoughts? If this is a good direction, what do you think would be the best way of providing a toggle between the two views? Tabs, buttons?

adityaekawade commented 3 years ago

We can add tabs that will be easy to switch between inputs and visualization views. Screen Shot 2021-04-26 at 10 52 16 AM

Some of my thoughts for the types of visualizations:

  1. Show top 5 genes occurring in most resources and searched terms using bar charts? Eg. for gene PRX - GTR: 3, Pheno - 6, HPO - 5
  2. We can move the Venn diagram in this section.
  3. If only HPO terms are selected, we can add a slider for selecting the genes occurring in N number of searched terms:

Screen Shot 2021-04-26 at 11 19 19 AM

I feel these are more about filtering or summarizing the list / searched terms. Along with this, what other information can we show here that will provide additional value to the user? What else might users find interesting when they are building a gene list from phenotypes?

AlistairNWard commented 3 years ago

Yeah, I agree the tabs are probably the way to go. I'll add that to my mockups and start adding some visuals. I'm thinking that we amend what we show based on the resources used.

There is more filtering, but I bet we can come up with some useful stuff.

AlistairNWard commented 3 years ago

Ok @adityaekawade, here is another thought:

Asset 10

The INPUTS tab is what we have, but then we could have tabs for specific resources (we could also have a COMBINED tab for visualizations based on all resources). So, if we click on HPO, we would still see the HPO terms, but by getting rid of the inputs and info on GTR / Phenolyzer, we open up some space for a couple of charts. I couldn't be bothered to actually fuss with sizes, or make the charts different, but the idea is that one chart would be a horizontal bar chart that shows that N genes are present for exactly 7 HPO terms, M genes for exactly 6 terms, etc. The second chart would be cumulative, e.g. N genes are present in 4 or more HPO terms. These would be interactive and update the genes selected in the table.
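
As a rough illustration of the two charts described above, here is a minimal sketch that derives both the "exactly k terms" counts and the cumulative "k or more terms" counts from one lookup table. The `matchedTermCounts` object and its gene names are made up for the example; nothing here reflects the actual clin.iobio data model.

```javascript
// Hypothetical input: gene symbol -> number of searched HPO terms it matched.
const matchedTermCounts = { GENE_A: 4, GENE_B: 3, GENE_C: 3, GENE_D: 1 };

// Count genes matching exactly k terms, for k = 1..numSearchTerms.
function exactCounts(counts, numSearchTerms) {
  const exact = new Array(numSearchTerms + 1).fill(0); // index 0 is unused
  for (const k of Object.values(counts)) {
    if (k >= 1 && k <= numSearchTerms) exact[k] += 1;
  }
  return exact;
}

// Count genes matching k or more terms by summing from the top bar down.
function cumulativeCounts(exact) {
  const cumulative = new Array(exact.length).fill(0);
  let running = 0;
  for (let k = exact.length - 1; k >= 1; k--) {
    running += exact[k];
    cumulative[k] = running;
  }
  return cumulative;
}

const exact = exactCounts(matchedTermCounts, 4);
const cumulative = cumulativeCounts(exact);
console.log(exact.slice(1));      // counts for k = 1..4: 1, 0, 2, 1
console.log(cumulative.slice(1)); // counts for k = 1..4: 4, 3, 3, 1
```

Since the cumulative values are just running sums of the exact counts, both charts could be driven by the same array.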

Thoughts?

adityaekawade commented 3 years ago

@adityaekawade I added some inline comments.

Exactly as you described. This might not be useful, I'm just throwing out thoughts! The top chart would have the top bar representing genes present in exactly 4 HPO terms (11), the second bar would be genes in exactly 3 HPO terms (46) as you said. The user could interact with that chart to select the top two bars and thus get the 11+46 = 57 genes in 3 or more HPO terms, so you wouldn't really need the cumulative chart.

Screen Shot 2021-04-28 at 10 18 43 AM

This is a good point, and the point of this issue is to iron out issues before coming up with a final proposal to implement. The brushing would be fine if you've only inputted HPO terms, but this is an issue that comes up if there are multiple resources. We could include some controls in the top panel to filter the gene list. In this case, maybe we want a control to only view selected genes? Or Show selected genes at top of list?

I feel like it should be maintained, but we'd have to figure this out in detail first. It might be something we slowly build to. E.g. we could discard for now and incrementally build the functionality. We could start with just the HPO tab (since that's the most pressing), and then worry about expanding to the others after

AlistairNWard commented 3 years ago

What do you think @adityaekawade. Should we start to spec this out for development?

adityaekawade commented 3 years ago

Hi @AlistairNWard Based on our brief discussions and your mockups, I started experimenting with some rework and bar chart visualizations.
I have done some rework on the cards for inputs.

Screen Shot 2021-05-05 at 2 32 26 PM

I added tabs to switch between "inputs" and resources. We will first start with the "HPO".

I have given a separate card for input, GTR, phenolyzer, and HPO terms and am using the same style from the modals. A big challenge here was to align each card for a standard height (All cards adjust to the same height instead of each card having a separate height based on the number of terms in the respective card). Also got rid of the gray bar next to individual inputs.

I feel the search status is important to show here, so I have kept the success icon. If the request fails or if a term does not have genes, it will show the corresponding icon.

A delete icon on terms is currently shown only on hover, but I can change it to be always there.

The "Actions" menu on each card might have to be cramped to fit in, considering the space. But if we switch the tab and go to "HPO", for example, we can make this card a little bigger to include the actions menu.

Bar chart visualization:

Screen Shot 2021-05-05 at 2 19 18 PM

The X-axis represents the number of HPO terms and the Y-axis represents the number of genes, so bar 2 shows the number of genes present in exactly 2 HPO terms. The chart is interactive and can be brushed to select bars. In this screenshot, bars 3 and 4 are selected, which selects 58 genes in the table. These are the genes in 3 or more HPO terms (please ignore the colors for now). We can also try inverting the axes here.
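
To make the brushing behaviour concrete, here is a small sketch of how a set of brushed bars could be mapped back to genes in the table. The function name, the input shape, and the gene names are assumptions for illustration; the real chart code is not shown in this thread.

```javascript
// Hypothetical input: gene symbol -> number of searched HPO terms it matched.
const termCountByGene = { GENE_A: 4, GENE_B: 3, GENE_C: 2, GENE_D: 3, GENE_E: 1 };

// Return the genes whose bar (number of matched terms) is inside the brush.
function genesInSelectedBars(countsByGene, selectedBars) {
  const selected = new Set(selectedBars);
  return Object.keys(countsByGene).filter((gene) => selected.has(countsByGene[gene]));
}

// Brushing bars 3 and 4 selects the genes present in 3 or more terms.
const selectedGenes = genesInSelectedBars(termCountByGene, [3, 4]);
console.log(selectedGenes);
```

The returned symbols could then be used to highlight or filter rows in the gene table.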

Let me know what you think.

AlistairNWard commented 3 years ago

Looks good. I have another idea for visualization for the HPO view. This probably gets us working more deeply with HPO which is good for our other patient matching project.

We could show charts of specificity and sensitivity. I haven't thought through the chart specifics yet, but this is the idea. You search on N HPO terms and get returned a bunch of genes. We have the chart above showing how many genes are associated with specific numbers of terms. Imagine taking the top gene. We know that this gene is associated with N HPO terms, but we could also look at how many HPO terms this gene is associated with. Some genes might be very specific, e.g. the gene is only associated with 1 HPO term, or not specific at all, e.g. the gene could be associated with 100 HPO terms. It would be useful to visualize this. A gene might only be associated with 1 HPO term, but that is an important term. Another gene might be associated with all N terms, but it is associated with 100 terms, so maybe that's actually less interesting. So a visualization that maybe looks at e.g. number of associated terms / total number of terms associated with genes etc could be interesting

adityaekawade commented 3 years ago

Let me know if I got this right?

Let's say PRX is the top gene in our list. It is associated with 4 of our terms, but it is also associated with 102 other HPO terms. Now, there is another gene in the list, let's say RNDM. It is associated with 1 term (HP:0001234) from our selected terms, but it is associated with only that term. This means it is very specific to HP:0001234.

AlistairNWard commented 3 years ago

Exactly. So we could show some distributions of how many terms the genes are associated with, or charts of how specific genes are etc

AlistairNWard commented 3 years ago

When I get a moment, I'm going to start a list of questions that you should feel free to contribute to. These would be questions that we think a researcher might reasonably ask of the genes they have been given after supplying a set of HPO terms.

  1. How many genes are associated with all the HPO terms?
  2. How many genes are associated with at least N of the terms?
  3. How many genes are only associated with HPO terms in the list?
  4. How many genes are associated with so many terms as to be uninformative?
  5. What is the distribution of the number of HPO terms the genes in the list are associated with?
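
Several of these questions reduce to simple filters once each gene carries two numbers: how many searched terms it matched and how many HPO terms it has in total. The sketch below assumes that shape; the gene names, the counts, and the cutoff of 300 are placeholders, not values from the thread.

```javascript
// Hypothetical input: for each gene, the number of searched terms it matched
// and the total number of HPO terms it is associated with.
const genes = {
  GENE_A: { matched: 3, total: 3 },
  GENE_B: { matched: 3, total: 120 },
  GENE_C: { matched: 2, total: 40 },
  GENE_D: { matched: 1, total: 400 },
};
const numSearchTerms = 3;

// Return the gene symbols whose record satisfies a predicate.
const genesWhere = (predicate) =>
  Object.keys(genes).filter((symbol) => predicate(genes[symbol]));

// Question 1: genes associated with all searched terms.
const inAllTerms = genesWhere((g) => g.matched === numSearchTerms);
// Question 2: genes associated with at least N searched terms (N = 2 here).
const inAtLeastTwo = genesWhere((g) => g.matched >= 2);
// Question 3: genes associated only with terms from the search.
const onlySearchedTerms = genesWhere((g) => g.total === g.matched);
// Question 4: genes with so many associations that they may be uninformative.
// The cutoff of 300 is an arbitrary placeholder.
const tooBroad = genesWhere((g) => g.total > 300);

console.log({ inAllTerms, inAtLeastTwo, onlySearchedTerms, tooBroad });
```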

adityaekawade commented 3 years ago

@AlistairNWard Here is a distribution chart (1st card in the screenshot below): Screen Shot 2021-05-07 at 2 06 14 PM

I have used the following data (added below). Consider this as the gene list compiled from 5 HPO terms. E.g., gene ATP6 is associated with 4 of our HPO terms and with a total of 339 HPO terms, so I calculated a percentage of 1.17 (4 / 339, expressed as a percentage), which can also be considered a score.

Similarly PLP1 is associated with 4 of our HPO terms but it is associated with a total of just 85 terms. So it gets a specificity percentage (Score) of 4.70.

Next, we group this data into bins by specificity percentage:

  0 - 1 --> 5 genes
  1 - 2 --> 8 genes
  2 - 3 --> 1 gene
  3 - 4 --> 4 genes
  4 - 5 --> 1 gene

The histogram looks as follows: (The Y-axis should be number of genes instead of distribution)

Screen Shot 2021-05-07 at 2 21 25 PM

So the higher the percentage, the more specific the gene. Let me know what you think.

ATP6: {
  associated: 4, 
  total: 339,
  percentage: 1.17
},
GALC: {
  associated: 4, 
  total: 307,
  percentage: 1.30
},
PMP22: {
  associated: 4, 
  total: 182,
  percentage: 2.19
},
PRX: {
  associated: 4, 
  total: 102,
  percentage: 3.92
},
PLP1: {
  associated: 4, 
  total: 85,
  percentage: 4.70
},
PLP1: {
  associated: 4, 
  total: 358,
  percentage: 1.11
},
PLP1: {
  associated: 3, 
  total: 319,
  percentage: 0.94
},
ND1: {
  associated: 3, 
  total: 420,
  percentage: 0.71
},
ND1: {
  associated: 3, 
  total: 99,
  percentage: 3.03
},
KLC2: {
  associated: 3, 
  total: 92,
  percentage: 3.26
},
SPART: {
  associated: 3, 
  total: 210,
  percentage: 1.42
},
HNRNPA2B1: {
  associated: 2, 
  total: 138,
  percentage: 1.44
},
SNAP25: {
  associated: 2, 
  total: 232,
  percentage: 0.86
},
DHTKD1: {
  associated: 2, 
  total: 60,
  percentage: 3.33
},
MYOT: {
  associated: 2, 
  total: 143,
  percentage: 1.39
},
SCN9A: {
  associated: 1, 
  total: 358,
  percentage: 0.25
},
XRCC1: {
  associated: 1, 
  total: 55,
  percentage: 1.818
},
COX15: {
  associated: 1, 
  total: 141,
  percentage: 0.70
},
CIZ1: {
  associated: 1, 
  total: 63,
  percentage: 1.58
},
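
For reference, the percentage and binning steps described in this comment can be sketched as follows. The four entries are copied from the data above; the function names are invented for the example. The percentages listed in the thread appear to be cut to two decimals, whereas this sketch keeps full precision, so values can differ in the last digit.

```javascript
// Subset of the gene data listed in this comment.
const geneData = {
  ATP6: { associated: 4, total: 339 },
  PRX: { associated: 4, total: 102 },
  KLC2: { associated: 3, total: 92 },
  SNAP25: { associated: 2, total: 232 },
};

// Specificity percentage: matched terms as a percentage of all terms for the gene.
function specificityPercentage({ associated, total }) {
  return (associated / total) * 100;
}

// Group genes into unit-wide bins: [0, 1), [1, 2), ..., with the last bin
// also absorbing anything above the range.
function binPercentages(data, numBins) {
  const bins = new Array(numBins).fill(0);
  for (const record of Object.values(data)) {
    const index = Math.min(Math.floor(specificityPercentage(record)), numBins - 1);
    bins[index] += 1;
  }
  return bins;
}

const bins = binPercentages(geneData, 5);
console.log(bins); // genes per bin for 0-1, 1-2, 2-3, 3-4, 4-5
```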

adityaekawade commented 3 years ago

We can use the above metric along with the genes associated with N terms of the selected HPO terms to rank the gene list.

We can put a weight (x2) to the number of terms from the selection.

For example, ATP6 is associated with 4 out of 5 selected terms, so 4/5 = 0.8. We apply a weight of x2 to it: 0.8 * 2 = 1.6. Next, we add this to the specificity percentage, so the final score for this gene is 1.6 + 1.17 = 2.77.

For KLC2: (3/5) * 2 = 1.2, score = 1.2 + 3.26 = 4.46.

For PRX: (4/5) * 2 = 1.6, score = 1.6 + 3.92 = 5.52.

So, PRX will be ranked 1st, KLC2 2nd, and ATP6 3rd.

This is just a thought, but we can think about assigning weights to these parameters and calculating a final score. This way we are considering the association of the gene with an HPO term, how specific the gene is to the HPO terms, and in how many of the patient's phenotype terms the gene occurs.
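
The weighting proposed above can be written as a short function. This is only a sketch of the arithmetic in this comment: the weight of 2, the 5 search terms, and the three genes come from the examples above, while the function and field names are made up.

```javascript
// Values taken from the worked examples in this comment.
const candidates = [
  { gene: 'ATP6', matched: 4, percentage: 1.17 },
  { gene: 'KLC2', matched: 3, percentage: 3.26 },
  { gene: 'PRX', matched: 4, percentage: 3.92 },
];

// score = (matched terms / search terms) * weight + specificity percentage
function weightedScore({ matched, percentage }, numSearchTerms, weight) {
  return (matched / numSearchTerms) * weight + percentage;
}

// Rank genes from highest to lowest score.
const ranked = candidates
  .map((c) => ({ gene: c.gene, score: weightedScore(c, 5, 2) }))
  .sort((a, b) => b.score - a.score);

for (const { gene, score } of ranked) {
  console.log(gene, score.toFixed(2));
}
```

Because the percentage term is on a much larger scale than the matched-term fraction, the weight largely determines how much the number of matched terms influences the ranking.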

AlistairNWard commented 3 years ago

Ok, looking at ATP6, where does 339 associations come from? I looked on https://hpo.jax.org/app/browse/gene/4508 and there seem to be 133 HPO terms associated with ATP6.

This seems to make sense. I tried it a slightly different way. We could just take the ratio of matched terms to search terms and then just take (1 / total number of terms for gene) as a scaling factor. The more associations the gene has, the more it gets scaled down. So, we could take the score as score = matched terms / (search terms * total associated terms). Then scale by the gene with the highest score. So I took some of those genes:

ATP6 has 4 matched terms, 6 search terms and 133 associated terms giving a score of 0.0061. For the handful of genes I looked at, the one with the highest score was SBF2 with 0.0364. So the scaled score for ATP6 is 0.0061 / 0.0364 = 0.168:

ATP6: score = 0.0061, scaled = 0.168
GALC: score = 0.0066, scaled = 0.181
PMP22: score = 0.0075, scaled = 0.206
SBF2: score = 0.0364, scaled = 1.0
PRX: score = 0.00235, scaled = 0.0646
etc.
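
The scaling idea above (score = matched terms / (search terms * total associated terms), then divide by the highest score) could be sketched like this. The gene names and counts are placeholders chosen for easy arithmetic, not the values used in this comment.

```javascript
// Placeholder records: matched search terms and total associated HPO terms.
const records = [
  { gene: 'GENE_A', matched: 4, total: 100 },
  { gene: 'GENE_B', matched: 2, total: 20 },
  { gene: 'GENE_C', matched: 1, total: 50 },
];

// Raw score: fraction of search terms matched, scaled down by 1 / total terms.
function rawScore({ matched, total }, numSearchTerms) {
  return matched / (numSearchTerms * total);
}

// Scale every score by the highest one so the top gene gets 1.0.
function scaledScores(items, numSearchTerms) {
  const scored = items.map((r) => ({ gene: r.gene, score: rawScore(r, numSearchTerms) }));
  const maxScore = Math.max(...scored.map((s) => s.score));
  return scored.map((s) => ({ ...s, scaled: maxScore > 0 ? s.score / maxScore : 0 }));
}

const scaled = scaledScores(records, 5);
for (const { gene, score, scaled: rel } of scaled) {
  console.log(gene, score.toFixed(4), rel.toFixed(3));
}
```

Note that the scaled value depends on which genes are in the list, so adding or removing a gene can change every other gene's scaled score.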

I think I got very different numbers of associated HPO terms. For PRX, I saw 34 associated HPO terms.

adityaekawade commented 3 years ago

@AlistairNWard Yeah, the numbers seem to be different. Basically, I wrote a small script to get all the HPO terms that list the gene PRX, and I got the following associated HPO terms (a count of 102 for PRX). Earlier I had just done a cmd + F to find the count.

If you go to this link: https://hpo.jax.org/app/browse/gene/57716 There are only 34 terms associated with PRX as you mentioned. But there is no entry for Abnormal digit morphology (HP:0011297) in the table.

Next, go to: https://hpo.jax.org/app/browse/term/HP:0011297, View all genes and you will find PRX in the gene associations.

Here are the HPO IDs of the terms associated with PRX:

[
  'HP:0000006', 'HP:0000007', 'HP:0033127', 'HP:0000478', 'HP:0000496',
  'HP:0000639', 'HP:0410280', 'HP:0000707', 'HP:0000759', 'HP:0000762',
  'HP:0000924', 'HP:0000925', 'HP:0001155', 'HP:0001171', 'HP:0001178',
  'HP:0001252', 'HP:0001265', 'HP:0001270', 'HP:0001284', 'HP:0001288',
  'HP:0001290', 'HP:0001311', 'HP:0001315', 'HP:0001324', 'HP:0001425',
  'HP:0001604', 'HP:0001605', 'HP:0001608', 'HP:0100022', 'HP:0001760',
  'HP:0001761', 'HP:0001765', 'HP:0001780', 'HP:0100257', 'HP:0002011',
  'HP:0002355', 'HP:0002460', 'HP:0002650', 'HP:0002751', 'HP:0002808',
  'HP:0002813', 'HP:0002814', 'HP:0002817', 'HP:0002921', 'HP:0002922',
  'HP:0002936', 'HP:0003011', 'HP:0003130', 'HP:0003134', 'HP:0003202',
  'HP:0003376', 'HP:0003380', 'HP:0003382', 'HP:0003383', 'HP:0003387',
  'HP:0003400', 'HP:0003431', 'HP:0003470', 'HP:0003474', 'HP:0003481',
  'HP:0003593', 'HP:0003674', 'HP:0003677', 'HP:0003679', 'HP:0003690',
  'HP:0003693', 'HP:0003808', 'HP:0003812', 'HP:0003828', 'HP:0004302',
  'HP:0040064', 'HP:0040068', 'HP:0040129', 'HP:0040131', 'HP:0009027',
  'HP:0025456', 'HP:0009121', 'HP:0009127', 'HP:0009830', 'HP:0010674',
  'HP:0010831', 'HP:0010871', 'HP:0011096', 'HP:0011297', 'HP:0011442',
  'HP:0011804', 'HP:0011805', 'HP:0011842', 'HP:0011844', 'HP:0045010',
  'HP:0012373', 'HP:0012447', 'HP:0012547', 'HP:0012638', 'HP:0012639',
  'HP:0012758', 'HP:0012759', 'HP:0030177', 'HP:0030236', 'HP:0031797',
  'HP:0031801', 'HP:0031826'
]

Phenotypes:

[
  'Autosomal dominant inheritance',
  'Autosomal recessive inheritance',
  'Abnormality of the musculoskeletal system',
  'Abnormality of the eye',
  'Abnormality of eye movement',
  'Nystagmus',
  'Pediatric onset',
  'Abnormality of the nervous system',
  'Abnormal peripheral nervous system morphology',
  'Decreased nerve conduction velocity',
  'Abnormality of the skeletal system',
  'Abnormality of the vertebral column',
  'Abnormality of the hand',
  'Split hand',
  'Ulnar claw',
  'Hypotonia',
  'Hyporeflexia',
  'Motor delay',
  'Areflexia',
  'Gait disturbance',
  'Generalized hypotonia',
  'Abnormal nervous system electrophysiology',
  'Reduced tendon reflexes',
  'Muscle weakness',
  'Heterogeneous',
  'Vocal cord paresis',
  'Vocal cord paralysis',
  'Abnormality of the voice',
  'Abnormality of movement',
  'Abnormal foot morphology',
  'Pes cavus',
  'Hammertoe',
  'Abnormality of toe',
  'Ectrodactyly',
  'Morphological central nervous system abnormality',
  'Difficulty walking',
  'Distal muscle weakness',
  'Scoliosis',
  'Kyphoscoliosis',
  'Kyphosis',
  'Abnormality of limb bone morphology',
  'Abnormality of the lower limb',
  'Abnormality of the upper limb',
  'Abnormality of the cerebrospinal fluid',
  'Increased CSF protein',
  'Distal sensory impairment',
  'Abnormality of the musculature',
  'Abnormal peripheral myelination',
  'Abnormality of peripheral nerve conduction',
  'Skeletal muscle atrophy',
  'Steppage gait',
  'Decreased number of peripheral myelinated nerve fibers',
  'Hypertrophic nerve changes',
  'Onion bulb formation',
  'Decreased number of large peripheral myelinated nerve fibers',
  'Basal lamina onion bulb formation',
  'Decreased motor nerve conduction velocity',
  'Paralysis',
  'Sensory impairment',
  'Segmental peripheral demyelination/remyelination',
  'Infantile onset',
  'Onset',
  'Slow progression',
  'Pace of progression',
  'Limb muscle weakness',
  'Distal amyotrophy',
  'Abnormal muscle tone',
  'Phenotypic variability',
  'Variable expressivity',
  'Functional motor deficit',
  'Abnormality of limbs',
  'Abnormality of limb bone',
  'Abnormal nerve conduction velocity',
  'Abnormal motor nerve conduction velocity',
  'Foot dorsiflexor weakness',
  'Abnormal CSF protein level',
  'Abnormal axial skeleton morphology',
  'Abnormality of the musculature of the limbs',
  'Peripheral neuropathy',
  'Abnormality of the curvature of the vertebral column',
  'Impaired proprioception',
  'Sensory ataxia',
  'Peripheral demyelination',
  'Abnormal digit morphology',
  'Abnormal central motor function',
  'Abnormal muscle physiology',
  'Abnormal skeletal muscle morphology',
  'Abnormality of skeletal morphology',
  'Abnormal appendicular skeleton morphology',
  'Abnormality of peripheral nerves',
  'Abnormal eye physiology',
  'Abnormal myelination',
  'Abnormal involuntary eye movements',
  'Abnormal nervous system physiology',
  'Abnormal nervous system morphology',
  'Neurodevelopmental delay',
  'Neurodevelopmental abnormality',
  'Abnormality of peripheral nervous system electrophysiology',
  'Abnormality of muscle size',
  'Clinical course',
  'Vocal cord dysfunction',
  'Abnormal reflex'
]

adityaekawade commented 3 years ago

This seems to make sense. I tried it a slightly different way. We could just take the ratio of matched terms to search terms and then just take (1 / total number of terms for gene) as a scaling factor. The more associations the gene has, the more it gets scaled down. So, we could take the score as score = matched terms / (search terms * total associated terms). Then scale by the gene with the highest score. So I took some of those genes:

ATP6 has 4 matched terms, 6 search terms and 133 associated terms giving a score of 0.0061. For the handful of genes I looked at, the one with the highest score was SBF2 with 0.0364. So the scaled score for ATP6 is 0.0061 / 0.0364 = 0.168:

Yes, makes sense. We can create a chart based on this scaled score and also rank the gene list.

AlistairNWard commented 3 years ago

I assumed you weren't making up the numbers!! Would be good to understand why we're seeing a difference

adityaekawade commented 3 years ago

Yes, we will have to check why there is a difference. For example, there are two terms: Onset and Infantile onset. Maybe they have counted these as one, since both have the same gene.

AlistairNWard commented 3 years ago

Are you using the gene_to_phenotype file downloaded from jax? If so, I just quickly looked at the LMNA gene. If you pull out the number of LMNA lines, there are 705 lines. There are lots of duplicates though. I'm not sure what the Frequency-HPO term is, but there seem to be multiple values of this. For example:

Screen Shot 2021-05-13 at 1 10 30 PM

This HPO term appears three times for LMNA. Same id and term, but different Frequency-HPO and different orphanet disease id. These are the diseases: Atypical Werner syndrome; Hutchinson-Gilford progeria syndrome; Mandibuloacral dysplasia with type A lipodystrophy.

If I remove duplicate HPO ids, I get that LMNA has 417 unique HPO ids, and is associated with 25 disease (orphanet) terms. On the HPO website, there are 430 terms, so still not identical, but clearly we need to understand the differences. This could be that the download is not up to date with the website, vice versa, or something else.
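
The de-duplication step described above might look like the sketch below. The column layout is an assumption (tab-separated, gene symbol in the second column, HPO term ID in the third, `#` for comment lines) and should be checked against the actual download; the sample rows are invented placeholders, not real annotations.

```javascript
// ASSUMED layout: tab-separated, gene symbol at column index 1 and
// HPO term ID at column index 2; lines starting with '#' are comments.
function uniqueTermsPerGene(text, { symbolCol = 1, termCol = 2 } = {}) {
  const termsByGene = new Map();
  for (const line of text.split('\n')) {
    if (line.trim() === '' || line.startsWith('#')) continue;
    const fields = line.split('\t');
    const symbol = fields[symbolCol];
    const termId = fields[termCol];
    if (!symbol || !termId) continue; // skip malformed rows
    if (!termsByGene.has(symbol)) termsByGene.set(symbol, new Set());
    termsByGene.get(symbol).add(termId); // a Set drops repeated term IDs
  }
  return termsByGene;
}

// Placeholder rows: the same term appears three times for one gene with
// different frequency and disease values, plus one additional term.
const sample = [
  '#placeholder header line',
  ['0', 'GENE_X', 'TERM_1', 'term one', '', 'FREQ_A', '', '', 'DISEASE_1'].join('\t'),
  ['0', 'GENE_X', 'TERM_1', 'term one', '', 'FREQ_B', '', '', 'DISEASE_2'].join('\t'),
  ['0', 'GENE_X', 'TERM_1', 'term one', '', '', '', '', 'DISEASE_3'].join('\t'),
  ['0', 'GENE_X', 'TERM_2', 'term two', '', '', '', '', 'DISEASE_1'].join('\t'),
].join('\n');

const termsByGene = uniqueTermsPerGene(sample);
console.log(termsByGene.get('GENE_X').size); // number of distinct term IDs
```

Comparing the line count with the size of the set per gene gives a quick check of how much of the difference is due to repeated rows.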

Screen Shot 2021-05-13 at 1 35 14 PM

adityaekawade commented 3 years ago

I had compiled the list from this link: https://ci.monarchinitiative.org/view/hpo/job/hpo.annotations/lastSuccessfulBuild/artifact/rare-diseases/util/annotation/phenotype_to_genes.txt It's the phenotype-to-genes file.

AlistairNWard commented 3 years ago

Yeah, that looks like the same file. So this is really just to say that we need to fully understand the data sources and files we use.

AlistairNWard commented 3 years ago

Ok, here are a couple of things that came out of today's meeting:

adityaekawade commented 3 years ago

@AlistairNWard A possible reason we see the number of associated terms different is that the web interface shows an HPO term for a gene but does not add its parent in the hierarchy. Example: https://hpo.jax.org/app/browse/gene/57716

PRX has motor delay as an associated HPO term. If you click on motor delay and check its hierarchy, its parent is Neurodevelopmental delay. It is present in the DB that I have compiled but is not shown on the HPO website for the PRX gene.

Screen Shot 2021-05-19 at 10 43 48 AM

Similarly, for another HPO term, Hyporeflexia, the parent is Reduced tendon reflexes. Hyporeflexia is shown on the website but Reduced tendon reflexes is not.

Screen Shot 2021-05-19 at 10 59 12 AM

AlistairNWard commented 3 years ago

If I run my script for PRX, I get 37 terms (the web interface has 34, so I still want to know the reason for this difference; maybe it is just versioning?). I have `Motor delay` in those terms, but I don't have `Neurodevelopmental delay`. I think this is correct. Otherwise, if someone has a very specific phenotype, they are going to be associated with loads of HPO terms, since a whole set of parent terms will get thrown in. If a term is associated with `Motor delay`, it is obviously associated with the parent. We want to see how many distinct HPO terms there are, which will get into the matching project. We can think through how we want this to behave, and combine it with measures of specificity.
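
One way to count only the most specific terms, as discussed above, is to drop any term that is an ancestor of another term in the same set. This is a sketch under stated assumptions: the `parents` map covers only the two parent relationships mentioned in this thread and uses term names as keys for readability, whereas a real implementation would load the full ontology and key on HPO IDs.

```javascript
// Child -> parents, limited to the two relationships mentioned in this thread.
const parents = {
  'Motor delay': ['Neurodevelopmental delay'],
  'Hyporeflexia': ['Reduced tendon reflexes'],
};

// Collect every ancestor of a term by walking up the parent links.
function ancestorsOf(term, parentMap, seen = new Set()) {
  for (const parent of parentMap[term] || []) {
    if (!seen.has(parent)) {
      seen.add(parent);
      ancestorsOf(parent, parentMap, seen);
    }
  }
  return seen;
}

// Keep only terms that are not an ancestor of another term in the list.
function mostSpecificTerms(terms, parentMap) {
  const implied = new Set();
  for (const term of terms) {
    for (const ancestor of ancestorsOf(term, parentMap)) implied.add(ancestor);
  }
  return terms.filter((term) => !implied.has(term));
}

const compiled = [
  'Motor delay',
  'Neurodevelopmental delay',
  'Hyporeflexia',
  'Reduced tendon reflexes',
];
const specific = mostSpecificTerms(compiled, parents);
console.log(specific);
```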

adityaekawade commented 3 years ago

There is a column for Frequency HPO in this file: https://ci.monarchinitiative.org/view/hpo/job/hpo.annotations/lastSuccessfulBuild/artifact/rare-diseases/util/annotation/genes_to_phenotype.txt

It shows the frequency of the HPO term for the gene, indicated by an ID. These frequencies are categorized as Occasional, Very frequent, Very rare, etc.

AlistairNWard commented 3 years ago

We are looking at the same file! I'm running my script on that same genes_to_phenotype.txt file. I just downloaded it directly from jax.

There are 2 frequency columns - frequency-raw and frequency-hpo. Do you know what they both are? It seems like they aren't always filled in though, so I'm not sure what they are referring to. For example, all 45 lines associated with PRX have no information in either of these columns

adityaekawade commented 3 years ago

There are two files. The first one I had shared was phenotype_to_genes.txt and this one is genes_to_phenotype.txt. Now we are looking at the same file.

Right, the first column, "frequency-raw", is almost always empty. "frequency-hpo", when filled, usually shows one of the frequency categories, indicated by an ID. So if the frequency HPO is HP:0040282, it refers to Frequent: present in 30% to 79% of the cases.

Maybe there is not enough data for each hpo-gene association to categorize, hence not filled in.

From their documentation:

There are three allowed options for this field.

  1. A term-id from the HPO-sub-ontology below the term Frequency.
  2. A count of patients affected within a cohort. For instance, 7/13 would indicate that 7 of the 13 patients with the specified disease were found to have the phenotypic abnormality referred to by the HPO term in question in the study referred to by the DB_Reference.
  3. A percentage value such as 17%, again referring to the percentage of patients found to have the phenotypic abnormality referred to by the HPO term in question in the study referred to by the DB_Reference. If possible, the 7/13 format is preferred over the percentage format if the exact data is available.

Ref: https://hpo.jax.org/app/help/annotations
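
Given the three formats quoted above, plus the empty values seen for PRX, a tolerant parser for the frequency field could look like the sketch below. The function name, the returned object shape, and the exact regular expressions are assumptions for illustration, not part of any HPO tooling.

```javascript
// Classify a frequency value as an HPO term ID, a patient count such as
// 7/13, a percentage such as 17%, or missing/unrecognized.
function parseFrequency(value) {
  const v = (value || '').trim();
  if (v === '') return { kind: 'missing' };
  if (/^HP:\d{7}$/.test(v)) return { kind: 'hpoTerm', id: v };

  const count = v.match(/^(\d+)\/(\d+)$/);
  if (count) {
    const affected = Number(count[1]);
    const total = Number(count[2]);
    return { kind: 'count', affected, total, fraction: total > 0 ? affected / total : null };
  }

  const percent = v.match(/^(\d+(?:\.\d+)?)\s*%$/);
  if (percent) return { kind: 'percentage', fraction: Number(percent[1]) / 100 };

  return { kind: 'unrecognized', raw: v };
}

console.log(parseFrequency('HP:0040282'));
console.log(parseFrequency('7/13'));
console.log(parseFrequency('17%'));
console.log(parseFrequency(''));
```

Normalizing counts and percentages to a single `fraction` would let the table or gene info card display them on one scale, while term IDs would still need a lookup of their frequency category.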

AlistairNWard commented 3 years ago

It looks like this is useful information if it exists, but often doesn't. Probably something we'd want to include in the table, gene info card at a minimum

adityaekawade commented 3 years ago

Closing this issue as it has been addressed in recent releases.