monarch-initiative / monarch-app

Monarch Initiative website and API
https://monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License

Add frequency to PomBase gene to phenotype transform #647

Open kevinschaper opened 8 months ago

kevinschaper commented 8 months ago

Column 15 in the PHAF format is described as:

Penetrance describes the proportion of a population that shows a cell-level phenotype. Penetrance data are represented as percents or entries from the in-house FYPO_EXT ontology (FYPO_EXT:0000001 = high; FYPO_EXT:0000002 = medium; FYPO_EXT:0000003 = low; FYPO_EXT:0000004 = full).

(the numbers preceding values below are counts)

The mapping to FYPO_EXT looks fairly clear here for these qualifier names:

5424 high
1391 medium
 991 low
 153 complete
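
A minimal sketch of that mapping, assuming that "complete" in the data corresponds to "full" in the column description (the description lists "full", the data contains "complete"). The dictionary and function names are hypothetical and not part of the existing transform:

```python
from typing import Optional

# IDs are taken from the column 15 description quoted above.
FYPO_EXT_BY_QUALIFIER = {
    "high": "FYPO_EXT:0000001",
    "medium": "FYPO_EXT:0000002",
    "low": "FYPO_EXT:0000003",
    # Assumption: "complete" in the data means "full" in the description.
    "complete": "FYPO_EXT:0000004",
    "full": "FYPO_EXT:0000004",
}


def qualifier_to_fypo_ext(value: str) -> Optional[str]:
    """Return the FYPO_EXT ID for a qualifier name, or None if it is unmapped."""
    return FYPO_EXT_BY_QUALIFIER.get(value.strip().lower())
```

Combined values such as "medium,high" are not in the lookup, so they return None and can be handled separately.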

Less clear for these:

   1 medium,high
   1 high,20

The FYPO_EXT definitions themselves don't give frequency ranges. For HPO frequency qualifiers, our sorting function takes the low value of the defined ranges, but I'm not sure how I would map the FYPO_EXT terms to numeric values for sorting.

There are numerical ranges defined as well, some examples:

  10 60-70
   9 30-40
   6 5-30
   6 10-20
   5 70-80

For consistency with HPO range qualifier behavior, I assume these would sort on the low value.

For sorting approximate frequencies, I would probably just strip the ~ and continue sorting on the low value.

   1 ~8
   1 ~7580
   1 ~75
   1 ~70
   1 ~7
   1 ~66
   1 ~65
   1 ~60-70
   1 ~58
   1 ~52

(~7580 looks like it's meant to be ~75-80?)
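
A rough sketch of sorting on the low value for both ranges and approximate values. The function name is hypothetical, and rejecting anything over 100 is my own assumption, added so that entries like ~7580 get flagged instead of sorted:

```python
import re
from typing import Optional

# Optional "~", a low value, and an optional "-high" part, e.g. "60-70", "~60-70", "~8".
_PERCENT_RE = re.compile(r"^~?\s*(\d+(?:\.\d+)?)(?:\s*-\s*(\d+(?:\.\d+)?))?$")


def low_percent(value: str) -> Optional[float]:
    """Return the low end of a percent or percent range, ignoring a leading "~".

    Returns None for values that do not parse or that exceed 100.
    """
    match = _PERCENT_RE.match(value.strip())
    if match is None:
        return None
    low = float(match.group(1))
    # Assumption: a percent above 100 is a data error (e.g. "~7580").
    if low > 100:
        return None
    return low
```

With this, "~60-70" and "60-70" both sort as 60, and "~7580" returns None so it can be reported rather than silently sorted.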

Finally, there are greater-than and less-than values. I assume that for the sake of sorting we would just want to strip the > or < and alter the value slightly so that ">80" would sort above "80".
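
Something like the following could work for the comparator case. The function name and the 0.5 offset are assumptions; the offset presumes whole-number percents, so that ">80" lands between "80" and "81":

```python
def comparator_sort_value(value: str, nudge: float = 0.5) -> float:
    """Return a numeric sort value for plain, ">" or "<" prefixed percents."""
    value = value.strip()
    if value.startswith(">"):
        # Sort just above the stated number.
        return float(value[1:]) + nudge
    if value.startswith("<"):
        # Sort just below the stated number.
        return float(value[1:]) - nudge
    return float(value)
```

Used as a sort key, this orders "<80", "80", ">80" in that sequence.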

cc:@valwood

ValWood commented 8 months ago

Hi @kevinschaper

I don't think it is worth you including the fission yeast penetrance and specificity extensions in Monarch. These are probably only really useful to fission yeast researchers working on these genes.

I misunderstood what the frequencies referred to. I thought that multiple annotations to the same phenotype were going to be collapsed and a "frequency" assigned like a "tally".

For instance, in cdc2 there are 387 phenotypes, but many of the annotations are identical (from different sources), e.g.:

[Screenshot 2024-03-22 at 16 32 08]

Is the "frequency" column intended to represent the frequency in a population? If so, it might be better to call it penetrance to be unambiguous?

The extensions in column 17 with the "assayed using" qualifier might be more useful because these link to the other gene entities that the mutant affects (making connections between other entities in the knowledge graph). A biological example might be that gene A, when mutated, affects the localization, transcript level, or modification of gene B. These could be useful for networks because >70% of fission yeast genes have human orthologs.

I once sent an e-mail describing the aspects of fission yeast phenotypes data that would be most useful for informing human biology and hence for display in Monarch. I will see if I can find it.

I'm happy to meet up and discuss what might be most useful for Monarch with you @cmungall @monicacecilia

Sorry, this ticket is now about multiple things!

ValWood commented 8 months ago

Anyway, if you do decide to use penetrance, these 3 will be fixed in tomorrow's export file:

high,20 (fixed to 20 (%))
7580 (fixed; I used a non-ASCII dash which got stripped, we will add a check for that)
medium,high (fixed to high)