pombase / pombase-chado

PomBase code for accessing Chado
MIT License

Checks for phenotype data (errors picked up by Monarch) #1154

Closed ValWood closed 7 months ago

ValWood commented 8 months ago

Errors found in pHAF file from https://github.com/monarch-initiative/monarch-app/issues/647

we should check for only a single ~entrance~ penetrance value per annotation, and no non-ascii characters

high,20 (fixed to 20 (%))
7580 (fixed; I used a non-ASCII dash which got stripped, we will add a check for that)
medium,high (fixed to high)
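A minimal sketch of the two checks proposed above, in Python. This is not the PomBase loader code; the function name `check_penetrance` is hypothetical, and it assumes that multiple values show up comma-separated in one field, as in the `high,20` and `medium,high` examples.

```python
def check_penetrance(value: str) -> list:
    """Return a list of problems found in one penetrance field (empty if none)."""
    problems = []
    # Catches things like a typographic dash pasted into a range.
    if not value.isascii():
        problems.append("non-ASCII character")
    # Assumption: more than one value appears as a comma-separated string.
    if "," in value:
        problems.append("more than one value")
    return problems


if __name__ == "__main__":
    for example in ["high", "high,20", "75\u201380"]:
        print(repr(example), check_penetrance(example))
```

Note that a check like this has to run on the raw field before any step that strips non-ASCII characters, otherwise a value such as 75–80 silently becomes 7580.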

kimrutherford commented 7 months ago

What priority is this issue?

ValWood commented 7 months ago

Not super high because I think everything was fixed manually. But always good to do the QC items. Probably medium, but slot it in somewhere if it's quick

kimrutherford commented 7 months ago

I've had a look at penetrance values that are currently used so I know what to check for.

We have:

Where NUMBER is mostly an integer but there are a few weird ones like 48.98 and <98.2

ValWood commented 7 months ago

I think 48.98 and <98.2 are valid, penetrance is a percentage (although these are slightly weird, we should probably round them!).
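To illustrate what accepting these as valid percentages could look like, here is a hedged Python sketch of a test for the NUMBER form only. The name `is_percent_number` and the 0 to 100 range limit are assumptions; the only prefix allowed is `<`, because that is the only one mentioned in this thread. Named values such as `high` would need a separate check.

```python
import re

# An optional "<" followed by an integer or a decimal number.
NUMBER_RE = re.compile(r"<?\d+(\.\d+)?")


def is_percent_number(value: str) -> bool:
    """True if value looks like a percentage such as 20, 48.98 or <98.2."""
    if NUMBER_RE.fullmatch(value) is None:
        return False
    # Assumption: a penetrance percentage must lie between 0 and 100.
    return 0 <= float(value.lstrip("<")) <= 100


if __name__ == "__main__":
    for example in ["20", "48.98", "<98.2", "high,20", "7580"]:
        print(example, is_percent_number(example))
```

The range limit also rejects a value like 7580, which is what a range is reduced to once its dash has been stripped.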

kimrutherford commented 7 months ago

I've implemented that check now. I'll make sure it's OK in the next nightly load.

kimrutherford commented 7 months ago

> although these are slightly weird, we should probably round them

There are only 25 penetrance values with 2 decimal places. If you think they are worth fixing I can sort out PMID:25210736 which is a PHAF file. The other 4 values are from Canto.

pmid penetrance
PMID:11780129 0.61
PMID:25210736 64.00
PMID:25210736 28.11
PMID:25210736 61.19
PMID:25210736 39.57
PMID:25210736 48.98
PMID:25210736 31.30
PMID:25210736 63.69
PMID:25210736 36.98
PMID:25210736 38.94
PMID:25210736 52.76
PMID:25210736 48.05
PMID:25210736 48.03
PMID:25210736 45.64
PMID:25210736 44.38
PMID:25210736 54.82
PMID:25210736 39.86
PMID:25210736 21.09
PMID:25210736 38.76
PMID:25210736 56.49
PMID:25210736 43.65
PMID:25210736 56.29
PMID:25993311 99.02
PMID:35658118 9.32
PMID:9398669 99.56

kimrutherford commented 7 months ago

> I can sort out PMID:25210736 which is a PHAF file.

I got that wrong. There is a PHAF file for PMID:25210736 but that's not where those penetrance values are recorded. They are double mutants in this session: https://curation.pombase.org/pombe/curs/08c96f6f44e500f7/ro

I'm happy to fix that session if you like.

ValWood commented 7 months ago

I think it makes sense to fix that session.

kimrutherford commented 7 months ago

I've rounded the values in the PMID:25210736 session to 1 decimal place. I was tempted to round them to 0 decimal places since they are quite approximate values. Let me know if I should.

ValWood commented 7 months ago

Yes I think so, the decimal places are a bit crazy here.

kimrutherford commented 7 months ago

OK, I've rounded them to 0 decimal places.

kimrutherford commented 7 months ago

Hi @ValWood

Could you have a look at these?

pmid value session
PMID:11780129 0.61 https://curation.pombase.org/pombe/curs/8ddb4e6192c755eb
PMID:25993311 99.02 https://curation.pombase.org/pombe/curs/536dc2e074eee139
PMID:35658118 9.32 https://curation.pombase.org/pombe/curs/e1f4d0eca71f1467
PMID:9398669 99.56 https://curation.pombase.org/pombe/curs/9ef4b740616f8870

ValWood commented 7 months ago

For this one (has_penetrance 0.61%) I will make it 0.6, but we need the decimal place for some very low-incidence chromosome segregation phenotypes (in WT, errors are even closer to zero).

ValWood commented 7 months ago

OK, I rounded to one decimal place. There were more in the sessions. Most I rounded to no decimal places.

I only kept a decimal place for those which were between 0-1 % and 99-100 %
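The rounding rule described here could be sketched in Python as follows. The values in this thread were edited by hand in the curation sessions, so this is only an illustration. The function name `round_penetrance`, the half-up rounding mode and the treatment of exactly 1 and 99 as whole-number cases are assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_penetrance(value: str) -> str:
    """Keep one decimal place below 1% or above 99%, otherwise round to a whole number."""
    pct = Decimal(value)
    # Values that are already whole numbers never gain a decimal place.
    if pct == pct.to_integral_value():
        return str(pct.quantize(Decimal("1")))
    # Assumption: the boundaries 1 and 99 themselves count as "middle" values.
    if pct < 1 or pct > 99:
        step = Decimal("0.1")
    else:
        step = Decimal("1")
    return str(pct.quantize(step, rounding=ROUND_HALF_UP))


if __name__ == "__main__":
    for example in ["0.61", "9.32", "48.98", "64.00"]:
        print(example, "->", round_penetrance(example))
```

Using `Decimal` on the original string avoids binary floating-point surprises when rounding values that sit exactly on a half.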

kimrutherford commented 7 months ago

> I only kept a decimal place for those which were between 0-1 % and 99-100 %

Thanks. That makes sense.

I think this is done now.