
disease gene mappings #3033

Closed ValWood closed 3 years ago

ValWood commented 3 years ago

After I closed this ticket (~7 June), https://github.com/pombase/curation/issues/3018, the number of disease genes was 1346. It has now dropped to 1329. Could I get the lists of disease genes from 9/10/11 June so I can see what is missing?

ValWood commented 3 years ago

I only have these two errors to fix, which will bring the total to 1331:

needed 6 columns, got 1 - ignoring line 50
can't load annotation for SPCC1183.03c, line 1327 - MONDO:0009245 not found in database
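For reference, a column-count problem like the first message can be found before the load with a small check. This is only a sketch: it assumes the input is tab-separated and takes the expected count of 6 from the log message above. The function name `find_short_lines` is made up for illustration and is not part of the PomBase loader.

```python
import io

# Assumption: 6 is taken from the "needed 6 columns" log message.
EXPECTED_COLUMNS = 6


def find_short_lines(handle, expected=EXPECTED_COLUMNS):
    """Yield (line_number, column_count) for rows with too few tab-separated columns."""
    for line_number, line in enumerate(handle, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < expected:
            yield line_number, len(fields)


# Small in-memory example: the second row has only one column.
sample = io.StringIO("a\tb\tc\td\te\tf\nbroken\n")
short_lines = list(find_short_lines(sample))
```

Passing an open file handle instead of the `io.StringIO` object would report the same information for a real input file.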

kimrutherford commented 3 years ago

the number of disease genes was 1346.

That doesn't match what I see when I look at older database versions.

I get:

I've attached a zip file with the disease gene lists from 6th, 7th and 9th. Those are the dates when the nightly load started so the main site would have those changes on the mornings of 7th, 8th and 10th. If you closed that issue ~7 june then you would have been looking at the data from the 6th. I've temporarily switched http://dev.pombase.kmr.nz/ to be the update from the 6th in case you need to dig deeper.

disease_genes-issue-1717.zip

ValWood commented 3 years ago

I'm a bit baffled.

This is the difference from the 7th to the current list: https://www.pombase.org/results/from/id/71ef7397-3093-47e1-8063-21b26c3ed684

The first two, phs1 and rps20, do not have a disease in OMIM, only 'variant of unknown significance', so I am not sure which disease they were mapped to, and I don't remember removing anything.

kimrutherford commented 3 years ago

I'm a bit baffled.

Me too.

I am not sure which disease they were mapped to

You can look them up on the dev site which is currently on the version from 2021-06-07: http://dev.pombase.kmr.nz/

In case it helps, here are the genes and disease associations, straight from Chado, from 2021-06-07 and 2021-06-23. Let me know if you'd like a different format. The lines in both files are sorted, so you should be able to diff them.

gene_and_disease-2021-06-07.tsv.txt gene_and_disease-2021-06-23.tsv.txt
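If a command-line diff isn't handy, the two files can also be compared with a few lines of Python. This is a rough sketch that treats each line as an opaque string, so it makes no assumption about the column layout. The function names are made up for illustration.

```python
def compare_lines(old_lines, new_lines):
    """Return (removed, added): lines only in the old input and lines only in the new one."""
    old = {line.rstrip("\n") for line in old_lines}
    new = {line.rstrip("\n") for line in new_lines}
    return sorted(old - new), sorted(new - old)


def compare_files(old_path, new_path):
    """Convenience wrapper for two files on disk (assumed to be UTF-8 text)."""
    with open(old_path, encoding="utf-8") as old_handle, \
            open(new_path, encoding="utf-8") as new_handle:
        return compare_lines(old_handle, new_handle)
```

For the attachments above this would be called as `compare_files("gene_and_disease-2021-06-07.tsv.txt", "gene_and_disease-2021-06-23.tsv.txt")`; the first returned list holds the associations that disappeared.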

ValWood commented 3 years ago

These have gone, but I didn't see a log error:

congenital fiber-type disproportion myopathy   | act1, cdc8, myo2, myp2, phs1, rlc1

ValWood commented 3 years ago

rps20 MONDO:0018604 | familial colorectal cancer type X | MalaCards | Rappaport N et al. (2017)

chz1, jmj2, lid2, msc1, rps401, rps402, rps403, sum3 MONDO:0010767 | spermatogenic failure, Y-linked, 2 | MalaCards | Rappaport N et al. (2017)

Diamond-Blackfan anemia   | rpl1801, rpl1802, rpl35, rps2201, rps2202

cryptogenic multifocal ulcerous stenosing enteritis   | plb1, SPAC1786.02, SPAC1A6.03c, SPAC977.09c, SPBC1348.10c, SPCC1450.09c

ValWood commented 3 years ago

dentin dysplasia type I   | vps4

acetyl-coa carboxylase deficiency   | cut6

Mobius syndrome   | rev3

gliosarcoma   | alp7 (this one looks familiar, I think I fixed that one last night)

ValWood commented 3 years ago

autosomal recessive spastic paraplegia type 60   | bun107

split hand or/and split foot malformation   | ede1, end3, irs4, ucp8

Leigh syndrome with leukodystrophy   | aim22, coq11, etp1, fmt1, pda1, sdh1, shy1, tac1, SPAC11E3.12, SPBC18E5.10, SPCC417.16

ValWood commented 3 years ago

spinocerebellar ataxia type 18   | SPBC20F10.03

myelodysplastic syndrome   | git5, idp1, ras1, uaf2

ValWood commented 3 years ago

Charcot-Marie-Tooth disease (MONDO:0015626) direct annotations: ala1, atp6, grs1, jnm1, sac3, sac32

ValWood commented 3 years ago

familial Alzheimer disease   | brc1, hob1, iph1, map1, mbx1, pef1, yap18

precursor T-cell acute lymphoblastic leukemia   | ccp1, not3, nup146, sum3, yap18

velocardiofacial syndrome   | bis1

ValWood commented 3 years ago

autosomal dominant non-syndromic intellectual disability (MONDO:0015802)

large list, only

SPAC1851.03 | ckb1 | CK2 family regulatory subunit Ckb1
SPBC2G5.02c | ckb2 | CK2 regulatory subunit beta isoform 2, Ckb2

probably need remapping

cytochrome-c oxidase deficiency disease (MONDO:0009068)

cfh4, coa3, cox1, cox10, cox12, cox14, cox2, cox20, cox3, cox6, pet117, sco1, shy1, tac1

only pet117 is not remapping

ValWood commented 3 years ago

fatal infantile hypertrophic cardiomyopathy due to mitochondrial complex I deficiency   | SPAC9E9.15

ValWood commented 3 years ago

thrombocytopenia 2   | ppk18, although this mapping looks incorrect because the disease is defined as

"An autosomal dominant disorder caused by mutation(s) in the ANKRD26 gene, encoding ANKRD26 protein."

and this isn't the ortholog.

kimrutherford commented 3 years ago

I think I've tracked down the problem but I don't understand it.

I noticed that all the missing genes I looked at were towards the end of the malacards data file. After a bit of debugging, I found that if I remove this line (which has a non-ASCII character), the whole file loads without a problem:

attenuated chédiak-higashi syndrome    attenuated_chediak_higashi_syndrome     Attenuated Chediak-Higashi Syndrome     LYST

I don't know why this has happened when it's been working well for months. I must have changed something but I can't think what. I'll dig in tomorrow. I'll also add better error checking for the malacards loader.

In the meantime, I've removed that line as it doesn't have a MONDO ID. Hopefully tonight's load will be better.
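For anyone wanting to check the data file for similar lines, something like this would list them. It's only a sketch, not the loader's actual error checking: the function name is made up, and it assumes the file has already been decoded into text lines.

```python
def find_non_ascii_lines(lines):
    """Yield (line_number, text) for every line containing a non-ASCII character."""
    for line_number, line in enumerate(lines, start=1):
        if not line.isascii():  # str.isascii() needs Python 3.7 or later
            yield line_number, line.rstrip("\n")


# Example using the line quoted above (columns shortened for brevity).
example = [
    "plain ascii line\tLYST\n",
    "attenuated chédiak-higashi syndrome\tLYST\n",
]
flagged = list(find_non_ascii_lines(example))
```

Here `flagged` contains only the second line, together with its line number.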

ValWood commented 3 years ago

These are the rest:

SPBC1709.11c png2 ING family histone acetyltransferase complex PHD-type zinc finger subunit Png2
SPAC607.04 arg82 inositol polyphosphate kinase Arg82 (predicted)
SPBP4H10.11c lcf2 long-chain-fatty-acid-CoA ligase
SPCC794.12c mae2 malic enzyme, malate dehydrogenase (oxaloacetate decarboxylating), Mae2
SPAC24H6.01c gup1 membrane bound O-acyltransferase, MBOAT Gup1 (predicted)
SPBC21D10.11c nfs1 mitochondrial [2Fe-2S] cluster assembly and tRNA modification cysteine desulfurase Nfs1
SPBC17A3.07 pgr1 mitochondrial glutathione reductase Pgr1
SPAC23H3.08c bub3 mitotic spindle checkpoint WD repeat protein Bub3
SPAC750.08c NAD-dependent malic enzyme (predicted), partial
SPBC337.08c ubi4 protein modifier, ubiquitin
SPNCRNA.82 mrp1 RNAse MRP
SPNCRNA.214 ter1 telomerase RNA
SPCC1919.14c bdp1 transcription factor TFIIIB complex subunit Bdp1 (predicted)
SPBC25H2.07 tif11 translation initiation factor eIF1A
SPAP8A3.12c tpp2 tripeptidyl-peptidase II Tpp2

Once the troubleshooting is done, could you transfer this ticket to the curation tracker? I suspect a few of these really do need remapping.

kimrutherford commented 3 years ago

Much better! There are now 1383 disease genes: https://www.pombase.org/results/from/id/58ea50d0-d07d-4933-89d8-fca9adf2f2cf

kimrutherford commented 3 years ago

I've fixed the code so this won't happen again. It was caused by changes I made two weeks ago to get the loading working on my desktop after an upgrade.

ValWood commented 3 years ago

Perfect! Not far from 1400, and I have a few more up my sleeve ;) (I have been working through a list from Alliance/cerevisiae; most are not causal, but the hit rate is about 20% and I have about 40 left to check...)

ValWood commented 3 years ago

All present and correct.....