pombase / curation

PomBase curation

Unknowns figure: comparative histogram pombe/cerevisiae/human #1960

Closed ValWood closed 6 years ago

ValWood commented 6 years ago

follow on from https://github.com/pombase/curation/issues/1831

We should add

~defense response: we should maybe include this (or a descendant) because it seems to be independent of the immune response... I need to look into this one further~

Good job !

ValWood commented 6 years ago

These 2 terms, which were not slimmed, I have a feeling are transposon derived? I thought these were filtered? @Antonialock could you look into this?

DNA integration: Q9QC07, Q9NXP7, Q9P2P1
DNA biosynthetic process: P63128, P04053, Q6UWI2, Q9QC07
nucleic acid-templated transcription: C9JCN9, Q3ZLR7, Q9UHA2

If so, is there a way to filter (i.e. does UniProt need to update their gene set)?

Antonialock commented 6 years ago

No I'm not happy to argue against transposon annotations. They have been shown to play a role in human biology. I included them in my slim at first and then you asked me to remove the terms.

Antonialock commented 6 years ago

Also I had added defense response, which you told me to remove :-) Which one is it?

Why do you think "inositol phosphate metabolic process" is a "biologically informative term"? It could be part of both signaling and macromolecule metabolism, which are very different things?

ValWood commented 6 years ago

No I'm not happy to argue against transposon annotations.

This isn't about "transposon annotation" but about including transposon-encoding "genes" in the human protein set. In the pombe protein set we are excluding transposons. In the S. cerevisiae dataset we are excluding transposons. We should therefore exclude transposons from the human set (this makes sense, because different organisms have different numbers of transposons and this can grossly affect the comparisons).

They are relevant to an organism's biology (they are clearly important for evolution), but in this instance we are trying to compare like-for-like and that excludes "transposons". If we don't do this "normalization", we have over-inflated numbers for processes which are shared with transposons.

These UniProt entries appeared to be transposon derived (I could not quite work out what they were). Perhaps they aren't, it would be good to check.

ValWood commented 6 years ago

Also I had added defense response, which you told me to remove :-) Which one is it?

My fault. I can't remember that, but I probably thought immune response would cover it, I didn't know it was a completely separate thing!

ValWood commented 6 years ago

Why do you think "inositol phosphate metabolic process" is a "biologically informative term"?

This one is a bit subtle. The genes which are involved in IP metabolism could be involved in either/or signaling and macromolecule metabolism, but they are well characterised genes...

I think people would find it odd to see inositol-1,5-bisdiphosphate-2,3,4,6-tetrakisphosphate 1-diphosphatase activity for example (this might not be a good example), classified as an unknown?

...they probably always do both, depending on context? so although this annotation is not very specific, I think we have to say that we know enough here to know that anything annotated to this particular term is "known"?

This is not the same as for, say, "cell growth" or "cell proliferation" or even any of the "modification" terms we omitted, which are always "context dependent". I think the distinction is that, say for a protein kinase, any individual instance could be involved in any specific process (so it is not specific), but these "IP metabolism" activities are involved in both metabolism and signalling.

ValWood commented 6 years ago

i.e. for the IP molecules the gene products are multifunctional... I think that's the difference... does that make sense?

ValWood commented 6 years ago

This isn't about "transposon annotation" it is about including transposon encoding "genes", in the human protein set.

To clarify: I'm not saying that UniProt should not include transposons, but that it should be possible to access a protein set that excludes them. I thought that was what we were using? The above genes might have slipped through. Otherwise they seem to be annotated as if they are transposons, so this needs to be queried (or, that was my initial interpretation from the descriptions and the GO annotation).

ValWood commented 6 years ago

A reason for excluding transposons: https://github.com/geneontology/go-annotation/issues/1869 look at the second table. I think the numbers for DNA recombination should be equivalent based on endogenous proteins. If you include transposons you would not see this, because all transposons are annotated to "DNA recombination" because of their integrase. That is probably why human "recombination" is higher (I didn't look into it yet, it's a guess).

There are likely to be other annotations to transposons that we would want to exclude to normalize numbers of annotations to processes.

This has nothing to do with the fact that they affect human biology, it just isn't what we are trying to show which is basically annotation coverage for non-transposon proteins.

ValWood commented 6 years ago

My fault. I can't remember that, but I probably thought immune response would cover it, I didn't know it was a completely separate thing!

In fact I started writing a really long SF ticket about this until I realised "oh right they aren't necessarily related".....

ValWood commented 6 years ago

These UniProt entries appeared to be transposon derived (I could not quite work out what they were). Perhaps they aren't, it would be good to check.

My other suspicion was that these were not "transposon derived" but were incorrectly annotated because they were something else. I hope you will see the same when you look at them....

ValWood commented 6 years ago

I will leave defense response out, and recommend this as not for direct annotation! I see you included children, it is non-specific!

ValWood commented 6 years ago

Currently

cerevisiae: 5915 genes

217 not annotated in the slim. I have been through this list and I am satisfied that most really are no process (all are modification, response to or oxidation-reduction process). In fact most of these (over 100) have ND for process at SGD.

794 with no non-root annotation.

Using our criteria these are ALL unknown (1011/5915).

pombe: Your input list contains 5070 genes

These 9 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim: SPAC6G9.13c SPBC2G2.09c SPBP4G3.02 SPBC20F10.10 SPBC1347.11 SPAC2G11.15c SPCC1795.09 SPAC1002.07c SPBP4H10.17c (looking into this but all are known)

These 734 identifiers had no non-root annotations: so for the sake of argument I’ll include the 9 unmapped in unknown since I did this for S.c so for pombe the equivalent is 743/5070 (this will go down with the next GO update, and when I include C-terminal protein lipidation)

I will do the final update next week for all 3....
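The three-way split used in the counts above (annotated in the slim, non-root annotation outside the slim, root only) can be sketched roughly as follows. This is a minimal illustration, not the GO Term Mapper implementation: it assumes each gene's annotations have already been propagated to ancestor terms, and the gene names and the choice of slim term are hypothetical.

```python
from collections import Counter

# Root of the GO biological_process branch.
BP_ROOT = "GO:0008150"


def classify(annotated_terms, slim_terms):
    """Classify one gene by its biological process annotations.

    annotated_terms: set of GO IDs for the gene, assumed to already
    include ancestor terms (annotations propagated up the ontology).
    A gene with no annotations at all is treated like a root-only gene.
    """
    non_root = annotated_terms - {BP_ROOT}
    if non_root & slim_terms:
        return "slimmed"
    if non_root:
        return "non_root_not_in_slim"
    return "root_only"


def unknown_fraction(genes, slim_terms):
    """Count genes per category and return (counts, fraction unknown).

    Following the thread, both 'non_root_not_in_slim' and 'root_only'
    are counted as unknown.
    """
    counts = Counter(classify(terms, slim_terms) for terms in genes.values())
    unknown = counts["non_root_not_in_slim"] + counts["root_only"]
    return counts, unknown / len(genes)


# Hypothetical genes; the GO IDs are only illustrative.
example_slim = {"GO:0006310"}  # DNA recombination
example_genes = {
    "geneA": {BP_ROOT, "GO:0006310"},  # covered by the slim
    "geneB": {BP_ROOT, "GO:0055114"},  # non-root term outside the slim
    "geneC": {BP_ROOT},                # root annotation only
    "geneD": set(),                    # no annotation at all
}

example_counts, example_fraction = unknown_fraction(example_genes, example_slim)
print(dict(example_counts), example_fraction)
```

Whether the "non-root but not in the slim" genes really belong with the unknowns is the judgment call discussed in this ticket; the sketch only encodes the choice made for these counts.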

ValWood commented 6 years ago

So @Antonialock your final task for this part:

Antonialock commented 6 years ago

nucleic acid-templated transcription: C9JCN9, Q3ZLR7, Q9UHA2. FP inference from "transcription activator/cofactor/repressor activity". No viruses annotated to this MF... but I guess there could be viral proteins that act as transcriptional cofactors?

P04053 DNA biosynthesis looks ok, it does nontemplated addition of nucleotides to exons (immune system development)

Q9P2P1- no idea about this one "The gene encoding this protein may have arisen from the fusion of a cellular gene with retroviral sequences prior to the marsupial-eutherian split. Sequence and structural analyses suggest that the integrase catalytic domain is inactive."

rest look like viral stuff, can look more in detail tomorrow and try and figure out how many are in the complete list.

Antonialock commented 6 years ago

So @Antonialock your final task for this part:

document here which human set was used, and check up on the transposon issue (are they in the dataset or out, they should be excluded for our purposes?)

The human list I used is still the uniprot curated 1:1 list of human genes that are "believed to exist" retrieved by using the search: NOT existence:uncertain AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]" AND proteome:up000005640 in UniProt
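For the record, the search string above can be assembled from its individual clauses, which makes it easier to see exactly what is being filtered. A minimal sketch: the helper names are hypothetical, and the URL prefix is the UniProt search URL form that appears later in this thread, which may not match the current UniProt website.

```python
from urllib.parse import quote_plus

# Clauses of the human gene set query quoted in this comment.
BASE_CLAUSES = [
    "NOT existence:uncertain",
    "reviewed:yes",
    'organism:"Homo sapiens (Human) [9606]"',
    "proteome:up000005640",
]


def human_gene_query(clauses=BASE_CLAUSES):
    """Join the clauses with AND to give the search string."""
    return " AND ".join(clauses)


def search_url(query):
    """URL-encode the query (assumed legacy UniProt URL layout)."""
    return "http://www.uniprot.org/uniprot/?query=" + quote_plus(query)


query = human_gene_query()
print(query)
print(search_url(query))
```

Keeping the clauses in a list means an extra filter can be added as one more entry rather than by editing the string by hand.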

Antonialock commented 6 years ago

so there are 88 genes that contain "retrovir" in the protein name, and 28 that contain "retrotransp" in the protein name = 0.6%
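A name-based count like this can be reproduced with a simple case-insensitive substring match. The sketch below uses a handful of protein names taken from this thread as sample input. The percentage assumes a list of about 19,700 genes (the figure quoted later in the thread) and that no name matches both fragments; neither is stated in this comment.

```python
def count_matching(names, fragment):
    """Count protein names containing the fragment, ignoring case."""
    fragment = fragment.lower()
    return sum(fragment in name.lower() for name in names)


# A few protein names that appear in this thread (sample only).
sample_names = [
    "LINE-1 retrotransposable element ORF1 protein",
    "LINE-1 type transposase domain-containing protein 1",
    "Gypsy retrotransposon integrase-like protein 1",
    "Endogenous Bornavirus-like nucleoprotein 1",
    "Endogenous retrovirus group K member 10 Gag polyprotein",
    "Retrotransposon Gag-like protein",
]

retrovir_hits = count_matching(sample_names, "retrovir")
retrotransp_hits = count_matching(sample_names, "retrotransp")
print(retrovir_hits, retrotransp_hits)

# Counts reported in the comment; the total is an assumption (see above).
ASSUMED_TOTAL_GENES = 19700
percent_flagged = 100 * (88 + 28) / ASSUMED_TOTAL_GENES
print(round(percent_flagged, 1))
```

Note that "Bornavirus" and "transposase" do not match either fragment, so a name-based filter of this kind can miss entries that are still transposon or virus derived.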

ValWood commented 6 years ago

so if there are fewer than 100 transposons in there I guess we can ignore them.... However, if we want to do "comparative slims" we should exclude them, as they will "over-inflate" annotations to transposon related processes (like recombination).

Don't UniProt provide a list without transposons? I thought when you asked on that GO thread that ex transposons was one of the options?

ValWood commented 6 years ago

I agree the ones you checked look OK....why did I flag those? not sure...these are clearly endogenous genes

Antonialock commented 6 years ago

ok I deleted 71 genes

I kept the ones with description "Retrotransposon Gag-like protein" because they are described as being derived from retrotransposons but now have actual functions, e.g. see Q5HYW3, and some others that seemed to be gene fusions etc. with retrotransposons but now are "everyday" functional things

Deleted:

Q9UN81 LORF1_HUMAN LINE-1 retrotransposable element ORF1 protein (L1ORF1p) (LINE retrotransposable element 1) (LINE1 retrotransposable element 1) L1RE1 LRE1
O00370 LORF2_HUMAN LINE-1 retrotransposable element ORF2 protein (ORF2p) [Includes: Reverse transcriptase (EC 2.7.7.49); Endonuclease (EC 3.1.21.-)]
Q5T7N2 LITD1_HUMAN LINE-1 type transposase domain-containing protein 1 (ES cell-associated protein 11) L1TD1 ECAT11
Q9NXP7 GIN1_HUMAN Gypsy retrotransposon integrase-like protein 1 (GIN-1) (Ty3/Gypsy integrase 1) (Zinc finger H2C2 domain-containing protein) GIN1 TGIN1 ZH2C2
P0CF75 EBLN1_HUMAN Endogenous Bornavirus-like nucleoprotein 1 (Endogenous Borna-like N element-1) (EBLN-1) EBLN1
Q6P2I7 EBLN2_HUMAN Endogenous Bornavirus-like nucleoprotein 2 (Endogenous Borna-like N element-2) (EBLN-2) EBLN2 GK006
Q14264 ENR1_HUMAN Endogenous retrovirus group 3 member 1 Env polyprotein (ERV-3 envelope protein) (ERV3 envelope protein) (ERV3-1 envelope protein) (Envelope polyprotein) (HERV-R envelope protein) (ERV-R envelope protein) (HERV-R_7q21.2 provirus ancestral Env polyprotein) [Cleaved into: Surface protein (SU); Transmembrane protein (TM)] ERV3-1 ERV3
P60507 EFC1_HUMAN Endogenous retrovirus group FC1 Env polyprotein (Envelope polyprotein) (Fc1env) (HERV-F(c)1_Xq21.33 provirus ancestral Env polyprotein) (HERV-Fc1env) [Cleaved into: Surface protein (SU); Transmembrane protein (TM)] ERVFC1
P60608 EFC2_HUMAN Endogenous retrovirus group FC1 member 1 Env polyprotein (Envelope polyprotein) (Fc2deltaenv) (HERV-F(c)2_7q36.2 provirus ancestral Env polyprotein) [Includes: Surface protein (SU); Truncated transmembrane protein (TM)] ERVFC1-1
P87889 GAK10_HUMAN Endogenous retrovirus group K member 10 Gag polyprotein (HERV-K10 Gag protein) (HERV-K107 Gag protein) (HERV-K_5q33.3 provirus ancestral Gag polyprotein) (Gag polyprotein) ERVK-10
P61580 NP10_HUMAN Endogenous retrovirus group K member 10 Np9 protein (HERV-K10 Np9 protein) (HERV-K107 Np9 protein) (HERV-K_5q33.3 provirus Np9 protein) ERVK-10
P10266 POK10_HUMAN Endogenous retrovirus group K member 10 Pol protein (HERV-K10 Pol protein) (HERV-K107 Pol protein) (HERV-K_5q33.3 provirus ancestral Pol protein) [Includes: Reverse transcriptase (RT) (EC 2.7.7.49); Ribonuclease H (RNase H) (EC 3.1.26.4); Integrase (IN)] ERVK-10
P10265 VPK10_HUMAN Endogenous retrovirus group K member 10 Pro protein (HERV-K10 Pro protein) (HERV-K107 Pro protein) (HERV-K_5q33.3 provirus ancestral Pro protein) (EC 3.4.23.50) (Protease) (Proteinase) (PR) ERVK-10
P63124 VPK04_HUMAN Endogenous retrovirus group K member 104 Pro protein (HERV-K104 Pro protein) (HERV-K_5q13.3 provirus ancestral Pro protein) (EC 3.4.23.50) (Protease) (Proteinase) (PR) HERV-K104
P61576 REC04_HUMAN Endogenous retrovirus group K member 104 Rec protein (HERV-K104 Rec protein) (HERV-K_5q13.3 provirus Rec protein) HERV-K104
Q9UQG0 POK11_HUMAN Endogenous retrovirus group K member 11 Pol protein (HERV-K_3q27.2 provirus ancestral Pol protein) [Includes: Reverse transcriptase (RT) (EC 2.7.7.49); Ribonuclease H (RNase H) (EC 3.1.26.4); Integrase (IN)] ERVK-11
Q902F9 EN113_HUMAN Endogenous retrovirus group K member 113 Env polyprotein (EnvK5 protein) (Envelope polyprotein) (HERV-K113 envelope protein) (HERV-K_19p13.11 provirus ancestral Env polyprotein) [Cleaved into: Surface protein (SU); Transmembrane protein (TM)] HERVK_113
P62684 GA113_HUMAN Endogenous retrovirus group K member 113 Gag polyprotein (HERV-K113 Gag protein) (HERV-K_19p13.11 provirus ancestral Gag polyprotein) (Gag polyprotein) HERVK_113
P63132 PO113_HUMAN Endogenous retrovirus group K member 113 Pol protein (HERV-K113 Pol protein) (HERV-K_19p13.11 provirus ancestral Pol protein) [Includes: Reverse transcriptase (RT) (EC 2.7.7.49); Ribonuclease H (RNase H) (EC 3.1.26.4); Integrase (IN)] HERVK_113
P63121 VP113_HUMAN Endogenous retrovirus group K member 113 Pro protein (HERV-K113 envelope protein) (HERV-K_19p13.11 provirus ancestral Pro protein) (EC 3.4.23.50) (Protease) (Proteinase) (PR) HERVK_113
P61574 RE113_HUMAN Endogenous retrovirus group K member 113 Rec protein (HERV-K113 Rec protein) (HERV-K_19p13.11 provirus Rec protein) HERVK_113
Q9NX77 ENK13_HUMAN Endogenous retrovirus group K member 13-1 Env polyprotein (Envelope polyprotein) (HERV-K_16p13.3 provirus ancestral Env polyprotein) [Cleaved into: Surface protein (SU); Transmembrane protein (TM)] ERVK13-1
P61578 REC16_HUMAN Endogenous retrovirus group K member 16 Rec protein (HERV-K_10p14 provirus Rec protein) ERVK-16
O42043 ENK18_HUMAN Endogenous retrovirus group K member 18 Env polyprotein (Envelope polyprotein) (HERV-K(C1a) envelope protein) (HERV-K110 envelope protein) (HERV-K18 envelope protein) (HERV-K18 superantigen) (HERV-K_1q23.3 provirus ancestral Env polyprotein) (IDDMK1,2 22 envelope protein) (IDDMK1,2 22 superantigen) [Cleaved into: Surface protein (SU); Transmembrane protein (TM)] ERVK-18
Q9QC07 POK18_HUMAN Endogenous retrovirus group K member 18 Pol protein (HERV-K(C1a) Pol protein) (HERV-K110 Pol protein) (HERV-K18 Pol protein) (HERV-K_1q23.3 provirus ancestral Pol protein) [Includes: Reverse transcriptase (EC 2.7.7.49); Ribonuclease H (RNase H) (EC 3.1.26.4)] ERVK-18
P63123 VPK18_HUMAN Endogenous retrovirus group K member 18 Pro protein (HERV-K(C1a) Pro protein) (HERV-K110 Pro protein) (HERV-K18 Pro protein) (HERV-K_1q23.3 provirus ancestral Pro protein) (EC 3.4.23.50) (Protease) (Proteinase) (PR) ERVK-18
O71037 ENK19_HUMAN Endogenous retrovirus group K member 19 Env polyprotein (EnvK3 protein) (Envelope polyprotein) (HERV-K(C19) envelope protein) (HERV-K_19q11 provirus ancestral Env polyprotein) [Cleaved into: Surface protein (SU); Transmembrane protein (TM)] ERVK-19
Q9YNA8 GAK19_HUMAN Endogenous retrovirus group K member 19 Gag polyprotein (HERV-K(C19) Gag protein) (HERV-K_19q11 provirus ancestral Gag polyprotein) (Gag polyprotein) ERVK-19
Q9WJR5 POK19_HUMAN Endogenous retrovirus group K member 19 Pol protein (HERV-K(C19) Pol protein) (HERV-K_19q11 provirus ancestral Pol protein) [Includes: Reverse transcriptase (RT) (EC 2.7.7.49); Ribonuclease H (RNase H) (EC 3.1.26.4); Integrase (IN)] ERVK-19
P63120 VPK19_HUMAN Endogenous retrovirus group K member 19 Pro protein (HERV-K(C19) Pro protein) (HERV-K_19q12 provirus ancestral Pro protein) (EC 3.4.23.50) (Protease) (Proteinase) (PR) ERVK-19
P61572 REC19_HUMAN Endogenous retrovirus group K member 19 Rec protein (HERV-K(C19) Rec protein) (HERV-K_19q11 provirus Rec protein) ERVK-19
P61565 ENK21_HUMAN Endogenous retrovirus group K member 21 Env polyprotein (EnvK1 protein) (Envelope polyprotein) (HERV-K_12q14.1 provirus ancestral Env polyprotein) [Cleaved into: Surface protein (SU); Transmembrane protein (TM)] ERVK-21
P62683 GAK21_HUMAN Endogenous retrovirus group K member 21 Gag polyprotein (HERV-K_12q14.1 provirus ancestral Gag polyprotein) (Gag polyprotein) ERVK-21
P63119 VPK21_HUMAN Endogenous retrovirus group K member 21 Pro protein (HERV-K_12q14.1 provirus ancestral Pro protein) (EC 3.4.23.50) (Protease) (Proteinase) (PR) ERVK-21
P61571 REC21_HUMAN Endogenous retrovirus group K member 21 Rec protein (HERV-K_12q14.1 provirus Rec protein) ERVK-21
P61566 ENK24_HUMAN Endogenous retrovirus group K member 24 Env polyprotein (Envelope polyprotein) (HERV-K101 envelope protein) (HERV-K_22q11.21 provirus ancestral Env polyprotein) [Cleaved into: Surface protein (SU); Transmembrane protein (TM)] ERVK-24
P63145 GAK24_HUMAN Endogenous retrovirus group K member 24 Gag polyprotein (HERV-K101 Gag protein) (HERV-K_22q11.21 provirus ancestral Gag polyprotein) (Gag polyprotein) ERVK-24
P61581 NP24_HUMAN Endogenous retrovirus group K member 24 Np9 protein (HERV-K101 Np9 protein) (HERV-K_22q11.21 provirus Np9 protein) ERVK-24
P63129 VPK24_HUMAN Endogenous retrovirus group K member 24 Pro protein (HERV-K101 envelope protein) (HERV-K_22q11.21 provirus ancestral Pro protein) (EC 3.4.23.50) (Protease) (Proteinase) (PR) ERVK-24
P61570 ENK25_HUMAN Endogenous retrovirus group K member 25 Env polyprotein (Envelope polyprotein) (HERV-K_11q22.1 provirus ancestral Env polyprotein) [Cleaved into: Surface protein (SU); Transmembrane protein (TM)] ERVK-25
P63136 POK25_HUMAN Endogenous retrovirus group K member 25 Pol protein (HERV-K_11q22.1 provirus ancestral Pol protein) [Includes: Reverse transcriptase (RT) (EC 2.7.7.49); Ribonuclease H (RNase H) (EC 3.1.26.4); Integrase (IN)] ERVK-25
P63125 VPK25_HUMAN Endogenous retrovirus group K member 25 Pro protein (HERV-K_11q22.1 provirus ancestral Pro protein) (EC 3.4.23.50) (Protease) (Proteinase) (PR) ERVK-25
P61579 ERK25_HUMAN Endogenous retrovirus group K member 25 Rec protein (Endogenous retrovirus group K member 25) (HERV-K_11q22.1 provirus Rec protein) ERVK-25
Q9HDB8 ENK5_HUMAN Endogenous retrovirus group K member 5 Env polyprotein (Envelope polyprotein) (HERV-K(II) envelope protein) (HERV-K_3q12.3 provirus ancestral Env polyprotein) [Includes: Truncated surface protein (SU)] ERVK-5 ERVK5
Q9HDB9 GAK5_HUMAN Endogenous retrovirus group K member 5 Gag polyprotein (HERV-K(II) Gag protein) (HERV-K_3q12.3 provirus ancestral Gag polyprotein) (Gag polyprotein) ERVK-5 ERVK5
P61583 NP5_HUMAN Endogenous retrovirus group K member 5 Np9 protein (Endogenous retrovirus K protein 5) (HERV-K(II) Np9 protein) (HERV-K_3q12.3 provirus Np9 protein) ERVK-5 ERVK5
Q69384 ENK6_HUMAN Endogenous retrovirus group K member 6 Env polyprotein (EnvK2 protein) (Envelope polyprotein) (HERV-K(C7) envelope protein) (HERV-K(HML-2.HOM) envelope protein) (HERV-K108 envelope protein) (HERV-K_7p22.1 provirus ancestral Env polyprotein) [Cleaved into: Surface protein (SU); Transmembrane protein (TM)] ERVK-6 ERVK6
Q7LDI9 GAK6_HUMAN Endogenous retrovirus group K member 6 Gag polyprotein (HERV-K(C7) Gag protein) (HERV-K(HML-2.HOM) Gag protein) (HERV-K108 Gag protein) (HERV-K_7p22.1 provirus ancestral Gag polyprotein) (Gag polyprotein) ERVK-6 ERVK6
Q9BXR3 POK6_HUMAN Endogenous retrovirus group K member 6 Pol protein (HERV-K(C7) Pol protein) (HERV-K(HML-2.HOM) Pol protein) (HERV-K108 Pol protein) (HERV-K_7p22.1 provirus ancestral Pol protein) [Includes: Reverse transcriptase (RT) (EC 2.7.7.49); Ribonuclease H (RNase H) (EC 3.1.26.4); Integrase (IN)] ERVK-6 ERVK6
Q9Y6I0 VPK6_HUMAN Endogenous retrovirus group K member 6 Pro protein (HERV-K(C7) Pro protein) (HERV-K(HML-2.HOM) Pro protein) (HERV-K108 Pro protein) (HERV-K_7p22.1 provirus ancestral Pro protein) (EC 3.4.23.50) (Protease) (Proteinase) (PR) ERVK-6 ERVK6
Q69383 REC6_HUMAN Endogenous retrovirus group K member 6 Rec protein (Central open reading frame) (c-orf) (cORF) (Endogenous retrovirus K protein 6) (HERV-K(C7) Rec protein) (HERV-K(HML-2.HOM) Rec protein) (HERV-K108 Rec protein) (HERV-K_7p22.1 provirus Rec protein) (K-Rev) (Rev-like protein) (Rev/Rex homolog) ERVK-6 ERVK6
P61567 ENK7_HUMAN Endogenous retrovirus group K member 7 Env polyprotein (Envelope polyprotein) (HERV-K(III) envelope protein) (HERV-K102 envelope protein) (HERV-K_1q22 provirus ancestral Env polyprotein) [Cleaved into: Surface protein (SU); Transmembrane protein (TM)] ERVK-7
P63130 GAK7_HUMAN Endogenous retrovirus group K member 7 Gag polyprotein (HERV-K(III) Gag protein) (HERV-K102 Gag protein) (HERV-K_1q22 provirus ancestral Gag polyprotein) (Gag polyprotein) ERVK-7
P61582 NP7_HUMAN Endogenous retrovirus group K member 7 Np9 protein (HERV-K(III) Np9 protein) (HERV-K102 Np9 protein) (HERV-K_1q22 provirus Np9 protein) ERVK-7
P63135 POK7_HUMAN Endogenous retrovirus group K member 7 Pol protein (HERV-K(III) Pol protein) (HERV-K102 Pol protein) (HERV-K_1q22 provirus ancestral Pol protein) [Includes: Reverse transcriptase (RT) (EC 2.7.7.49); Ribonuclease H (RNase H) (EC 3.1.26.4); Integrase (IN)] ERVK-7
P63131 VPK7_HUMAN Endogenous retrovirus group K member 7 Pro protein (HERV-K(III) Pro protein) (HERV-K102 Pro protein) (HERV-K_1q22 provirus ancestral Pro protein) (EC 3.4.23.50) (Protease) (Proteinase) (PR) ERVK-7
Q902F8 ENK8_HUMAN Endogenous retrovirus group K member 8 Env polyprotein (EnvK6 protein) (Envelope polyprotein) (HERV-K115 envelope protein) (HERV-K_8p23.1 provirus ancestral Env polyprotein) [Cleaved into: Surface protein (SU); Transmembrane protein (TM)] ERVK-8
P62685 GAK8_HUMAN Endogenous retrovirus group K member 8 Gag polyprotein (HERV-K115 Gag protein) (HERV-K_8p23.1 provirus ancestral Gag polyprotein) (Gag polyprotein) ERVK-8
P63133 POK8_HUMAN Endogenous retrovirus group K member 8 Pol protein (HERV-K115 Pol protein) (HERV-K_8p23.1 provirus ancestral Pol protein) [Includes: Reverse transcriptase (RT) (EC 2.7.7.49); Ribonuclease H (RNase H) (EC 3.1.26.4); Integrase (IN)] ERVK-8
P63122 VPK8_HUMAN Endogenous retrovirus group K member 8 Pro protein (HERV-K115 Pro protein) (HERV-K_8p23.1 provirus ancestral Pro protein) (EC 3.4.23.50) (Protease) (Proteinase) (PR) ERVK-8
P61575 RECK8_HUMAN Endogenous retrovirus group K member 8 Rec protein (HERV-K115 Rec protein) (HERV-K_8p23.1 provirus Rec protein) ERVK-8
Q9UKH3 ENK9_HUMAN Endogenous retrovirus group K member 9 Env polyprotein (EnvK4 protein) (Envelope polyprotein) (HERV-K(C6) envelope protein) (HERV-K109 envelope protein) (HERV-K_6q14.1 provirus ancestral Env polyprotein) [Cleaved into: Surface protein (SU); Transmembrane protein (TM)] ERVK-9
P63126 GAK9_HUMAN Endogenous retrovirus group K member 9 Gag polyprotein (HERV-K(C6) Gag protein) (HERV-K109 Gag protein) (HERV-K_6q14.1 provirus ancestral Gag polyprotein) (Gag polyprotein) ERVK-9
P63128 POK9_HUMAN Endogenous retrovirus group K member 9 Pol protein (HERV-K(C6) Gag-Pol protein) (HERV-K109 Gag-Pol protein) (HERV-K_6q14.1 provirus ancestral Gag-Pol polyprotein) [Includes: Protease (EC 3.4.23.50) (PR) (Retropepsin); Reverse transcriptase/ribonuclease H (EC 2.7.7.49) (EC 2.7.7.7) (EC 3.1.26.4) (p66 RT)] ERVK-9
P63127 VPK9_HUMAN Endogenous retrovirus group K member 9 Pro protein (HERV-K(C6) Pro protein) (HERV-K109 Pro protein) (HERV-K_6q14.1 provirus ancestral Pro protein) (EC 3.4.23.50) (Protease) (Proteinase) (PR) ERVK-9
P61573 REC9_HUMAN Endogenous retrovirus group K member 9 Rec protein (HERV-K(C6) Rec protein) (HERV-K109 Rec protein) (HERV-K_6q14.1 provirus Rec protein) ERVK-9
Q9H9K5 MER34_HUMAN Endogenous retrovirus group MER34 member 1 Env polyprotein (HERV-MER_4q12 provirus ancestral Env polyprotein) ERVMER34-1 LP9056
P60509 ERB1_HUMAN Endogenous retrovirus group PABLB member 1 Env polyprotein (Endogenous retrovirus group PABLB member 1) (Envelope polyprotein) (HERV-R(b) Env protein) (HERV-R(b)_3p24.3 provirus ancestral Env polyprotein) [Includes: Surface protein domain (SU); Transmembrane protein domain (TM)] ERVPABLB-1
P61550 ENVT1_HUMAN Endogenous retrovirus group S71 member 1 Env polyprotein (Envelope polyprotein) (HERV-T Env protein) (HERV-T_19q13.11 provirus ancestral Env polyprotein) [Includes: Surface protein (SU); Transmembrane protein (TM)] ERVS71-1
B6SEH8 ERVV1_HUMAN Endogenous retrovirus group V member 1 Env polyprotein (HERV-V_19q13.41 provirus ancestral Env polyprotein 1) ERVV-1 ENVV1
B6SEH9 ERVV2_HUMAN Endogenous retrovirus group V member 2 Env polyprotein (HERV-V_19q13.41 provirus ancestral Env polyprotein 2) ERVV-2 ENVV2

Antonialock commented 6 years ago

@ValWood should these ones also go? e.g. http://www.uniprot.org/uniprot/Q96MW7

Antonialock commented 6 years ago

how about http://www.uniprot.org/uniprot/Q9P215 ?

Antonialock commented 6 years ago

and http://www.uniprot.org/uniprot/Q6P3X8

(there are a few of these types)

ValWood commented 6 years ago

yes ideally we should exclude transposon derived, but not those which have evolved_from transposons. if this is too tricky, we can leave them in...

ValWood commented 6 years ago

especially if you are doing this manually, because you will only be deleting the unknown ones (I guess), but not the slimmed ones. Ideally ask UniProt if there is a list of human proteins which excludes transposons (presumably they maintain this list, one would hope?)

Antonialock commented 6 years ago

uniprot suggested removing anything annotated to http://www.uniprot.org/keywords/KW-0814

That removes 71 entries. Most look fine (e.g. the "Endogenous retrovirus group K member..." entries)

however these are also removed, do you want them in? http://www.uniprot.org/uniprot/Q9UQF0 http://www.uniprot.org/uniprot/P60508 http://www.uniprot.org/uniprot/M5A8F1

it does NOT remove these 13 genes (include or exclude?): O00370 P0CF75 Q17RP2 Q4W5G0 Q53EQ6 Q5T7N2 Q6B0B8 Q6NT04 Q6P2I7 Q8IY51 Q96MW7 Q9NXP7 Q9UN81

You can see the full list here: http://www.uniprot.org/uniprot/?query=NOT+existence%3Auncertain+AND+keyword%3A%22Transposable+element+%5BKW-0814%5D%22+AND+reviewed%3Ayes+AND+organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22+AND+proteome%3Aup000005640&sort=score
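To see where the keyword filter and the earlier manual deletion disagree, the two lists can be compared as sets. The sketch below uses only accessions from this thread: the 13 entries the keyword filter leaves in, and the six entries from the manual "Deleted" list that are not named "Endogenous retrovirus". The variable names are hypothetical, and the full keyword hit list is not reproduced here.

```python
# The 13 accessions that the KW-0814 keyword filter does NOT remove.
not_removed_by_keyword = {
    "O00370", "P0CF75", "Q17RP2", "Q4W5G0", "Q53EQ6", "Q5T7N2", "Q6B0B8",
    "Q6NT04", "Q6P2I7", "Q8IY51", "Q96MW7", "Q9NXP7", "Q9UN81",
}

# Entries from the manual "Deleted" list whose names are not
# "Endogenous retrovirus ..." (LINE-1, Gypsy and Bornavirus-like entries).
manually_deleted_other = {
    "Q9UN81", "O00370", "Q5T7N2", "Q9NXP7", "P0CF75", "Q6P2I7",
}

# Deleted by hand earlier, but kept by the keyword filter.
kept_by_keyword_but_deleted_manually = sorted(
    not_removed_by_keyword & manually_deleted_other
)

# Kept by the keyword filter and never deleted by hand: still undecided.
still_undecided = sorted(not_removed_by_keyword - manually_deleted_other)

print(kept_by_keyword_but_deleted_manually)
print(still_undecided)
```

The first list shows where switching from the manual deletion to the keyword filter would put entries back into the set; the second is the group the include-or-exclude question is really about.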

ValWood commented 6 years ago

I wonder if these are called things like "endogenous coat protein family" because they exist in viruses also (but the virus is mirroring the human proteins, the naming is unfortunate!) OR if they are really the actual retroviral component... difficult to know unless the annotation is clear. I would ask UniProt how to get the definitive list (if it is possible), and point out any inconsistencies using the current method...

But for our current purposes it does not matter so much, I think just use the filter they suggested...but ask the question for future refinement.....may as well get the ball rolling in the right direction...

Antonialock commented 6 years ago

how are you getting on with final update. do you need to wait for a fix?

ValWood commented 6 years ago

I need to wait until the annotation set is through to GO term mapper. I didn't have chance to check yet but I will do so before the end of this week.

ValWood commented 6 years ago

which ticket are the current figures in? I can't find them?

Antonialock commented 6 years ago

https://github.com/pombase/curation/issues/1831

ValWood commented 6 years ago

yep I realised it was in the closed ticket. I was looking at open...

ValWood commented 6 years ago

OK,

pombe: 5070 genes
1 ambiguous
16 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim (class as unknown for our purposes, some are from PAINT)
662 identifiers had no non-root annotation (this is because PAINT maps some I guess, so it's a bit lower, will check this)

cerevisiae: 5915 genes
1 identifier was found to be unannotated: YCL054W-A (reported to SGD)
168 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim (I checked these over 3 times now, they are all "unknown process", either functions or PAINT issues)
765 identifiers had no non-root annotations

All using the same slim, using GO term mapper. Today I will put the slim, the gene sets, and the term mapper outputs in a Google Docs folder.

human: 19700 genes
2771 identifiers were found to be unannotated
614 identifiers were not annotated in the slim, but they had non-root annotations that were not in the slim (I think we are happy that these are all really 'process unknown' from previous checks? phosphorylation, response to and the like)
219 identifiers had no non-root annotations

Antonialock commented 6 years ago

also you wrote human 19700 genes - should be 19690

NOT existence:uncertain NOT keyword:"Transposable element [KW-0814]" AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]" AND proteome:up000005640

Antonialock commented 6 years ago

I'm confused by there being 19700 human genes. That's not what you get when retrieving genes with the filter NOT existence:uncertain NOT keyword:"Transposable element [KW-0814]" AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]" AND proteome:up000005640

I tried to rerun the slim but the tool isn't working?

ValWood commented 6 years ago

Weren't there 19700 in the list you gave me?

I also need to wait to get the submitted gaf without the PAINT data from Midori. You can't filter the slim by evidence code...

Antonialock commented 6 years ago

but if you are happy to go with your numbers we can.

could you clarify the number of known / unknown (I don't understand what ambiguous means)? Is it:
pombe unknowns 1+16+662 = 679
cerevisiae unknowns 1+168+765 = 934
human 2771 + 614 + 219 = 3604?
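The sums in this question, and the fractions they would give for the histogram, can be checked with a few lines. This is only a worked version of the arithmetic proposed here: whether the "ambiguous" and "unannotated" identifiers should count as unknown is still an open question at this point in the thread, and the human total of 19700 is the disputed figure (19690 and 19730 are also mentioned).

```python
# Per organism: total genes, then the three counts proposed as "unknown".
reported = {
    "pombe": {"total": 5070, "parts": (1, 16, 662)},
    "cerevisiae": {"total": 5915, "parts": (1, 168, 765)},
    # The human total is disputed in the thread; 19700 is one of three figures.
    "human": {"total": 19700, "parts": (2771, 614, 219)},
}


def unknown_total(entry):
    """Sum the counts proposed as unknown for one organism."""
    return sum(entry["parts"])


def unknown_percent(entry):
    """Unknowns as a percentage of the organism's gene total."""
    return 100 * unknown_total(entry) / entry["total"]


for organism, entry in reported.items():
    print(organism, unknown_total(entry), round(unknown_percent(entry), 1))
```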

Antonialock commented 6 years ago

should have been 19690 human genes NOT existence:uncertain NOT keyword:"Transposable element [KW-0814]" AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]" AND proteome:up000005640

ValWood commented 6 years ago

The human list I have has 19730.

I'm putting everything in the Google drive directory....

Antonialock commented 6 years ago

ok well then I don't know what is in your list, if you want the genes included in the human proteome excluding transposons you should have a list of 19690 (which you get if you search uniprot for NOT existence:uncertain NOT keyword:"Transposable element [KW-0814]" AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]" AND proteome:up000005640)

ValWood commented 6 years ago

48 comments on this one, closing. I'll open a new ticket for final stuff

ValWood commented 6 years ago

I didn't open the final ticket- I'm doing that now. This is the final figure for the paper. I need to get it to Steve early next week. I might send without the current version of this figure.