openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

Over expanded URIs #84

Open ChristineChichester opened 10 years ago

ChristineChichester commented 10 years ago

There are some URIs (issue found with gene IDs) that are expanded massively and, when used in another call (target info, for example), give a 500 error. Examples:
http://identifiers.org/ncbigene/5327
http://identifiers.org/ncbigene/2242
http://identifiers.org/ncbigene/4193
http://identifiers.org/ncbigene/3248
http://identifiers.org/ncbigene/4803
http://identifiers.org/ncbigene/1610
http://identifiers.org/ncbigene/1312

These are expanded to all TrEMBL accession numbers. http://identifiers.org/ncbigene/5327 gives over 3k http://purl.uniprot.org/unique accession numbers, which then generates a huge query.

One solution may be to filter for only SwissProt (SP) URIs.
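A minimal sketch of that filter, assuming the mappings are available as (source URI, UniProt URI) pairs and that a set of known SwissProt accessions exists somewhere. The function names, the `http://purl.uniprot.org/uniprot/` prefix used in the sample data, and the placeholder accessions are illustrative assumptions, not the IMS implementation:

```python
def accession_of(uri):
    """Return the last path segment of a URI (the accession part)."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def filter_to_swissprot(mappings, swissprot_accessions):
    """Keep only the pairs whose target accession is in the known SwissProt set."""
    return [
        (source, target)
        for source, target in mappings
        if accession_of(target) in swissprot_accessions
    ]


# Placeholder data: the accessions are made up for illustration.
GENE = "http://identifiers.org/ncbigene/5327"
UNIPROT = "http://purl.uniprot.org/uniprot/"  # assumed URI prefix
example_mappings = [
    (GENE, UNIPROT + "REVIEWED_1"),
    (GENE, UNIPROT + "UNREVIEWED_1"),
    (GENE, UNIPROT + "UNREVIEWED_2"),
]
known_swissprot = {"REVIEWED_1"}

filtered = filter_to_swissprot(example_mappings, known_swissprot)
print(len(example_mappings), "->", len(filtered))
```

The filter only shrinks the expansion; it does not explain why the extra mappings exist in the first place.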

Chris-Evelo commented 10 years ago

Any idea why these expand to all trEMBL accession numbers? At least the first one looks like an ordinary gene to me. I suppose the linkset actually has this expansion? Is that because there is something strange in ENSEMBL?

Christian-B commented 10 years ago

The problem appears to come from the Ensembl-based Ensembl to UniProt mappings. http://identifiers.org/ncbigene/5327 maps to http://www.ensembl.org/id/ENSG00000104368 (and one more), but http://www.ensembl.org/id/ENSG00000104368 maps to many UniProt IDs.

These come mainly from http://openphacts.cs.man.ac.uk/ims//originals/ensembl_2013-07-22/homo_sapiens_core_71_37_ensembl_uniprot.ttl

This was a file that Andra extracted from UniProt with some difficulty.

Christian-B commented 10 years ago

We may have to filter that mapping set to remove the URIs not known to be SwissProt data.

Christian-B commented 10 years ago

Does anyone have a list of known SwissProt URIs, or of UniProt URIs known to OPS?

antonisloizou commented 10 years ago

The cache knows of 1149617 of them. Could we use a solution similar to the CW-gene lens?

How do you want the list? Flat? RDF? VoID?

Chris-Evelo commented 10 years ago

Removing the non-SwissProt URLs might be a practical way to do it. But … even if they are from trEMBL there shouldn’t be that many! And I would prefer to find out why we have so many in the first place.

Andra is currently at home working on his thesis. And he can’t / shouldn’t do it. Will see whether we can come up with something else.

antonisloizou commented 10 years ago

Both develop and 1.4.0 are now patched and able to handle such large queries (if we ever legitimately get 3k mappings).

Christian-B commented 10 years ago

Note: this is NOT fixed in IMS 1.4.1.

leeharland commented 10 years ago

I have added this to the TTF agenda for Thursday (8th May). I'd also like to understand the "fix".

ChristineChichester commented 10 years ago

Here is one that seems still not to work: http://identifiers.org/ncbigene/3248

Christian-B commented 10 years ago

Lens Fix.

  1. Keep only the UniProt-based Ensembl - UniProt mappings in the default lens, including all transitives using this linkset.
  2. Create a lens which includes the Ensembl-based Ensembl - UniProt mappings, including all transitives using this linkset.
  3. Keep all Ensembl - other data source mappings in the default lens, including their transitives.
  4. To do: check if any of the above linksets are used in Chemistry lens linksets.

Manual Fix.

  A. Generate a list of the Ensembl IDs with how many UniProt links each has.
  B. Check if these Ensembl IDs have UniProt-based mappings.
  C. Generate a list of the Ensembl IDs in A but not in B.
  D. Extract from the Ensembl-based linksets the mappings where C IDs map to just a few UniProt IDs (say 5 or fewer).
  E. Generate a list of C IDs which map to many UniProt IDs.
  F. Chris and Christine to manually generate Ensembl -> SwissProt mappings for the IDs identified in E.
  G. Add mapping sets D and F into the default lens.

Both fixes will be applied together.

Notes: There are no transitives with multiple Ensembl based linksets so no clash between above rules.
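Steps A to E of the manual fix could be sketched roughly as below. This assumes the Ensembl-based linkset has already been loaded as (Ensembl ID, UniProt ID) pairs and that the Ensembl IDs covered by the UniProt-based linkset are available as a set; `plan_manual_fix` and the toy IDs are hypothetical names, not existing IMS code:

```python
from collections import defaultdict


def plan_manual_fix(ensembl_based_pairs, ids_with_uniprot_based_mapping, max_links=5):
    """Split Ensembl-based mappings into 'keep as is' and 'needs manual curation'."""
    # Step A: collect the UniProt targets per Ensembl ID.
    targets = defaultdict(set)
    for ensembl_id, uniprot_id in ensembl_based_pairs:
        targets[ensembl_id].add(uniprot_id)

    # Steps B and C: Ensembl IDs that have no UniProt-based mapping.
    uncovered = {e for e in targets if e not in ids_with_uniprot_based_mapping}

    # Step D: keep mappings of uncovered IDs that have only a few targets.
    keep = sorted(
        (e, u) for e in uncovered if len(targets[e]) <= max_links for u in targets[e]
    )

    # Step E: uncovered IDs with many targets go to manual curation (step F).
    curate = sorted(e for e in uncovered if len(targets[e]) > max_links)
    return keep, curate


# Toy data with made-up identifiers.
pairs = [("G1", "U1"), ("G1", "U2"), ("G2", "U1")]
pairs += [("G3", f"U{i}") for i in range(7)]
keep, curate = plan_manual_fix(pairs, ids_with_uniprot_based_mapping={"G2"})
print(keep)
print(curate)
```

The threshold of 5 comes from step D; the manually curated mappings from step F would then be merged with `keep` in step G.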

Chris-Evelo commented 10 years ago

The example given by Christine:

"Here is one that seems still not to work: http://identifiers.org/ncbigene/3248" might also give us a suggestion how we want to solve this in the long run.

If you follow the link to ENSEMBL and from there to the known transcripts you end up with this ENSEMBL table: http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000164120;r=4:175411328-175444305 that actually has fewer than 10 protein-coding transcripts. So in an ideal world we would have all mappings from ENSEMBL to all UniProt proteins (most likely just 1) and all trEMBL entries (about 10) that are coded for by these transcripts. These should correspond to the ENSEMBL proteins in the lists. It is a bit more than just UniProt but not a lot more.

(The other direction could still have all trEMBL IDs, just to map them to a gene in case somebody actually uses one in WikiPathways or reports it as a result of, e.g., a proteomics experiment, where you can get part of a sequence that is then BLASTed (blastp) against trEMBL.)


Christian-B commented 10 years ago

I have done the Ensembl-based count. See: https://www.dropbox.com/s/0m12mz0t6gbqumr/enembleBased.xlsx

The highest 1-to-M is ENSG00000198804, which maps to over 300,000 UniProt URIs.

51 Ensembl URIs map over 10,000 times
169 Ensembl URIs map over 1,000 times
636 Ensembl URIs map over 100 times

169 and 636 include the numbers above.
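For reference, cumulative counts of this kind can be reproduced from a per-ID link count with a few lines. This is only a sketch: `links_per_id` stands for whatever table the spreadsheet was built from, and the numbers in the example are invented:

```python
def count_over_thresholds(links_per_id, thresholds=(100, 1_000, 10_000)):
    """For each threshold, count the IDs with more links than that threshold.

    The counts are cumulative: an ID with 20,000 links is counted under
    every threshold it exceeds.
    """
    return {
        threshold: sum(1 for n in links_per_id.values() if n > threshold)
        for threshold in thresholds
    }


# Invented link counts, for illustration only.
links_per_id = {"id_a": 50, "id_b": 150, "id_c": 2_000, "id_d": 20_000}
print(count_over_thresholds(links_per_id))
```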

Christian-B commented 10 years ago

I also noticed that there is NOT one Ensembl ID found in http://openphacts.cs.man.ac.uk/ims//linkset/version1.4.1/uniprot/uniprot_ensembl.ttl which is also in http://openphacts.cs.man.ac.uk/ims//linkset/version1.4.1/ensembl/uniprot.ttl, even when taking only the ID part of the URI into consideration.
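A check like this can be sketched as follows, assuming the Ensembl URIs from each linkset have already been extracted into plain lists. The helper names are hypothetical, the `identifiers.org/ensembl` URI and the ENSP identifier are made-up sample values, and the prefix tally is just one quick way to see what kinds of IDs each list holds:

```python
from collections import Counter


def local_id(uri):
    """Return the ID part of a URI (its last path segment)."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def shared_local_ids(uris_a, uris_b):
    """IDs that occur in both lists, ignoring the URI namespace."""
    return {local_id(u) for u in uris_a} & {local_id(u) for u in uris_b}


def prefix_counts(uris, prefix_length=4):
    """Tally the leading characters of each ID to reveal the ID types present."""
    return Counter(local_id(u)[:prefix_length] for u in uris)


# Sample values only; the second URI and its ID are invented.
ensembl_based = ["http://www.ensembl.org/id/ENSG00000104368"]
uniprot_based = ["http://identifiers.org/ensembl/ENSP00000000001"]

print(shared_local_ids(ensembl_based, uniprot_based))
print(prefix_counts(ensembl_based), prefix_counts(uniprot_based))
```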

Chris-Evelo commented 10 years ago

Could that be because the first was based on UniProt and thus maps to ENSEMBL proteins (starting with ENSP, from the back of my head while on a train) and the other starts with ENSEMBL genes (which start with ENSG)? I am actually not sure why we kept both mappings anyway, since the first was only there as a quick fix while we were working on the latter.

Chris-Evelo commented 10 years ago

Christian reported that this is indeed the case. Do we also have a linkset to map ENSEMBL genes (ENSG for human) to ENSEMBL proteins (ENSP)? In that case we could still use the UniProt to ENSEMBL mappings to find the UniProt IDs for the spurious ones.

leeharland commented 10 years ago

Thanks Christian. Hopefully Christine & Chris can have a look and see if we can create a better data fix to go with the IMS fixes you propose.

Chris-Evelo commented 10 years ago

Here are the two solutions (short run and long run) that Christine and I came up with.

1) Short run
First we completed the path from SwissProt to ENSEMBL gene.

Andra created two new linksets (without VoID headers, he really has no time for that) from the ENSEMBL database yesterday, to link ENSEMBL-gene to ENSEMBL-transcript and ENSEMBL-transcript to ENSEMBL-protein.

Since we already have SwissProt to ENSEMBL-protein we can use these two backwards to go all the way from SwissProt to ENSEMBL-gene. (Christian might want/need to make some extra transitives for that).

Now we have two options.
  1) We look up all the ENSEMBL gene entries that cause problems (more than 100 links) when going from ENSEMBL-gene to UniProt (which includes trEMBL) and replace these with UniProt links found along the path described above.
  2) We just create SwissProt to ENSEMBL gene for all SwissProt IDs and replace the whole linkset.
I would prefer 1, since we might actually have some useful trEMBL hits from known transcripts that are not covered by that SwissProt list.
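A rough sketch of the path and of option 1, assuming the three linksets are loaded as simple pairs: (gene, transcript), (transcript, protein) and (SwissProt, ENSEMBL protein). All function names and identifiers here are hypothetical; the real transitives would be produced by the IMS:

```python
from collections import defaultdict


def gene_to_swissprot(gene_transcript, transcript_protein, swissprot_protein):
    """Follow gene -> transcript -> protein, then protein -> SwissProt backwards."""
    proteins_by_transcript = defaultdict(set)
    for transcript, protein in transcript_protein:
        proteins_by_transcript[transcript].add(protein)

    swissprot_by_protein = defaultdict(set)
    for swissprot_id, protein in swissprot_protein:
        swissprot_by_protein[protein].add(swissprot_id)

    result = defaultdict(set)
    for gene, transcript in gene_transcript:
        for protein in proteins_by_transcript.get(transcript, ()):
            result[gene] |= swissprot_by_protein.get(protein, set())
    # Drop genes for which the path yields nothing.
    return {gene: ids for gene, ids in result.items() if ids}


def replace_problem_genes(gene_uniprot, curated, max_links=100):
    """Option 1: swap in the path-derived links only for over-expanded genes."""
    fixed = {}
    for gene, targets in gene_uniprot.items():
        if len(targets) > max_links and gene in curated:
            fixed[gene] = set(curated[gene])
        else:
            # Small sets, and problem genes without a path, stay unchanged.
            fixed[gene] = set(targets)
    return fixed


# Toy data with made-up identifiers.
curated = gene_to_swissprot(
    gene_transcript=[("ENSG_A", "ENST_A1"), ("ENSG_A", "ENST_A2"), ("ENSG_B", "ENST_B1")],
    transcript_protein=[("ENST_A1", "ENSP_A1"), ("ENST_B1", "ENSP_B1")],
    swissprot_protein=[("SP_1", "ENSP_A1")],
)
gene_uniprot = {"ENSG_A": {f"TR_{i}" for i in range(150)}, "ENSG_B": {"TR_X"}}
fixed = replace_problem_genes(gene_uniprot, curated)
print({gene: len(ids) for gene, ids in fixed.items()})
```

Option 2 would correspond to using `curated` directly as the whole linkset, which drops every trEMBL link.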

2) Long run
We will start with option 2 from above and extend it. For the extension we will use all (transitive) links from ENSEMBL gene to ENSEMBL protein to find all known protein products in ENSEMBL. If we can find a linkset from these to trEMBL (or back) we will use that. But we probably won’t; in that case we will take the sequences of all these transcripts (ca. 100K expected), BLAST them against trEMBL and look for full (99%) alignments. These we will add (or use as a separate additional linkset).

3) Solving the trEMBL in WikiPathways problem
The procedures above might lead to us missing some trEMBL entries that occur in the human part of WikiPathways. We will first use the data to find which trEMBL entries those are (they will be UniProt entries in WikiPathways that are not in the new linkset). We will curate those and try to replace them with something that makes more sense. If that is not possible, we still want to keep them, and we can find ENSEMBL-gene to trEMBL links, we will add these links to the linksets described.

andrawaag commented 10 years ago

The linksets I made yesterday are only from the human set. Tell me if it works and I'll make the linksets of the other species

Christian-B commented 10 years ago

Thanks Andra! To the best of my knowledge the issue is mainly if not purely from the human set, so don't waste your free time on other species until it is proven to be required.

Chris-Evelo commented 10 years ago

Also, for other species there will be more cases where we have no real SwissProt entry and thus do need the "Long Run" approach (which will include trEMBL links when there are no SwissProt links). This will become increasingly important for less well-studied species.

Christian-B commented 10 years ago

I had a play with the linksets supplied by Andra/Chris. Of the human Ensembl IDs with over 5 UniProt links:

3909 are in the ENSG format but not in the supplied new linkset.

748 are in the supplied linksets, but none of the ENSTs they link to are in the UniProt-supplied linkset.

349 would be mappable to UniProt transitively.

88 have a different ID pattern not covered by the two linksets.

I can generate a detailed list of the URIs involved in the above.
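The four groups above could be derived with a classification along these lines. It is a sketch under assumptions: the gene-to-transcript linkset is held as a dict of sets, the transcripts present in the UniProt-supplied linkset as a set, and the category labels, the `ENSG` pattern check and all identifiers are invented for illustration:

```python
import re
from collections import Counter

# Assumed shape of a human Ensembl gene ID: "ENSG" followed by digits.
ENSG_PATTERN = re.compile(r"^ENSG\d+$")


def classify(ensembl_id, gene_to_transcripts, transcripts_with_uniprot):
    """Assign an Ensembl ID to one of the four groups described above."""
    if not ENSG_PATTERN.match(ensembl_id):
        return "other_id_pattern"
    transcripts = gene_to_transcripts.get(ensembl_id)
    if not transcripts:
        return "not_in_new_linkset"
    if not any(t in transcripts_with_uniprot for t in transcripts):
        return "no_transcript_in_uniprot_linkset"
    return "mappable_transitively"


# Made-up identifiers, for illustration only.
gene_to_transcripts = {
    "ENSG00000000001": {"ENST00000000001"},
    "ENSG00000000002": {"ENST00000000002"},
}
transcripts_with_uniprot = {"ENST00000000001"}
ids = ["ENSG00000000001", "ENSG00000000002", "ENSG00000000003", "OTHER_0001"]

summary = Counter(classify(i, gene_to_transcripts, transcripts_with_uniprot) for i in ids)
print(summary)
```

Collecting the IDs per category instead of counting them would give the detailed lists.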

Chris-Evelo commented 10 years ago

"748 are in the supplied linksets, but none of the ENSTs they link to are in the UniProt-supplied linkset."

This one I don't understand. The UniProt-supplied one maps to ENSP, right? So you should go from ENST to ENSP first.

[It could in principle be that there are none indeed (no known protein transcript at all), but that would be really surprising given the high number of trEMBL mappings].

A few (10 or so) examples of each would be nice to explore this further.

Christian-B commented 10 years ago

349 were found via ENSG -> ENST -> UniProt. I have not yet tried ENSG -> ENST -> ENSP -> UniProt, which may find many of the 748.

egonw commented 10 years ago

Christian, do we have a breakdown of those over 300,000 links for ENSG00000198804 from which link sets they originate? This should be derivable from all the VoID info that is being captured. I only realized this this morning, and have not checked what the API has to say about this provenance...