owlcollab / owltools

OWLTools
BSD 3-Clause "New" or "Revised" License

golr loading should guard or at least warn against clobbering #134

Open cmungall opened 9 years ago

cmungall commented 9 years ago

See https://github.com/geneontology/amigo/issues/240

The golr loader should make sure that GAF A does not overwrite GPI-level metadata from GAF B. This could be done in a few ways. We could use metadata in each GAF that specifies whether the GAF is an authoritative source of metadata for the entities it contains (by default they are, but Reactome would be declared not authoritative for UniProtKB).

At the very least, the loader should keep a cache of IDs loaded, and if the same ID is loaded in a second set then it should report it. This will make investigating https://github.com/geneontology/amigo/issues/240 easier.
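
A minimal sketch of the ID cache idea, in Python rather than the loader's own code. It assumes each load set is available as an iterable of GAF lines, with the DB in column 1 and the DB Object ID in column 2; the function name `find_clobber_candidates` and the source labels are hypothetical.

```python
from collections import defaultdict


def find_clobber_candidates(gaf_sets):
    """Report entity IDs that appear in more than one load set.

    gaf_sets maps a (hypothetical) source label to an iterable of GAF lines.
    Returns {entity_id: [sources, in load order]} for IDs seen more than once.
    """
    seen = defaultdict(list)
    for source, lines in gaf_sets.items():
        ids_in_source = set()
        for line in lines:
            # Skip blank lines and GAF header/comment lines.
            if not line.strip() or line.startswith("!"):
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 2:
                continue
            ids_in_source.add(f"{cols[0]}:{cols[1]}")
        for entity_id in ids_in_source:
            seen[entity_id].append(source)
    return {eid: srcs for eid, srcs in seen.items() if len(srcs) > 1}
```

This keeps every ID in memory, so it only illustrates the reporting behaviour and says nothing about whether it would scale to a full load.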

cmungall commented 9 years ago

OK, a very quick improvement:

We should add a QC check that symbols never follow the deprecated $SYMBOL_$TAXCODE format (although, ironically, we should probably add such symbols during the load to fix the ubiquitous can't-find-my-gene issue). The loader should at least warn if it sees this, as it suggests something suspect has crept in.

Human is highest priority so the check is to test the symbol field for the regex /_HUMAN$/
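
As a rough illustration of that check, here is a Python sketch that scans GAF lines and collects the ones whose symbol matches the regex. It assumes the symbol is the third tab-separated column (DB Object Symbol); the function name `suspect_symbols` is hypothetical, and a real check would live in the loader itself.

```python
import re

# Pattern from the comment above: symbols ending in _HUMAN are suspect.
DEPRECATED_SYMBOL = re.compile(r"_HUMAN$")


def suspect_symbols(lines):
    """Return (line_number, symbol) pairs for symbols matching the pattern."""
    warnings = []
    for lineno, line in enumerate(lines, start=1):
        # Skip GAF header/comment lines.
        if line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3:
            continue
        symbol = cols[2]  # assumed position of the symbol field
        if DEPRECATED_SYMBOL.search(symbol):
            warnings.append((lineno, symbol))
    return warnings
```

Extending the pattern to other taxon codes would be a matter of widening the regex, at the risk of more false positives.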

kltm commented 9 years ago

What would also really help would be seriously reducing the log levels to get at the meat - the Jenkins logs are unusable where they stand.

cmungall commented 9 years ago

@hdietze can we control this?

hdietze commented 9 years ago

My initial answer is no. At the moment the GAFs are loaded separately, and we do not check during the load for already existing bioentities in Golr.

kltm commented 8 years ago

@hdietze and I talked a little bit more about this, and we kind of feel that most solutions to this problem don't really scale--checking globally for bad incoming data is expensive, either space- or time-wise.

kltm commented 8 years ago

Discussed with @cmungall; we will go with the hackiest and easiest solution for now, and revisit later as needed: we'll load "trusted" sources after "untrusted" ones, so clobbering is at least favorable to the way we want the world.
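
The reordering can be modelled in a few lines of Python. This is only a sketch: it assumes that a later document with the same ID replaces the earlier one (the clobbering behaviour described in this thread), and the names `load_order`, `simulate_load`, and the source labels are all hypothetical.

```python
def load_order(sources, trusted):
    """Order sources so that trusted ones load last.

    sorted() is stable, so the relative order within each group is kept.
    """
    return sorted(sources, key=lambda source: source in trusted)


def simulate_load(ordered_sources, docs_by_source):
    """Toy last-writer-wins index: {entity_id: (symbol, source)}."""
    index = {}
    for source in ordered_sources:
        for entity_id, symbol in docs_by_source[source]:
            # A later source overwrites whatever an earlier one stored.
            index[entity_id] = (symbol, source)
    return index
```

With the trusted source loaded last, its metadata is what survives for any shared ID. Entities that only appear in an untrusted source are unaffected either way.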

cmungall commented 8 years ago

The proposed solution does not work for populating closures at the bioentity level

kltm commented 8 years ago

Okay, the reorder, while not a fix, should at least give a more expected result for mods in most situations, right?

Long-term, as keeping this all in memory is probably not an option, the solution would be to have a GAF processing step that bins entries into new "loadable" GAFs. @hdietze mentioned that @selewis was looking at something like that for PAINT GAFs anyways?

kltm commented 8 years ago

And, of course, a similar thing could be possible for IEAs as well. Two birds with one stone: binning the GAFs would fix that, as well as possibly make the ingestion of GAFs more bite-sized.
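
One way such a binning step could look, sketched in Python. The binning key is an assumption, since the thread does not specify one: here every line for a given entity (DB column plus DB Object ID column) is routed to the same bin by a deterministic hash, regardless of which input GAF it came from. The function name `bin_gaf_lines` is hypothetical.

```python
import zlib
from collections import defaultdict


def bin_gaf_lines(lines, n_bins):
    """Group GAF lines into n_bins so each entity lands in exactly one bin.

    lines can be the concatenation of several input GAFs.
    Returns {bin_number: [lines]}; empty bins are omitted.
    """
    bins = defaultdict(list)
    for line in lines:
        # Skip blank lines and GAF header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.split("\t")
        if len(cols) < 2:
            continue
        entity_id = f"{cols[0]}:{cols[1]}"
        # crc32 is stable across runs, unlike Python's built-in hash().
        bin_number = zlib.crc32(entity_id.encode("utf-8")) % n_bins
        bins[bin_number].append(line)
    return dict(bins)
```

Because all lines for an entity end up together, each bin could in principle be processed as its own unit. A real step would stream lines to files rather than hold them in lists.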

cmungall commented 8 years ago

> Okay, the reorder, while not a fix, should at least give a more expected result for mods in most situations, right?

Correct. Can we do a test load?

> Long-term, as keeping this all in memory is probably not an option, the solution would be to have a GAF processing step that bins entries into new "loadable" GAFs. @hdietze mentioned that @selewis was looking at something like that for PAINT GAFs anyways?

Yes

kltm commented 8 years ago

We can do a test load...but that would take our "production" amigo offline for a bit. If you don't mind a slightly different load, we can trigger "noctua/dev" amigo right now (at the cost of bumping that offline for a bit).

kltm commented 8 years ago

Talking to @hdietze, apparently the immediate issue with PomBase and the PAINT clobber should not have occurred in the first place, and is traceable to an issue in PAINT (tagged above).

That said, these issues are expected to come up anyways. Even if we get a PAINT fix, we then would have to do something similar for IEAs, and then again for random GAFs that we'd consume from various groups. Essentially, there needs to be a way that things are binned correctly within a calculation unit, and it seems like the easiest way to get there would be a global pre-process step.
