cmungall opened 9 years ago
OK, a very quick improvement:
We should add a QC check that symbols never follow the deprecated $SYMBOL_$TAXCODE
format (although, ironically, we should probably add symbols during the load to fix the ubiquitous can't-find-my-gene issue). At a minimum it should warn when it sees this, as it suggests something suspect has crept in.
Human is highest priority so the check is to test the symbol field for the regex /_HUMAN$/
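A minimal sketch of what that QC check could look like (the function name and warning format are illustrative; only the `/_HUMAN$/` regex comes from the thread):

```python
import re

# Deprecated $SYMBOL_$TAXCODE suffix check; human is the highest-priority
# case, so we test the symbol field against /_HUMAN$/ as suggested above.
HUMAN_SUFFIX = re.compile(r"_HUMAN$")

def check_symbol(symbol):
    """Return a warning string if the symbol looks suspect, else None."""
    if HUMAN_SUFFIX.search(symbol):
        return f"suspect symbol (deprecated _HUMAN suffix): {symbol}"
    return None
```

`check_symbol("TP53_HUMAN")` would be flagged, while `check_symbol("TP53")` passes.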
What would also really help would be seriously reducing the log levels to get at the meat - the Jenkins logs are unusable as they stand.
@hdietze can we control this?
My initial answer is no. At the moment the GAFs are loaded separately and we do not check during the load for already existing (bioentities) entities in Golr.
@hdietze and I talked a little bit more about this, and kind of feel that most solutions to this problem don't really scale--checking globally for bad incoming data is expensive in either space or time.
Discussed with @cmungall , will go with the hackiest and easiest solution for now, and revisit later as needed: we'll load "trusted" sources after "untrusted" ones, so clobbering at least favors the way we want the world to look.
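A sketch of that ordering trick (the `TRUSTED` set and source names are purely illustrative, not the actual load configuration):

```python
# Load untrusted GAF sources first and trusted ones last, so that when the
# loader clobbers duplicate bioentity IDs, the trusted record wins.
TRUSTED = {"pombase", "mgi"}  # assumed set of authoritative sources

def load_order(sources):
    """Return sources reordered: untrusted first, trusted last (stable)."""
    return sorted(sources, key=lambda s: s in TRUSTED)
```

Because `sorted` is stable, the relative order within each group is preserved; only trusted sources are pushed to the end.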
The proposed solution does not work for populating closures at the bioentity level
Okay, the reorder, while not a fix, should at least give a more expected result for MODs in most situations, right?
Long-term, since keeping this all in memory is probably not an option, the plan would be to have a GAF processing step that bins entries into new "loadable" GAFs. @hdietze mentioned that @selewis was looking at something like that for PAINT GAFs anyways?
And, of course, for IEAs as well, a similar thing could be possible. Two birds with one stone, binning the GAFs would fix that as well as possibly make the ingestion of GAFs more bite-sized.
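A rough sketch of such a binning pre-process (column positions follow the GAF 2.x layout, with DB and DB Object ID in the first two columns; everything else here is an assumption about how the step might look):

```python
from collections import defaultdict

def bin_gaf(lines):
    """Group GAF annotation lines into per-entity bins keyed by DB:DB_Object_ID."""
    bins = defaultdict(list)
    for line in lines:
        if line.startswith("!"):  # skip GAF header comments
            continue
        cols = line.rstrip("\n").split("\t")
        bins[f"{cols[0]}:{cols[1]}"].append(line)
    return bins
```

The per-entity bins could then be written back out as new "loadable" GAFs, which would also make ingestion more bite-sized, as suggested above.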
> Okay, the reorder, while not a fix, should at least give a more expected result for MODs in most situations, right?
Correct. Can we do a test load?
> Long-term, since keeping this all in memory is probably not an option, the plan would be to have a GAF processing step that bins entries into new "loadable" GAFs. @hdietze mentioned that @selewis was looking at something like that for PAINT GAFs anyways?
Yes
We can do a test load...but that would take our "production" amigo offline for a bit. If you don't mind a slightly different load, we can trigger "noctua/dev" amigo right now (at the cost of bumping that offline for a bit).
Talking to @hdietze, apparently the immediate issue with PomBase and the PAINT clobber should not have occurred in the first place, and is traceable to an issue in PAINT (tagged above).
That said, these issues are expected to come up anyways. Even if we get a PAINT fix, we would then have to do something similar for IEAs, and then again for random GAFs that we'd consume from various groups. Essentially, there needs to be a way that things are binned correctly within a calculation unit, and it seems like the easiest way to get there would be a global pre-process step.
See https://github.com/geneontology/amigo/issues/240
The golr loader should make sure that GAF A does not overwrite GPI-level metadata from GAF B. This could happen a few ways. We could use metadata in each GAF that specifies whether the GAF is an authoritative source of metadata for entities in the GAF itself (by default they are, but Reactome would be declared not authoritative for UniProtKB). At the very least, the loader should keep a cache of IDs loaded, and if the same ID is loaded in a second set then it should report this. That will make investigating https://github.com/geneontology/amigo/issues/240 easier.
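A minimal sketch of that duplicate-ID cache (class and method names are hypothetical; only the cache-and-report behavior comes from the proposal above):

```python
class LoadedIdCache:
    """Remember which GAF first contributed each bioentity ID, and report
    when a later GAF re-loads an ID that was already seen."""

    def __init__(self):
        self.seen = {}  # entity_id -> name of the GAF that first loaded it

    def record(self, entity_id, gaf_name):
        """Return a report string on a cross-GAF collision, else None."""
        first = self.seen.setdefault(entity_id, gaf_name)
        if first != gaf_name:
            return f"{entity_id} from {gaf_name} overwrites entry loaded from {first}"
        return None
```

Running this during a load would surface exactly the kind of cross-GAF clobbering behind https://github.com/geneontology/amigo/issues/240 without requiring a full global check.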