Closed bradfordcondon closed 6 years ago
I count 9 GO annotations, resulting in 12 links to term GO: 0051179 (which is not directly linked).
Term on our site: https://hardwoodgenomics.org/cv/lookup/GO/0051179 Term on EBI: https://www.ebi.ac.uk/QuickGO/term/GO:0051179
here's the really bizarre thing: this term's parent is biological process: GO:0008150 . Guess how many entries entity 132373 has for it? Four!
If youre curious, this entity has 200 entries in the linker table (all GO).
So the fix I added in #42 makes sure the indexer doesn't insert the same record more than once. It seems to work on my local dev and I am not really sure how much slower it will be but it shouldn't be that bad given that we have a queue system in place. Also note that the PR does not add the constraints on the db table which might not be absolutely necessary since only the indexer inserts data to the table and it already does the checks.
ok great. agreed it will be hard to test: Ill look at code, merge, and rerun on live ince thats where i see the problem and we wont be breaking any functionality.
not resolved with #42 unfortunately. when run on live.
i created a test case on the dupe branch that you can run . takes me ~1 minute to run. You need to load the GOSLIM OBO first.
/**
* Checks of a row already exists.
*
* @param $row
*
* @return bool
*/
public function exists($row) {
$query = db_select('tripal_cvterm_entity_linker', 'TCEL');
foreach ($row as $field => $value) {
var_dump($field);
var_dump($value);
$query->condition($field, $value);
}
var_dump($query->countQuery()->execute()->fetchField());
return ((int) $query->countQuery()->execute()->fetchField()) > 0;
}
so for some reason this query is always 0. Dumping each condition gives me strings of stuff like this:
string(9) "entity_id"
string(5) "34065"
string(9) "cvterm_id"
string(4) "2939"
string(8) "database"
string(2) "GO"
string(9) "accession"
string(7) "0003676"
string(1) "0"
string(9) "entity_id"
string(5) "34065"
string(9) "cvterm_id"
string(4) "2939"
string(8) "database"
string(2) "GO"
string(9) "accession"
string(7) "0003676"
string(1) "0"
string(9) "entity_id"
string(5) "34065"
string(9) "cvterm_id"
string(4) "2943"
string(8) "database"
string(2) "GO"
string(9) "accession"
string(7) "0003723"
string(1) "0"
You can see that evne in this string of var_dumps you have the same entity/cvterm being associated.
It seems clear to me why: insertData
only inserts ONCE, at the end. So, this exists
check will always be false, i think. We need to either
I pushed a new set of changes that guarantees no duplication happens. The new changes have passed all your unit tests. I also did manual testing of entity insert hook by creating a new entity. It would be nice to have a unit test to test the hooks though and since the publish method in TripalTestSuite does trigger the hook, the same method used to test dupes should work. We'd only need to remove the part where you manually trigger the handle method of the indexer.
agreed. it actually doesnt matter which method we confirm doesnt have duplicates (or, we could test both).
EIther way, the work is almost done to have test coverage of both the hooks and the job so I'm pretty pleased/ think it was worth the time to make the case. Let's push it over the finish line and explicitly test each case, hook and index job.
Implemented a unit test for hook entity insert (see https://github.com/statonlab/tripal_cv_xray/blob/hooks_and_prev_indexing/tests/XrayIndexerJobTest.php).
Implemented a unit test for the update hook (same file as above). Everything is pushed and ready to merged.
@bradfordcondon I'll create the PR for your review.
should be resolved with #47
I no longer see duplicate records!
8454705 55006 GO 0005515 132369
8454706 54785 GO 0005488 132369
8454707 62730 GO 0006812 132373
8454708 61452 GO 0016021 132373
8454709 60971 GO 0019829 132373
8454710 61061 GO 0005524 132373
8454711 59369 GO 0004012 132373
8454712 57769 GO 0015914 132373
8454713 55264 GO 0000287 132373
8454714 55036 GO 0000166 132373
8454715 55265 GO 0046872 132373
8454716 54725 GO 0008150 132373
8454717 68400 GO 0071840 132373
8454718 62114 GO 0016043 132373
8454719 62851 GO 0061024 132373
8454720 69121 GO 0097035 132373
8454721 75988 GO 0034204 132373
8454722 59372 GO 0045332 132373
8454723 84923 GO 0065007 132373
8454724 56299 GO 0065008 132373
8454725 73928 GO 0051179 132373
8454726 62728 GO 0051234 132373
8454727 60753 GO 0006810 132373
8454728 62797 GO 0071702 132373
8454729 56279 GO 0015748 132373
8454730 59370 GO 0005548 132373
8454731 56278 GO 0015711 132373
8454732 57888 GO 0006869 132373
8454733 60878 GO 0005319 132373
8454734 62780 GO 0055085 132373
8454735 66121 GO 0034220 132373
8454736 64047 GO 0098655 132373
8454737 54841 GO 0008324 132373
8454738 60755 GO 0015075 132373
8454739 67821 GO 0022853 132373
8454740 68180 GO 0042625 132373
8454741 67769 GO 0022857 132373
8454742 64225 GO 0022804 132373
it is pretty nuts how many records there are for a single entity. 132373 has 77 terms associated! 9 direct GO associations.
looks reasonable. ill update it and index live as well.
select * from the index where db/accession = something
for example:
this is from the cvterm_entity_linker table on live. Here's the corresponding entity