statonlab / tripal_cv_xray

view entities mapped onto CVs
1 stars 1 forks source link

index full of duplicates (no unique constraint?) #39

Closed bradfordcondon closed 6 years ago

bradfordcondon commented 6 years ago

select * from the index where db/accession = something

for example:

4227376 73928   GO  0051179 132373
4227377 73928   GO  0051179 132373
4227378 73928   GO  0051179 132373
4227379 73928   GO  0051179 132373
4227380 73928   GO  0051179 132373
4227381 73928   GO  0051179 132373
4227382 73928   GO  0051179 132373
4227383 73928   GO  0051179 132373
4227384 73928   GO  0051179 132373
4227385 73928   GO  0051179 132373
4227386 73928   GO  0051179 132373
4227387 73928   GO  0051179 132373

this is from the cvterm_entity_linker table on live. Here's the corresponding entity

bradfordcondon commented 6 years ago

I count 9 GO annotations, resulting in 12 links to term GO: 0051179 (which is not directly linked).

Term on our site: https://hardwoodgenomics.org/cv/lookup/GO/0051179 Term on EBI: https://www.ebi.ac.uk/QuickGO/term/GO:0051179

screen shot 2018-06-19 at 10 18 50 am

bradfordcondon commented 6 years ago

here's the really bizarre thing: this term's parent is biological process: GO:0008150 . Guess how many entries entity 132373 has for it? Four!

If youre curious, this entity has 200 entries in the linker table (all GO).

almasaeed2010 commented 6 years ago

So the fix I added in #42 makes sure the indexer doesn't insert the same record more than once. It seems to work on my local dev and I am not really sure how much slower it will be but it shouldn't be that bad given that we have a queue system in place. Also note that the PR does not add the constraints on the db table which might not be absolutely necessary since only the indexer inserts data to the table and it already does the checks.

bradfordcondon commented 6 years ago

ok great. agreed it will be hard to test: Ill look at code, merge, and rerun on live ince thats where i see the problem and we wont be breaking any functionality.

bradfordcondon commented 6 years ago

not resolved with #42 unfortunately. when run on live.

screen shot 2018-06-26 at 3 54 03 pm
bradfordcondon commented 6 years ago

i created a test case on the dupe branch that you can run . takes me ~1 minute to run. You need to load the GOSLIM OBO first.

  /**
   * Checks of a row already exists.
   *
   * @param $row
   *
   * @return bool
   */
  public function exists($row) {
    $query = db_select('tripal_cvterm_entity_linker', 'TCEL');

    foreach ($row as $field => $value) {
      var_dump($field);
      var_dump($value);

      $query->condition($field, $value);
    }

    var_dump($query->countQuery()->execute()->fetchField());

    return ((int) $query->countQuery()->execute()->fetchField()) > 0;
  }

so for some reason this query is always 0. Dumping each condition gives me strings of stuff like this:


string(9) "entity_id"
string(5) "34065"
string(9) "cvterm_id"
string(4) "2939"
string(8) "database"
string(2) "GO"
string(9) "accession"
string(7) "0003676"
string(1) "0"
string(9) "entity_id"
string(5) "34065"
string(9) "cvterm_id"
string(4) "2939"
string(8) "database"
string(2) "GO"
string(9) "accession"
string(7) "0003676"
string(1) "0"
string(9) "entity_id"
string(5) "34065"
string(9) "cvterm_id"
string(4) "2943"
string(8) "database"
string(2) "GO"
string(9) "accession"
string(7) "0003723"
string(1) "0"

You can see that evne in this string of var_dumps you have the same entity/cvterm being associated.

It seems clear to me why: insertData only inserts ONCE, at the end. So, this exists check will always be false, i think. We need to either

almasaeed2010 commented 6 years ago

I pushed a new set of changes that guarantees no duplication happens. The new changes have passed all your unit tests. I also did manual testing of entity insert hook by creating a new entity. It would be nice to have a unit test to test the hooks though and since the publish method in TripalTestSuite does trigger the hook, the same method used to test dupes should work. We'd only need to remove the part where you manually trigger the handle method of the indexer.

bradfordcondon commented 6 years ago

agreed. it actually doesnt matter which method we confirm doesnt have duplicates (or, we could test both).

EIther way, the work is almost done to have test coverage of both the hooks and the job so I'm pretty pleased/ think it was worth the time to make the case. Let's push it over the finish line and explicitly test each case, hook and index job.

almasaeed2010 commented 6 years ago

Implemented a unit test for hook entity insert (see https://github.com/statonlab/tripal_cv_xray/blob/hooks_and_prev_indexing/tests/XrayIndexerJobTest.php).

almasaeed2010 commented 6 years ago

Implemented a unit test for the update hook (same file as above). Everything is pushed and ready to merged.

@bradfordcondon I'll create the PR for your review.

bradfordcondon commented 6 years ago

should be resolved with #47

bradfordcondon commented 6 years ago

I no longer see duplicate records!

8454705 55006   GO  0005515 132369
8454706 54785   GO  0005488 132369
8454707 62730   GO  0006812 132373
8454708 61452   GO  0016021 132373
8454709 60971   GO  0019829 132373
8454710 61061   GO  0005524 132373
8454711 59369   GO  0004012 132373
8454712 57769   GO  0015914 132373
8454713 55264   GO  0000287 132373
8454714 55036   GO  0000166 132373
8454715 55265   GO  0046872 132373
8454716 54725   GO  0008150 132373
8454717 68400   GO  0071840 132373
8454718 62114   GO  0016043 132373
8454719 62851   GO  0061024 132373
8454720 69121   GO  0097035 132373
8454721 75988   GO  0034204 132373
8454722 59372   GO  0045332 132373
8454723 84923   GO  0065007 132373
8454724 56299   GO  0065008 132373
8454725 73928   GO  0051179 132373
8454726 62728   GO  0051234 132373
8454727 60753   GO  0006810 132373
8454728 62797   GO  0071702 132373
8454729 56279   GO  0015748 132373
8454730 59370   GO  0005548 132373
8454731 56278   GO  0015711 132373
8454732 57888   GO  0006869 132373
8454733 60878   GO  0005319 132373
8454734 62780   GO  0055085 132373
8454735 66121   GO  0034220 132373
8454736 64047   GO  0098655 132373
8454737 54841   GO  0008324 132373
8454738 60755   GO  0015075 132373
8454739 67821   GO  0022853 132373
8454740 68180   GO  0042625 132373
8454741 67769   GO  0022857 132373
8454742 64225   GO  0022804 132373

it is pretty nuts how many records there are for a single entity. 132373 has 77 terms associated! 9 direct GO associations.

https://hardwoods.ag.utk.edu/bio_data/132373

bradfordcondon commented 6 years ago

looks reasonable. ill update it and index live as well.