srvarey / gbif-occurrencestore

Automatically exported from code.google.com/p/gbif-occurrencestore
0 stars 0 forks source link

Duplicate taxon_concepts #29

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Sqoop out fails due to duplicate taxon concept PKs

The IDs are coming from CLB derived TCs, and they start at ID: 102456963

A sample of a CLB derived that is duplicated (ID: 102456963):

102456963   7000    3807442 2   12850   104970998   103939980   103940417   103940418   103940797
    103940810   103964007   104970998   1   0   NULL    100 0   NULL    NULL    NULL
102456963   7000    3807442 355 12847   104970998   103939980   103940417   103940418   1039407
97  103940810   103964007   104970998   1   0   NULL    100 0   NULL    NULL    NULL

So the fields causing the duplicate are DATA_PROVIDER_ID and DATA_RESOURCE_ID

Need to investigate if these are duplicated in the CLB, or if the registry join 
is joining to 2 records hence causing it.  I suspect the latter.

Need to check name_usage id of 102456963 and find the registry UUID and then 
see what that brings back from the registry sqooped in tables.

Original issue reported on code.google.com by timrobertson100 on 29 Apr 2011 at 6:14

GoogleCodeExporter commented 9 years ago
I have a hunch that there may be 2 resources for the same UUID - one from one 
provider, and the same resource in catalogue of life.  If that is the case, we 
need to fix content in registry DB.  Please find UUIDs and liaise with Jose if 
needed - he will need to know which DB it is.

Original comment by timrobertson100 on 29 Apr 2011 at 6:17

GoogleCodeExporter commented 9 years ago
Indeed CLB tells me it is ITIS

clb.dataset says UUID: "2cffcaed-aa4f-4f02-8961-a14df9b3ac0a" which is coming 
as part of the catalogue of life checklist.

the registry perhaps has 2 ITIS entries with the same UUID?  Should only be 1 
with that UUID 

Original comment by timrobertson100 on 29 Apr 2011 at 6:21

GoogleCodeExporter commented 9 years ago
in the registry there is 1 agent for ITIS with agent ID 1541

However, this is CORRECTLY tied to 2 providers:

mysql> select count(*) from agent_relation where to_agent_id=1541;
+----------+
| count(*) |
+----------+
|        2 | 
+----------+
1 row in set (0.01 sec)

Could this be the source of the duplicate?

Original comment by timrobertson100 on 29 Apr 2011 at 6:26

GoogleCodeExporter commented 9 years ago
UUID "2cffcaed-aa4f-4f02-8961-a14df9b3ac0a" has only 1 resource associated. 

There are 2 relations but one of them is deleted. the sql should filter out 
deleted!=null

Original comment by josecua...@gmail.com on 29 Apr 2011 at 6:28

GoogleCodeExporter commented 9 years ago
Agent relationships seem to be marked as deleted Lars:

mysql> select * from agent_relation where to_agent_id=1541;
+------+------------------+--------------+---------------+-------------+--------
-------------+---------------------+-------------------+---------------------+--
------+
| id   | relation_type_id | has_accepted | from_agent_id | to_agent_id | 
created             | modified            | verification_code | deleted         
    | direct |
+------+------------------+--------------+---------------+-------------+--------
-------------+---------------------+-------------------+---------------------+--
------+
| 1483 |                2 |            1 |          1639 |        1541 | 
2011-03-15 15:45:42 | 2011-03-15 15:45:42 | NULL              | NULL            
    |   NULL | 
| 1618 |                2 |            1 |          1689 |        1541 | 
2011-04-15 17:21:08 | 2011-04-15 17:21:21 | EC2ZKK9           | 2011-04-15 
17:24:33 |   NULL | 
+------+------------------+--------------+---------------+-------------+--------
-------------+---------------------+-------------------+---------------------+--
------+
2 rows in set (0.02 sec)

Is the registry sqoop out VIEW taking into account the "and delete is null" ?

That would be 1 potential source of duplication...  if not, please discuss with 
Jose - we're so close.....

Original comment by timrobertson100 on 29 Apr 2011 at 6:32

GoogleCodeExporter commented 9 years ago
Deleted agents or relations are not considered at all. Sqoop can have an
optional WHERE clause that might help.

Original comment by lars.fra...@gmail.com on 29 Apr 2011 at 6:39

GoogleCodeExporter commented 9 years ago
No agent is duplicated in the registry, but there is indeed one case where one 
single agent (Fishbase) is linked to 2 Organisations

agent_id=1561

It can be seen here:
http://gbrds.gbif.org/browse/agent?uuid=bb9bbebe-49d9-45a8-923b-69aab3757d05

Conceptually this is not wrong I think, but maybe could cause some issues on 
the part of you guys. 

Original comment by josecua...@gmail.com on 29 Apr 2011 at 6:42

GoogleCodeExporter commented 9 years ago
Perhaps just add the deleted clause in Hive?

e.g. in create_data_provider.q :
JOIN ${registry_agent_relation} rel ON org.id=rel.from_agent_id AND 
rel.relation_type_id=2 AND rel.deleted IS NULL

Could be used in other places too though

Original comment by timrobertson100 on 29 Apr 2011 at 9:26

GoogleCodeExporter commented 9 years ago
Adding
--where "deleted IS NULL"
to Sqoop doesn't work in our Oozie version. It's supported in the version 
coming with CDH3u0 though.

We don't have a view  for agent_relation so I'm just going to add another 
predicate wherever it is needed and will try again.

Thanks for figuring this out you two!

Original comment by lars.fra...@gmail.com on 30 Apr 2011 at 9:18

GoogleCodeExporter commented 9 years ago
Perhaps the cleanest is to leave the WF as it is and then create 2 views on the 
registry like you have on the portal?

This way we can add the sanitizing of \t etc, and also ignore all content that 
is marked as deleted.

create or replace view sqoop_agent as
select
  id,
  clean(description)
  ...
from agent
where deleted is null;

-- we ignore deleted, and only care about the provides relationships
create or replace view sqoop_agent_relation as
select
  *
from agent_relation
where deleted is null and relationship_type=2;

Benefits of the view approach:
- guarantees on sanitizing data
- quick to apply changes
- no need to redeploy WF for simple corrections
- cleaner interface, as "internal only" content (e.g. the deletion auditing) is 
not exposed

This might be enough to warrant becoming the policy for all sqoop outs in the 
future; use a view and consider the structure like you would when defining a 
proper interface.

Original comment by timrobertson100 on 1 May 2011 at 6:21

GoogleCodeExporter commented 9 years ago
Tried a run with manually altered Hive scripts (added the IS NULL predicate) 
but that didn't help. Must have done something wrong or the problem lies 
elsewhere.

Original comment by lars.fra...@gmail.com on 2 May 2011 at 7:38

GoogleCodeExporter commented 9 years ago
The only remaining problem are the FishBase taxons.

Jose says this is not wrong.

Any ideas on how to handle this? I fixed the other stuff by modifying our Hive 
queries but pushing that out to views should be easy once this last problem is 
fixed.

Original comment by lars.fra...@gmail.com on 3 May 2011 at 5:43

GoogleCodeExporter commented 9 years ago
FishBase is publishing a) an occurrence dataset and b) a checklist (which is in 
the CoL). Those two are both represented as the same resource in the registry 
(bb9bbebe-49d9-45a8-923b-69aab3757d05).

As they are two distinct things we need two resources for them and we'll change 
CoL to point to this new resource instead in the registry and in the dataset 
table of CLB.

For the future the registry might want to distinguish between "occurrence 
resource", "checklist resource" and others and not just use one generic 
"resource"

Original comment by lars.fra...@gmail.com on 3 May 2011 at 6:34

GoogleCodeExporter commented 9 years ago
oh well, Fishbase had 2 resources about 1 month ago (15 April), but we merged 
them as it is  now, after some discussions with Kyle and Markus. I will undo 
these changes again.

Original comment by josecua...@gmail.com on 3 May 2011 at 7:48

GoogleCodeExporter commented 9 years ago
Changes have been undone. Now there's 2 resources again:

Fishbase (occurrence dataset)
http://gbrds.gbif.org/browse/agent?uuid=197908d0-5565-11d8-b290-b8a03c50a862

Fishbase (checklist)
http://gbrds.gbif.org/browse/agent?uuid=bb9bbebe-49d9-45a8-923b-69aab3757d05

The checklist resource never had any access point, that the way it was 
registered.

Original comment by josecua...@gmail.com on 3 May 2011 at 8:25

GoogleCodeExporter commented 9 years ago
Duplicates have been fixed after ignoring deleted agents and relations as well 
as separating the Fishbase resource into two.

Original comment by lars.fra...@gmail.com on 5 May 2011 at 7:49