Closed GoogleCodeExporter closed 9 years ago
I have a hunch that there may be 2 resources for the same UUID - one from one
provider, and the same resource in catalogue of life. If that is the case, we
need to fix content in registry DB. Please find UUIDs and liaise with Jose if
needed - he will need to know which DB it is.
Original comment by timrobertson100
on 29 Apr 2011 at 6:17
Indeed CLB tells me it is ITIS
clb.dataset says UUID: "2cffcaed-aa4f-4f02-8961-a14df9b3ac0a" which is coming
as part of the catalogue of life checklist.
the registry perhaps has 2 ITIS entries with the same UUID? Should only be 1
with that UUID
Original comment by timrobertson100
on 29 Apr 2011 at 6:21
in the registry there is 1 agent for ITIS with agent ID 1541
However, this is CORRECTLY tied to 2 providers:
mysql> select count(*) from agent_relation where to_agent_id=1541;
+----------+
| count(*) |
+----------+
| 2 |
+----------+
1 row in set (0.01 sec)
Could this be the source of the duplicate?
Original comment by timrobertson100
on 29 Apr 2011 at 6:26
UUID "2cffcaed-aa4f-4f02-8961-a14df9b3ac0a" has only 1 resource associated.
There are 2 relations but one of them is deleted. the sql should filter out
deleted!=null
Original comment by josecua...@gmail.com
on 29 Apr 2011 at 6:28
Agent relationships seem to be marked as deleted Lars:
mysql> select * from agent_relation where to_agent_id=1541;
+------+------------------+--------------+---------------+-------------+--------
-------------+---------------------+-------------------+---------------------+--
------+
| id | relation_type_id | has_accepted | from_agent_id | to_agent_id |
created | modified | verification_code | deleted
| direct |
+------+------------------+--------------+---------------+-------------+--------
-------------+---------------------+-------------------+---------------------+--
------+
| 1483 | 2 | 1 | 1639 | 1541 |
2011-03-15 15:45:42 | 2011-03-15 15:45:42 | NULL | NULL
| NULL |
| 1618 | 2 | 1 | 1689 | 1541 |
2011-04-15 17:21:08 | 2011-04-15 17:21:21 | EC2ZKK9 | 2011-04-15
17:24:33 | NULL |
+------+------------------+--------------+---------------+-------------+--------
-------------+---------------------+-------------------+---------------------+--
------+
2 rows in set (0.02 sec)
Is the registry sqoop out VIEW taking into account the "and delete is null" ?
That would be 1 potential source of duplication... if not, please discuss with
Jose - we're so close.....
Original comment by timrobertson100
on 29 Apr 2011 at 6:32
Deleted agents or relations are not considered at all. Sqoop can have an
optional WHERE clause that might help.
Original comment by lars.fra...@gmail.com
on 29 Apr 2011 at 6:39
No agent is duplicated in the registry, but there is indeed one case where one
single agent (Fishbase) is linked to 2 Organisations
agent_id=1561
It can be seen here:
http://gbrds.gbif.org/browse/agent?uuid=bb9bbebe-49d9-45a8-923b-69aab3757d05
Conceptually this is not wrong I think, but maybe could cause some issues on
the part of you guys.
Original comment by josecua...@gmail.com
on 29 Apr 2011 at 6:42
Perhaps just add the deleted clause in Hive?
e.g. in create_data_provider.q :
JOIN ${registry_agent_relation} rel ON org.id=rel.from_agent_id AND
rel.relation_type_id=2 AND rel.deleted IS NULL
Could be used in other places too though
Original comment by timrobertson100
on 29 Apr 2011 at 9:26
Adding
--where "deleted IS NULL"
to Sqoop doesn't work in our Oozie version. It's supported in the version
coming with CDH3u0 though.
We don't have a view for agent_relation so I'm just going to add another
predicate wherever it is needed and will try again.
Thanks for figuring this out you two!
Original comment by lars.fra...@gmail.com
on 30 Apr 2011 at 9:18
Perhaps the cleanest is to leave the WF as it is and then create 2 views on the
registry like you have on the portal?
This way we can add the sanitizing of \t etc, and also ignore all content that
is marked as deleted.
create or replace view sqoop_agent as
select
id,
clean(description)
...
from agent
where deleted is null;
-- we ignore deleted, and only care about the provides relationships
create or replace view sqoop_agent_relation as
select
*
from agent_relation
where deleted is null and relationship_type=2;
Benefits of the view approach:
- guarantees on sanitizing data
- quick to apply changes
- no need to redeploy WF for simple corrections
- cleaner interface, as "internal only" content (e.g. the deletion auditing) is
not exposed
This might be enough to warrant becoming the policy for all sqoop outs in the
future; use a view and consider the structure like you would when defining a
proper interface.
Original comment by timrobertson100
on 1 May 2011 at 6:21
Tried a run with manually altered Hive scripts (added the IS NULL predicate)
but that didn't help. Must have done something wrong or the problem lies
elsewhere.
Original comment by lars.fra...@gmail.com
on 2 May 2011 at 7:38
The only remaining problem are the FishBase taxons.
Jose says this is not wrong.
Any ideas on how to handle this? I fixed the other stuff by modifying our Hive
queries but pushing that out to views should be easy once this last problem is
fixed.
Original comment by lars.fra...@gmail.com
on 3 May 2011 at 5:43
FishBase is publishing a) an occurrence dataset and b) a checklist (which is in
the CoL). Those two are both represented as the same resource in the registry
(bb9bbebe-49d9-45a8-923b-69aab3757d05).
As they are two distinct things we need two resources for them and we'll change
CoL to point to this new resource instead in the registry and in the dataset
table of CLB.
For the future the registry might want to distinguish between "occurrence
resource", "checklist resource" and others and not just use one generic
"resource"
Original comment by lars.fra...@gmail.com
on 3 May 2011 at 6:34
oh well, Fishbase had 2 resources about 1 month ago (15 April), but we merged
them as it is now, after some discussions with Kyle and Markus. I will undo
these changes again.
Original comment by josecua...@gmail.com
on 3 May 2011 at 7:48
Changes have been undone. Now there's 2 resources again:
Fishbase (occurrence dataset)
http://gbrds.gbif.org/browse/agent?uuid=197908d0-5565-11d8-b290-b8a03c50a862
Fishbase (checklist)
http://gbrds.gbif.org/browse/agent?uuid=bb9bbebe-49d9-45a8-923b-69aab3757d05
The checklist resource never had any access point, that the way it was
registered.
Original comment by josecua...@gmail.com
on 3 May 2011 at 8:25
Duplicates have been fixed after ignoring deleted agents and relations as well
as separating the Fishbase resource into two.
Original comment by lars.fra...@gmail.com
on 5 May 2011 at 7:49
Original issue reported on code.google.com by
timrobertson100
on 29 Apr 2011 at 6:14