magda-io / magda

A federated, open-source data catalog for all your big data and small data
https://magda.io
Apache License 2.0

Multiple instances of the same organisation #1530

Open aneesha09 opened 6 years ago

aneesha09 commented 6 years ago

Problem description

There are multiple instances of the same organisation, but when you click into any of them the number of datasets is consolidated.

Note: this is for the naive solution that just hides the problem - actually solving it well is in https://github.com/magda-io/magda/issues/1530

Problem reproduction steps

  1. Go to Organisations
  2. Search for "csiro"

Result: There are 3 organisations called CSIRO. One has a description; two don't and have the same contact details. When you click through to any of them, the total number of datasets is shown as 71, which is the sum of the datasets associated with all of them (62+9+2 datasets).

Expected result: If the datasets are consolidated, the organisations should be consolidated into one and should show all the information, i.e. description and contact information. (How will we handle it if there are multiple instances of the same info? Maybe we should first find the root cause of why we have three, and why they are consolidated for the number of datasets?)
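One way to picture the consolidated view described above is a merge that keeps the first non-empty description, collects every distinct contact email, and sums the dataset counts. This is only a sketch: `OrgRecord`, `MergedOrg`, `merge_records`, the example.org addresses and the counts are all hypothetical and do not reflect Magda's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OrgRecord:
    """Hypothetical shape of one harvested organisation record."""
    name: str
    description: Optional[str] = None
    email: Optional[str] = None
    dataset_count: int = 0


@dataclass
class MergedOrg:
    """Hypothetical consolidated view shown to the user."""
    name: str
    description: Optional[str]
    emails: List[str]
    dataset_count: int


def merge_records(records: List[OrgRecord]) -> MergedOrg:
    if not records:
        raise ValueError("need at least one record to merge")
    # Keep the first non-empty description rather than picking one arbitrarily.
    description = next((r.description for r in records if r.description), None)
    # Collect distinct emails in first-seen order so repeated contacts appear once.
    emails: List[str] = []
    for r in records:
        if r.email and r.email not in emails:
            emails.append(r.email)
    return MergedOrg(
        name=records[0].name,
        description=description,
        emails=emails,
        dataset_count=sum(r.dataset_count for r in records),
    )


# Made-up sample data for illustration only.
merged = merge_records([
    OrgRecord("CSIRO", email="contact-a@example.org", dataset_count=5),
    OrgRecord("CSIRO", email="contact-a@example.org", dataset_count=3),
    OrgRecord("CSIRO", description="National science agency.", dataset_count=1),
])
print(merged)
```

A merge like this only hides the duplication in the UI; it does not explain why three records exist in the first place.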

Screenshot / Design / File reference

Screenshots: csiro___data_gov_au, organisations___data_gov_au, csiro___data_gov_au

aneesha09 commented 6 years ago

Fix just the UI: the UI will show them merged into one organisation.

QuadTog commented 6 years ago

We need to resolve this issue before launch. Currently, when a user searches MAGDA for an organisation (such as CSIRO) they see:

CSIRO (69 Datasets)

Email: piers.dunstan@csiro.au Website: https://orcid.org/0000-0002-2568-5945 Phone: +61 3 62325382 No desc.

CSIRO (11 Datasets)

Email: donna.hayes@csiro.au No desc.

CSIRO (2 Datasets)

At the Commonwealth Scientific and Industrial Research Organisation (CSIRO), we shape the future. We do this by using science to solve real issues. Our research makes a difference to people, industry and the planet.

We ask, we seek, we solve. As Australia’s national science agency we’ve been pushing the edge of what’s possible for over 85 years - and we're not stopping now.

Since we started life as the Advisory Council of Science and Industry in 1916, we’ve advanced Australia with a range of inventions and innovations that have had significant positive impact on the lives of people around the world. No contact info.

We need a way to unify multiple duplicate organisations into a single org. Currently each time an org is updated, each portal can overwrite the 'master' organisation info. When we hit multiple portals at once it means we store each portal's version of CSIRO as a separate entity or rewrite the same entity over and over again. This is creating user confusion and difficulties when explaining why a user cannot change their info/contact information. For example, below are SOME of the portals being harvested for CSIRO org info:

aneesha09 commented 6 years ago

Hi @QuadTog,

Could you share examples of users confused by this? Their exact queries would be helpful so we can understand the problem a bit better from the end-user perspective. There is no easy fix for this, and the usage data would really help us find the best possible solution.

Also, would you explain why you think this is a showstopper? When is the launch? So we can see if we can schedule a fix.

Thanks

QuadTog commented 6 years ago

Hey Aneesha,

I sure can! So the issues seem to fall into two broad categories:

1) Owners/users confused by organisations and where/how to edit the details:

I am the Lead Communications Officer at the NSW Bureau of Health Information.

I am looking into the presentation of information and data from our organisation on your website.

In exploring the site, I have learned that information from individual organisations can be managed by the organisation through their own page on the site. ...

Are you able to advise how best to confirm BHI already has an account, and if so, how to edit the contact information. ... Submitted from: https://search.data.gov.au/

In this case, they couldn't find where their account/org came from, as it was being pulled from: https://data.nsw.gov.au/data/organization/about/bureau-of-health-information Unfortunately, they still haven't updated their contact info on data.NSW, so we can't redirect any complaints.

2) Owners wishing to consolidate their organisations/contact info. I was chatting with the ABS yesterday and they were asking how to consolidate their organisation info into one consistent view, because users were unable to contact them and they're missing some metadata/sub-orgs.

The ABS has 253 consolidated datasets; 59 from DGA, 191 from WA and 3 from NSW. But depending on which dataset you access you'll see one of the following:

Data.gov.au image Data.gov.au - SubOrg image

Data.WA image

Data.NSW image

We believe this is an issue because it prevents us from constructing an ontological representation of organisations. Without proper definitions, properties and relationships between datasets, entities and domains we can't limit this complexity (which means as we harvest more sources we'll continue to encounter more duplication) let alone begin to semantically link data.

aneesha09 commented 6 years ago

Hi @QuadTog,

Thanks so much for these examples. The first one does not seem to be related to this issue since BHI does not show as multiple organisations on search.data.gov.au. You might want to develop an FAQ around how organisations can update their own info which doesn't happen on this instance anyway.

Regarding ABS' query, would you kindly provide the contact of the person and/or ask them to forward the user queries around being unable to contact them and about missing metadata and suborgs?

Thanks

gordjw commented 6 years ago

Hi @aneesha09,

I think we have two scenarios (ABS and Premier and Cabinet).

1) ABS - A search for ABS returns 5 results, of which 3 should be grouped (actually all 5 probably should be, but let's look at sufficient solutions)

2) Premier and Cabinet - None of these search results should be grouped (all are from different jurisdictions), but there are several with identical names, which trigger the current naive matching.

So the minimum sufficient solution seems to me to be grouping by name and jurisdiction. Noting that URL/portal isn't the same as jurisdiction (e.g. the ABS on NSW data portal should still be federal jurisdiction, and grouped with ABS from data.gov.au).

This would cover both of the use cases above, and most others I can imagine. It will also expose where we have data quality issues (missing or wrong jurisdiction) vs an actual grouping problem in MAGDA.

Proposal:

1) Update the grouping strategy to be name and jurisdiction (not just name)

2) Group Organisations in the search results page, as well as the Org detail page
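As a rough illustration of the first proposal item, grouping on a (name, jurisdiction) key rather than on name alone might look like the sketch below. The `Organisation` record, `group_key`, `group_organisations`, the jurisdiction strings and the Premier and Cabinet counts are assumptions made up for the example, not Magda's schema or real data. The source portal is deliberately left out of the key, since URL/portal isn't the same as jurisdiction.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Organisation:
    """Hypothetical record shape for illustration; not Magda's actual schema."""
    name: str
    jurisdiction: Optional[str]
    source: str
    dataset_count: int


def _fold(text: Optional[str]) -> str:
    # Collapse whitespace and ignore case so trivial variations do not split a group.
    return " ".join((text or "").split()).casefold()


def group_key(org: Organisation) -> Tuple[str, str]:
    # Group on name AND jurisdiction; the source portal is not part of the key.
    return (_fold(org.name), _fold(org.jurisdiction))


def group_organisations(
    orgs: List[Organisation],
) -> Dict[Tuple[str, str], List[Organisation]]:
    groups: Dict[Tuple[str, str], List[Organisation]] = defaultdict(list)
    for org in orgs:
        groups[group_key(org)].append(org)
    return dict(groups)


# Illustrative sample records; jurisdiction values and some counts are made up.
sample = [
    Organisation("Australian Bureau of Statistics", "Commonwealth of Australia", "data.gov.au", 59),
    Organisation("Australian Bureau of Statistics", "Commonwealth of Australia", "data.nsw.gov.au", 3),
    Organisation("Premier and Cabinet", "New South Wales", "data.nsw.gov.au", 10),
    Organisation("Premier and Cabinet", "Victoria", "data.vic.gov.au", 7),
]

grouped = group_organisations(sample)
for (name, jurisdiction), members in grouped.items():
    total = sum(m.dataset_count for m in members)
    print(f"{name} / {jurisdiction}: {len(members)} record(s), {total} datasets")
```

With this key the two ABS records fall into one group while the two identically named Premier and Cabinet records stay apart. Records with a missing or inconsistent jurisdiction would still split, which is the data quality signal mentioned above.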

What do you think?

aneesha09 commented 6 years ago

@gordjw - Thanks for the insight. That's very helpful.

@AlexGilleran - Do you see any issues with Gordon's approach? Questions/thoughts?

AlexGilleran commented 6 years ago

It solves the problem for CKAN-derived datasets, but data.json doesn't have a field for jurisdiction, and I can't remember seeing it in any CSW datasets (although it might be catered for in the spec), so we'll still have to cater for them.

So for instance, ABS data from data.gov.au and data.nsw.gov.au (both CKAN) will have the right jurisdiction on them and get grouped together correctly (assuming they both use the same string!), but ABS datasets on data.act (e.g. https://www.data.act.gov.au/Health/ABS-Census/id85-krdd) will have to have some kind of default jurisdiction.

This might be near enough to be good enough though? At least I'm pretty sure it'd be an improvement? Hard to say without putting it all together. E.g. I can't actually find any ABS datasets on data.nsw or data.vic, but if their "jurisdiction" field was "Australia" instead of "Commonwealth of Australia" then we might be in an even worse situation :(

AlexGilleran commented 6 years ago

... rereading my comment, I don't think I made it clear that I still think doing the group-by-jurisdictions thing is a good idea and we should try it 😛 .

Just had a look around... there's actually not a lot of federal data on the state portals, so it's probably not going to be a massive problem accommodating them, although capturing the jurisdiction fields is only going to get us so far even with the CKAN ones - WA for instance doesn't have a jurisdiction field, and SA does but it sets it up as "Federal Government", and names its ABS "ABS - SA Data" anyway :(.
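The two problems raised in these comments (sources with no jurisdiction field, and sources that spell the same jurisdiction differently) could be handled before grouping with something like the sketch below. Everything in it is an assumption for illustration: the alias table, the per-source defaults, the portal keys and the `normalise_jurisdiction` helper are hypothetical, and choosing the right default for a given source is a policy decision this sketch does not settle.

```python
from typing import Optional

# Hypothetical alias table: variant spellings mapped to one canonical label.
JURISDICTION_ALIASES = {
    "australia": "Commonwealth of Australia",
    "federal government": "Commonwealth of Australia",
    "commonwealth of australia": "Commonwealth of Australia",
}

# Hypothetical fallback for sources that publish no jurisdiction field at all.
DEFAULT_JURISDICTION_BY_SOURCE = {
    "data.act.gov.au": "Australian Capital Territory",
    "data.wa.gov.au": "Western Australia",
}

UNKNOWN = "Unknown"


def normalise_jurisdiction(raw: Optional[str], source: str) -> str:
    # Collapse whitespace so " Australia " and "Australia" compare equal.
    cleaned = " ".join((raw or "").split())
    if not cleaned:
        # No value published: fall back to the source's default, if one is configured.
        return DEFAULT_JURISDICTION_BY_SOURCE.get(source, UNKNOWN)
    # Known variants map to the canonical label; anything else passes through unchanged.
    return JURISDICTION_ALIASES.get(cleaned.casefold(), cleaned)


print(normalise_jurisdiction("Australia", "data.nsw.gov.au"))
print(normalise_jurisdiction("Federal Government", "data.sa.gov.au"))
print(normalise_jurisdiction(None, "data.wa.gov.au"))
print(normalise_jurisdiction("", "example-portal"))
```

Passing unrecognised values through unchanged, and labelling missing ones as unknown when no default exists, keeps the data quality problems visible rather than silently guessing.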

gordjw commented 6 years ago

Yeah, I was worried you might say something like that 😄

I still think the best way is to code for the ideal scenario (or at least the feasible version), and we treat data quality problems as data quality problems. IMO it's good that MAGDA is exposing this.

If you folks can get name + jurisdiction grouping working, we'll get in touch with the state portal admins to see if we can get everyone publishing jurisdiction (and consistently).

What config changes are required to start ingesting new CKAN metadata fields, or do we just get everything that's published?

AlexGilleran commented 6 years ago

We do technically ingest everything that's published, but we'll need to make some code changes to actually make use of the jurisdiction. I'll make a new epic for it 👍

gordjw commented 6 years ago

Thanks @AlexGilleran!

aneesha09 commented 6 years ago

@gordjw - What about the scenario where the jurisdiction on a dataset and the jurisdiction on an organisation don't match? Also, this looks like a really big fix. Do you have a view on an interim solution?

QuadTog commented 5 years ago

Amending user report (two orgs: data.gov.au/arcgis server):

"Hi Team

I have noticed an issue with the Logan data display on the new beta site. When I search for Logan City, I get 2 entries under Logan City Council as per the screen shot below.

Can you please investigate the issue and let me know when it can be fixed."