mysociety / pombola

GNU Affero General Public License v3.0

Some committees are duplicated #1216

Open geoffkilpin opened 10 years ago

geoffkilpin commented 10 years ago

Some (all?) committees exist twice in the database, e.g:

This means that on some profile pages a committee is listed more than once - e.g. http://za-pombola.staging.mysociety.org/person/charles-danny-kekana/

I am not sure of the source - was a scraper of Parliament's website ever added? The source of committee membership should be the PMG scraper.

I originally picked up on this on my local installation - so somewhere during the import of data the duplication is arising.

paullenz commented 10 years ago

So, there are some instances of two committees existing. I think this could be because both the scraper output and the original parliamentary info were entered (the latter, I believe, being manually managed by PMG). I think we need to consider which version is canonical. My assumption is that the ones with "portfolio" in the committee name are scraped; however, they aren't associated with an organisation (e.g. National Assembly), so this would need to be tweaked. Also, their names are not as user-friendly as the other set.

geoffkilpin commented 10 years ago

The source of committee membership data should be the PMG website (e.g. http://www.pmg.org.za/committees/Communications) and ideally the PMG scraper should be ensuring that changes on the PMG website are mirrored on the PA site (if I recall correctly this currently isn't the case?).

Based on the list at http://www.pmg.org.za/committees I think that at least in this case 'Communications' is the correct committee (although I think the full name is preferable).

dracos commented 10 years ago

This is related to, or a duplicate of, #878.

paullenz commented 10 years ago

So to clarify: is http://za-pombola.staging.mysociety.org/organisation/portfolio-committee-on-communications/ scraper-generated or manually created?

If the former, then we need the scraper to associate the committee with an organisation.

paullenz commented 10 years ago

Further update:

http://za-pombola.staging.mysociety.org/organisation/social-development/
http://za-pombola.staging.mysociety.org/organisation/portfolio-committee-on-social-development/
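Pairs like the two above could be spotted by comparing slugs once a known prefix is stripped. The following is only a minimal sketch in plain Python: the `PREFIXES` tuple, the function names and the sample list are assumptions for illustration, and it does not touch the Pombola database or models.

```python
from collections import defaultdict

# Assumption: only the "portfolio-committee-on-" prefix seen in this thread
# is handled; other committee prefixes would need adding by hand.
PREFIXES = ("portfolio-committee-on-",)


def base_slug(slug):
    """Strip a known committee prefix so both variants share one key."""
    for prefix in PREFIXES:
        if slug.startswith(prefix):
            return slug[len(prefix):]
    return slug


def likely_duplicates(slugs):
    """Group slugs by base form; return only groups with more than one slug."""
    groups = defaultdict(list)
    for slug in slugs:
        groups[base_slug(slug)].append(slug)
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}


slugs = [
    "social-development",
    "portfolio-committee-on-social-development",
    "portfolio-committee-on-communications",
]
print(likely_duplicates(slugs))
```

A match on slugs alone is only a hint; each pair would still need checking by eye before anything is merged or deleted.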

paullenz commented 10 years ago

I think that, in the interests of pragmatism, we should simply delete the dupe committees that have the least comprehensive membership information, and look into the options for automated scraper-driven updating post launch. If no one objects massively then I will get on and do this.
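As a rough sketch of that rule (keep whichever duplicate has the most memberships, delete the rest), assuming the duplicate groups and per-committee membership counts have already been collected somehow. The function name and the counts below are made up for illustration; nothing here is taken from the real data.

```python
def split_keep_and_delete(duplicate_groups, membership_counts):
    """For each non-empty group of duplicate slugs, keep the one with the
    most memberships and mark the others for deletion."""
    keep, delete = [], []
    for slugs in duplicate_groups:
        ranked = sorted(
            slugs,
            key=lambda slug: membership_counts.get(slug, 0),
            reverse=True,
        )
        keep.append(ranked[0])
        delete.extend(ranked[1:])
    return keep, delete


# Hypothetical input: the counts are invented, not read from the database.
groups = [["social-development", "portfolio-committee-on-social-development"]]
counts = {
    "social-development": 3,
    "portfolio-committee-on-social-development": 11,
}
keep, delete = split_keep_and_delete(groups, counts)
print("keep:", keep)
print("delete:", delete)
```

When two duplicates have the same count, the sort keeps their input order, so ties would be better reviewed manually than decided by this function.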

paullenz commented 10 years ago

@geoffkilpin is this still a live issue from your perspective?

paullenz commented 10 years ago

Just a ping to @geoffkilpin to see if this is still an ongoing concern

geoffkilpin commented 10 years ago

I've just been looking into this as PMG is looking to manually update the site to reflect changes to the committee structure. It seems that there are 4 committee organisation kinds:

The duplication seems to come between the first organisation kind and the other three. The suggestion that I have made to address this is to:

This can all be done manually. Is there anything which I might have missed?

mhl commented 10 years ago

Hi @geoffkilpin - thanks for looking into this - indeed, it's very confusing, and anything you can do to resolve that would be helpful. As well as the 4 committee OrganisationKinds that you mentioned, there are also:

I'm not sure if there's any duplication between those types and the others. I printed out all organisations of OrganisationKinds that match "committee", grouped by that kind, and any identifiers associated with them:

Those of kind "Committee" don't have any org.mysociety.za schema identifiers, which I think means that they were added after the initial data import. (Everything in the initial data import from the Popolo JSON that was based on the CSV files and scraping PMG had one of those identifiers, I believe.)
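The listing referred to above is not reproduced in this thread. Purely as an illustration of that kind of check, here is a plain-Python sketch that groups organisations by any kind whose name contains "committee" and reports which kinds have no identifier in a given scheme. The record layout, the function names and all sample records are hypothetical; only the `org.mysociety.za` scheme and the "Committee" kind name come from the discussion.

```python
from collections import defaultdict


def group_committees(organisations):
    """Group (name, identifiers) pairs by kind, for kinds matching 'committee'."""
    grouped = defaultdict(list)
    for org in organisations:
        if "committee" in org["kind"].lower():
            grouped[org["kind"]].append((org["name"], org["identifiers"]))
    return dict(grouped)


def kinds_missing_scheme(grouped, scheme="org.mysociety.za"):
    """Return the kinds in which no organisation has an identifier in `scheme`."""
    return sorted(
        kind
        for kind, orgs in grouped.items()
        if not any(
            found_scheme == scheme
            for _, identifiers in orgs
            for found_scheme, _ in identifiers
        )
    )


# Hypothetical records: identifiers are (scheme, value) pairs.
organisations = [
    {"name": "Example A", "kind": "Committee", "identifiers": []},
    {
        "name": "Example B",
        "kind": "Hypothetical Other Committee Kind",
        "identifiers": [("org.mysociety.za", "example/1")],
    },
    {"name": "Example C", "kind": "House", "identifiers": []},
]
grouped = group_committees(organisations)
print(kinds_missing_scheme(grouped))
```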

So, if that matches your understanding as well, I'm basically OK with your plan with some small suggestions:

Does that sound sensible to you?

geoffkilpin commented 10 years ago

Hi @mhl - many thanks for taking a look at this and for spotting the extra committees. I will discuss with PMG what to do about the provincial committees.

To respond specifically to your suggested changes to my plan:

On a slightly related note - as far as I can tell SlugRedirects are not created when a slug is edited (so I won't modify slugs when correcting names to be their official names), but might this perhaps be something worth adding at some point?

mhl commented 10 years ago

Hi @geoffkilpin - sure, I'm happy to go with whatever you think's best with regard to merging or not, based on data quality.

Yes, SlugRedirects really should be created on editing slugs. The support for slug redirection was initially very basic - I improved it quite a bit in this recent pull request but didn't get to doing that... I'll create a ticket for it now.
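To make the intended behaviour concrete, here is a small in-memory sketch of "create a redirect whenever a slug is edited". It is not the Pombola SlugRedirect model or its API; the `SlugRegistry` class and its methods are hypothetical and exist only to illustrate the bookkeeping, including re-pointing older redirects so they do not form chains.

```python
class SlugRegistry:
    """Toy model: a set of live slugs plus old-slug -> live-slug redirects."""

    def __init__(self):
        self.current = set()
        self.redirects = {}

    def add(self, slug):
        self.current.add(slug)

    def rename(self, old, new):
        """Rename a live slug and record a redirect from the old one."""
        if old not in self.current:
            raise KeyError(old)
        if new in self.current:
            raise ValueError("slug already in use: " + new)
        self.current.remove(old)
        self.current.add(new)
        # Re-point redirects that targeted the old slug, avoiding chains.
        for source, target in self.redirects.items():
            if target == old:
                self.redirects[source] = new
        self.redirects[old] = new
        # The new slug is live now, so a stale redirect from it must go.
        self.redirects.pop(new, None)

    def resolve(self, slug):
        """Return the live slug for `slug`, or None if it is unknown."""
        if slug in self.current:
            return slug
        return self.redirects.get(slug)


registry = SlugRegistry()
registry.add("communications")
registry.rename("communications", "portfolio-committee-on-communications")
print(registry.resolve("communications"))
```

In a Django project the same idea would normally hang off the model's save path, but how that is wired up in Pombola is not shown in this thread.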

Incidentally, to correct my earlier comment and the potentially confusing gist, @dracos pointed out to me that the organisations of kind Committee did have org.mysociety.za identifiers in the original JSON - they're still in the database, but pointing to now deleted objects. I can't remember off-hand why this might have happened, but I don't think it's important for your proposed changes.

geoffkilpin commented 10 years ago

Thanks @mhl. I think the 'Committee' kind organisations were scraped from Parliament's website as the others are from the PMG site. I seem to recall Parliament's list was quite out of date which is why we went with PMG's, but I'll check all that when working out whether to merge or delete.

paullenz commented 8 years ago

I believe this has been resolved by consuming the API

mhl commented 6 years ago

I'm reopening this, because I don't think it ever was resolved in the way that Paul suggests - we're using the PMG API to find committee appearances, but memberships of those committees are still being maintained in the Pombola admin and there is still confusion over which committees to use due to these duplicates. @chrismytton is looking into this.

chrismytton commented 6 years ago

From looking at the data in the database and re-reading this thread it seems that the following actions are needed before we can close this ticket:

I think most of this can be done from the admin, so it might be worth explaining the situation to PMG and seeing if they can do some of the work needed.

We might also need to make some changes in the admin to make it more obvious which committees should be used as @mhl mentions above.

e.g. hiding non-current organisations from organisation kind views, not autocompleting them in the admin, perhaps adding a warning at the top of the admin page for an old organisation, etc.
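One of those ideas, not autocompleting non-current organisations, might look roughly like the sketch below. It is plain Python over dictionaries rather than Django admin code, and the `ended` field, the function names and the sample records are all assumptions made for illustration.

```python
from datetime import date


def is_current(org, today):
    """An organisation is current if it has no end date or ends today or later."""
    ended = org.get("ended")
    return ended is None or ended >= today


def autocomplete(orgs, term, today):
    """Return names of current organisations whose name contains `term`."""
    term = term.lower()
    return [
        org["name"]
        for org in orgs
        if is_current(org, today) and term in org["name"].lower()
    ]


# Hypothetical records: names and dates are invented for the example.
orgs = [
    {"name": "Old Example Committee", "ended": date(2014, 5, 6)},
    {"name": "Current Example Committee", "ended": None},
]
print(autocomplete(orgs, "committee", today=date(2019, 1, 1)))
```

Passing `today` in explicitly keeps the filter easy to test; a warning banner on the admin page for an old organisation could reuse the same `is_current` check.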