openaustralia / planningalerts

Find out and have your say about what's being built and knocked down in your area.
https://www.planningalerts.org.au

Create Popolo data for the councillors we have in morph.io #896

Closed henare closed 8 years ago

henare commented 8 years ago

From #885:

We had hoped to use PopIt for managing councillor information, alas that's deprecated. The main advantage we wanted from PopIt (whether it could do it or not) was to allow anyone to contribute information. This is because collecting all the councillor details and keeping them up to date is a massive job.

EveryPolitician has emerged out of PopIt and it seems like a similar model could work for us. Having the councillor data in Popolo means it's open and reusable, and that's nice.

EveryPolitician generally gets its information from scrapers on morph.io. We have a bunch of scrapers for councillor information but these generally don't have email addresses and are also quite messy.

So we need an easy way for people to add to this information and tidy it up. Google Sheets is the obvious choice for people to collaborate on CSV data like this. A key part of EveryPolitician is Tony's CSV to Popolo converter and this could presumably be used on Google's CSV export.

So the process would be to manually import data from those scrapers into one or more Google Sheets. We could then open up editing (to anyone? can you easily revert changes?) and occasionally we run csv_to_popolo to convert that into Popolo and store it somewhere (GitHub like EveryPolitician?).
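As a rough illustration of the export step, here is a minimal Python sketch. It assumes the sheet is shared so that anyone with the link can view it, and that Google's `export?format=csv` URL pattern is used to fetch one tab. The function names and the sample columns are hypothetical and are not taken from the real spreadsheet. csv_to_popolo would still do the actual conversion.

```python
import csv
import io
from urllib.parse import urlencode


def sheet_csv_export_url(sheet_id: str, gid: str) -> str:
    """Build the CSV export URL for one tab of a Google Sheet.

    Assumption: the sheet is viewable by anyone with the link.
    """
    query = urlencode({"format": "csv", "gid": gid})
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?{query}"


def parse_rows(csv_text: str) -> list[dict[str, str]]:
    """Parse exported CSV text into one dict per row, keyed by header."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Hypothetical sample data; the real sheet's headers may differ.
sample = "name,council\nJane Hats,Example Council\n"
rows = parse_rows(sample)
print(sheet_csv_export_url("SHEET_ID", "0"))
print(rows)
```

The parsed rows could then be written back out as a local CSV file and handed to the converter.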

henare commented 8 years ago

Our goal for this issue is to get all of the data we've already collected on morph.io into Popolo that we can import into PlanningAlerts. This does not include states that we haven't scraped yet or email addresses we don't already have. They should be created as further issues once this is done.

equivalentideas commented 8 years ago

We could then open up editing (to anyone? can you easily revert changes?)

You can restore and it has super detailed revision history :ok_hand:

equivalentideas commented 8 years ago

I've had a quick look at the Popolo standard, EveryPolitician and the csv-popolo importer.

I'm gonna pull one of the scraper CSVs into a Google Spreadsheet and look at what we need to change to make this Popolo data.

equivalentideas commented 8 years ago

Made a Google Spreadsheet for us: https://docs.google.com/spreadsheets/d/1_Ea99E5yXnHXW62o_lRo9khdbccEWfttpy2tyuYZYOE/edit#gid=2055188575

We currently have the headers:

For Popolo, I think the minimum we need is:

In the future we might also like to add area to relate the member to a Local Government Area.

equivalentideas commented 8 years ago

id and group_id (highly recommended by the converter)

So the csv_popolo converter creates an id like "jane_hats" from name: "Jane Hats". As it warns in the readme, this won't do anything special for you if there are two people with the same name.

Why don't we address this when we have the problem, but I'll make a note:

equivalentideas commented 8 years ago

I'm gonna try and run the data we have for a state through the converter.

equivalentideas commented 8 years ago

The importer works quite smoothly, but it's interpreting all the councils as parties. As the docs say:

Popolo allows for very complex modelling of roles and posts. Here, however, we optimise for the most-common case: a legislator being associated with a single political party/faction, possibly representing a given region/constituency.

This isn't our case. We have: a legislator (Councillor) being associated with a single council authority.

So I think the next step is to adjust the csv_popolo converter to allow us to input the councils as organizations that aren't parties.
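To make the target shape concrete, here is a small Python sketch of Popolo-style output where each council is an organization that is not a party, and memberships link people to it. It is only an illustration of the data model, not the converter's code. The row keys (`id`, `name`, `council`), the `"council"` classification value and the `council/...` id scheme are all assumptions.

```python
import json


def council_popolo(rows):
    """Build a minimal Popolo-style document from councillor rows.

    Each row is a dict with the hypothetical keys "id", "name" and "council".
    """
    persons, organizations, memberships = [], {}, []
    for row in rows:
        # Assumed id scheme for councils; any stable unique value would do.
        org_id = "council/" + row["council"].lower().replace(" ", "_")
        organizations.setdefault(org_id, {
            "id": org_id,
            "name": row["council"],
            "classification": "council",  # assumed label, deliberately not "party"
        })
        persons.append({"id": row["id"], "name": row["name"]})
        memberships.append({
            "person_id": row["id"],
            "organization_id": org_id,
            "role": "councillor",
        })
    return {
        "persons": persons,
        "organizations": list(organizations.values()),
        "memberships": memberships,
    }


doc = council_popolo([
    {"id": "1", "name": "Jane Hats", "council": "Example Council"},
    {"id": "2", "name": "John Caps", "council": "Example Council"},
])
print(json.dumps(doc, indent=2))
```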

equivalentideas commented 8 years ago

So I think I've adapted the converter in as dumb a way as possible to now work for our councillor data. https://github.com/equivalentideas/csv_to_popolo

And, using it I've converted NSW to Popolo https://github.com/equivalentideas/australian_local_councillors_popolo/blob/master/nsw_local_councillor_popolo.json

I've just done the councillors' names, the council they are at, and their executive positions if they have any. We don't have party data yet, and I haven't included council websites at this point.

Gonna see how this works for our other states now.

equivalentideas commented 8 years ago

Worked well for QLD, Vic and NSW https://github.com/equivalentideas/australian_local_councillors_popolo

Our data for South Australia doesn't work as is. The names are in the 'last, first' format which the converter doesn't do anything about.

equivalentideas commented 8 years ago

Our data for South Australia doesn't work as is–

Looking more carefully at this data, there are some strange errors in it; for example, there are 700+ councillors assigned to Adelaide Hills.

I'm currently running a fresh scrape to see if it's something that's been fixed in the parse api.

equivalentideas commented 8 years ago

Yep, so the data on https://morph.io/openaustralia/sa_lg_councillors is wrong in places. I've forked it and am running a new version to get fresh data which, in my testing, removes these errors.

I also added to the scraper https://morph.io/equivalentideas/sa_lg_councillors a method to get those councillor names in the right format.
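For illustration, a name fix of this kind might look like the Python sketch below. It is not the method added to the scraper; the function name is hypothetical, and it assumes every affected name contains exactly one comma separating the last name from the first.

```python
def first_last(name: str) -> str:
    """Turn "Last, First" into "First Last"; leave other names untouched."""
    if "," not in name:
        return name.strip()
    last, first = name.split(",", 1)
    return f"{first.strip()} {last.strip()}"


print(first_last("Hats, Jane"))
```

Names with titles or suffixes after the comma would need extra handling that this sketch does not attempt.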

equivalentideas commented 8 years ago

I've now done SA and uploaded it to GitHub https://github.com/equivalentideas/australian_local_councillors_popolo

equivalentideas commented 8 years ago

From above:

id and group_id (highly recommended by the converter)

So the csv_popolo converter creates an id like "jane_hats" from name: "Jane Hats". As it warns in the readme, this won't do anything special for you if there are two people with the same name.

Why don't we address this when we have the problem, but I'll make a note:

  • [x] check if there are different councillors with the same name

So it turns out there are councillors with the same name in different locations.

equivalentideas commented 8 years ago

So it turns out there are councillors with the same name in different locations.

If you view the popolo for NSW you'll find barry_johnston is the id of a single person who is a councillor at two councils. This should actually be two different people.
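The collision can be reproduced with a few lines of Python. This sketch only mimics the behaviour described above (an id derived from the name alone); it is not the converter's implementation, and "Council A" and "Council B" are placeholders for the two real councils.

```python
def name_slug(name: str) -> str:
    """Derive an id from a name alone, e.g. "Jane Hats" -> "jane_hats"."""
    return "_".join(name.lower().split())


rows = [
    {"name": "Barry Johnston", "council": "Council A"},
    {"name": "Barry Johnston", "council": "Council B"},
]

# Group councils by the name-derived id: two different people collapse into one key.
people = {}
for row in rows:
    people.setdefault(name_slug(row["name"]), []).append(row["council"])

print(people)
```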

The simplest way to deal with this that I can think of is for us to pre-populate the spreadsheet with our own ids. These could just be sequential numbers as far as I understand.

I think this will be simplest if I combine the different sheets for each state into one big sheet. Then I can easily add a unique id column for everyone.

henare commented 8 years ago

The simplest way to deal with this that I can think of is for us to pre-populate the spreadsheet with our own ids. These could just be sequential numbers as far as I understand.

I'm sure EveryPolitician has this problem too - what do they do?

equivalentideas commented 8 years ago

I'm sure EveryPolitician has this problem too - what do they do?

Most of the time they appear to get ids from their data sources when they scrape. Unfortunately, our data sources don't have ids for councillors.

Do you have any advice to offer on this @tmtmtmtm :wave: ? Have you had this problem of people with the same name when you don't get ids in the scrape?

tmtmtmtm commented 8 years ago

In retrospect I think having the CSV-to-Popolo simply fail if there wasn't an ID would have been better — so in general we try very hard to construct ids in the scraper even when there isn't anything obvious in the source, unless it's completely obvious that there won't be duplicates. (And even when there is an ID in the source, we're increasingly hesitant to use it as-is, unless we're sure that those IDs won't be re-used over time, as it seems lots of places do… so, for example, for Mexico we needed to explicitly prefix the 2015 ids to avoid clashes with the IDs used in the previous term)

This is largely something that is best done at data level, when you presumably actually have the ability to know which people are the same or different, rather than the conversion level. In your case, I'd suggest either constructing an artificial ID out of name+council (though of course that's just punting the problem further down the line until you have two of those…) or tweaking the scraper itself to generate IDs, presumably based on some logic around whether seeing the same name twice in a council should create a new ID or assume it's the same as the last one.
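A minimal Python sketch of the first option, an artificial id built from council plus name, is shown below. The function names and the `council/name` format are assumptions made for illustration. As noted above, it still cannot tell apart two people with the same name in the same council.

```python
import re


def slug(text: str) -> str:
    """Lowercase the text and collapse runs of non-alphanumerics into "_"."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def councillor_id(council: str, name: str) -> str:
    """Combine council and name so the same name in two councils gets two ids."""
    return f"{slug(council)}/{slug(name)}"


# Placeholder council names, used only for the demonstration.
id_a = councillor_id("Council A", "Barry Johnston")
id_b = councillor_id("Council B", "Barry Johnston")
print(id_a, id_b)
```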

(BTW the csv-to-popolo library has accumulated quite a lot of complexity along the way, and there are lots of un(der)documented features/gotchas. I'm happy to try to tidy some of that up on request, or answer questions about weird or unexpected behaviour, or suggest ways of adapting it for different scenarios — so far I've been the only known direct user of it, so I haven't been as diligent as I should have been on some fronts…)

equivalentideas commented 8 years ago

Thanks @tmtmtmtm that’s really helpful!

This is largely something that is best done at data level, when you presumably actually have the ability to know which people are the same or different, rather than the conversion level. In your case, I'd suggest either constructing an artificial ID out of name+council (though of course that's just punting the problem further down the line until you have two of those…)

I think that sounds like a good thing for me to try next.

(BTW the csv-to-popolo library has accumulated quite a lot of complexity along the way, and there are lots of un(der)documented features/gotchas. I'm happy to try to tidy some of that up on request, or answer questions about weird or unexpected behaviour, or suggest ways of adapting it for different scenarios — so far I've been the only known direct user of it, so I haven't been as diligent as I should have been on some fronts…)

I've been working with it in here and have done some dumb extensions for our specific use case. It took me a little while to get my head around it at first and to understand how to run it, the tests, etc., but aside from that I've found it quite easy to work in. If I add anything useful for the project I'll open a PR.

I think having the CSV-to-Popolo simply fail if there wasn't an ID would have been better

I think this would have been helpful, as I only realised we were losing people to this issue after checking the original data for duplicates.
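A check along these lines could run before conversion so that missing or repeated ids fail loudly rather than silently merging people. This is a hypothetical Python sketch, not a feature of csv_to_popolo; it assumes each row is a dict with an `id` key.

```python
from collections import Counter


def check_ids(rows):
    """Raise ValueError if any row lacks an id or if an id is used twice."""
    missing = [row for row in rows if not (row.get("id") or "").strip()]
    if missing:
        raise ValueError(f"{len(missing)} row(s) have no id")
    counts = Counter(row["id"] for row in rows)
    dupes = sorted(i for i, n in counts.items() if n > 1)
    if dupes:
        raise ValueError("duplicate ids: " + ", ".join(dupes))


# Passes silently: both ids are present and distinct.
check_ids([{"id": "a/jane_hats"}, {"id": "b/jane_hats"}])
```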

Thanks again Tony, I really appreciate it :star2:

equivalentideas commented 8 years ago

Last week I added ids to the NSW, SA and Victorian scrapers. They're now handling councillors with the same name correctly :) I've got to do the same for Queensland now.

equivalentideas commented 8 years ago

:fireworks: We now have basic Popolo for NSW, SA, Victoria and QLD over here :point_down: https://github.com/openaustralia/australian_local_councillors_popolo

henare commented 8 years ago

Totally amazing :zap: Great work @equivalentideas :metal: