mysociety / pombola

GNU Affero General Public License v3.0

[ZA] Change hansard scraping location from ZA Government website to PMG website #1855

Open JenMysoc opened 8 years ago

JenMysoc commented 8 years ago

The official website hasn't been updated since February but PMG have someone who is compiling the Hansard for them so their records are more up to date. It would therefore be good to switch the scraping to the PMG hansard.

tmtmtmtm commented 8 years ago

@JenMysoc the email from PMG says "we suggest that [Hansard on People’s Assembly] links directly to the PMG website Hansard" — I read that as saying we should replace the link in the header bar:

[screenshot of the header bar, 2015-12-14 22:49:53]

to go to https://pmg.org.za/hansards/ instead of http://www.pa.org.za/hansard, rather than to change our own scraping?

paullenz commented 8 years ago

I raised this with them. This change would break all of the person linking etc. on the person pages, so re-scraping is the preferred option.

JenMysoc commented 8 years ago

Yeah sorry that was my bad wording. What I meant was re-scraping the hansard from PMG's website instead of scraping from the official site as PMG's database is more up to date.

tmtmtmtm commented 8 years ago

Do we know if this is effectively the same format of documents as before, and we'd just be changing where we get them from, but could run the existing parser against them? If they'd be a different format, then this is a substantially larger job.

paullenz commented 8 years ago

I have a todo item to look at this very thing - realistically going to be tomorrow now though as lots of meetings today

paullenz commented 8 years ago

So the documents are in different formats - https://pmg.org.za/hansard/21320/

vs.

http://www.parliament.gov.za/live/commonrepository/Processed/20151207/613254_1.doc

Though there is an API - https://api.pmg.org.za/hansard/

I suspect - but don't know - that PMG receive the files in the same word format as previously published on the Parliament site

JenMysoc commented 8 years ago

From PMG: You can find [examples of] this at https://pmg.org.za/hansards/

Just 2 points to clarify:

Code4SA will be making cosmetic changes so there are two columns to show NA and NCOP (EPC will fall under NA; Joint Sittings (JS) will appear in both NA and NCOP).

Also, note how we provide an agenda-title for each hansard debate. We often make changes to both agenda-title and file (as more content of a plenary debate becomes available) so one needs to have a system that recognises updating of title and file.

As Code4SA needs to update this cosmetically, perhaps delay this until they finalise the formatting.

Current way they edit the hansard:

[image: image001]
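
A minimal sketch of the "recognises updating of title and file" requirement from PMG's note, in case it helps scope the work. It assumes each hansard has a stable numeric ID (as the URL https://pmg.org.za/hansard/21320/ suggests) and that we can get its current agenda-title and file contents; `HansardRecord`, `fingerprint` and `classify` are hypothetical names for illustration, not existing pombola code or anything in PMG's API.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class HansardRecord:
    """Hypothetical snapshot of one PMG hansard entry (not a real model)."""
    hansard_id: int
    title: str
    file_bytes: bytes


def fingerprint(record: HansardRecord) -> str:
    """Hash title and file together so a change to either one is detected."""
    digest = hashlib.sha256()
    digest.update(record.title.encode("utf-8"))
    digest.update(b"\x00")  # separator between the title and the file contents
    digest.update(record.file_bytes)
    return digest.hexdigest()


def classify(record: HansardRecord, seen: Dict[int, str]) -> str:
    """Return 'new', 'updated' or 'unchanged' and remember the fingerprint."""
    current = fingerprint(record)
    previous = seen.get(record.hansard_id)
    seen[record.hansard_id] = current
    if previous is None:
        return "new"
    return "unchanged" if previous == current else "updated"


if __name__ == "__main__":
    seen: Dict[int, str] = {}  # in practice this would be stored in the database
    draft = HansardRecord(21320, "Draft agenda-title", b"partial debate text")
    final = HansardRecord(21320, "Final agenda-title", b"full debate text")
    for record in (draft, draft, final):
        print(record.hansard_id, classify(record, seen))
```

Only records classified as "new" or "updated" would need to be re-parsed, so repeated scrapes of an unchanged hansard stay cheap.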

tmtmtmtm commented 8 years ago

> I suspect - but don't know - that PMG receive the files in the same word format as previously published on the Parliament site

Finding out if that is true or not is likely going to be key to how big a job this will be.

paullenz commented 8 years ago

Yup - we need to see the actual files that they receive

JenMysoc commented 8 years ago

They have sent through examples of the documents in word document format to our central address.

paullenz commented 8 years ago

Jen and I were chatting about this yesterday - the simplest solution would be to see if PMG could persuade parliament to start republishing the Hansards online