Open JenMysoc opened 8 years ago
@JenMysoc the email from PMG says "we suggest that [Hansard on People’s Assembly] links directly to the PMG website Hansard" — I read that as saying we should replace the link in the header bar:
to go to https://pmg.org.za/hansards/ instead of http://www.pa.org.za/hansard, rather than to change our own scraping?
I raised this with them - this change would break all of the person linking etc. on the person pages, so re-scraping is the preferred option
Yeah sorry that was my bad wording. What I meant was re-scraping the hansard from PMG's website instead of scraping from the official site as PMG's database is more up to date.
Do we know if this is effectively the same format of documents as before, and we'd just be changing where we get them from, but could run the existing parser against them? If they'd be a different format, then this is a substantially larger job.
I have a todo item to look at this very thing - realistically going to be tomorrow now though as lots of meetings today
So the documents are in different formats - https://pmg.org.za/hansard/21320/
vs.
http://www.parliament.gov.za/live/commonrepository/Processed/20151207/613254_1.doc
Though there is an API - https://api.pmg.org.za/hansard/
I suspect - but don't know - that PMG receive the files in the same word format as previously published on the Parliament site
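For anyone sizing the API route mentioned above, here is a rough sketch of walking a paginated JSON listing in Python. The response shape (a `results` list plus a `next` URL) is an assumption and has not been checked against what https://api.pmg.org.za/hansard/ actually returns. `iter_hansards` and `fetch_json` are hypothetical names, not part of the existing scraper.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and decode the body as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def iter_hansards(start_url, fetch=fetch_json, max_pages=100):
    """Yield items from a paginated JSON listing.

    ASSUMPTION: each page looks like {"results": [...], "next": <url or null>}.
    Adjust the key names once the real response has been inspected.
    """
    url = start_url
    pages = 0
    # max_pages is a safety cap so a bad "next" link cannot loop forever.
    while url and pages < max_pages:
        page = fetch(url)
        yield from page.get("results", [])
        url = page.get("next")
        pages += 1
```

Passing `fetch` in as a parameter means the paging logic can be exercised with canned pages, without hitting the live API.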
From PMG: You can find [examples of] this at https://pmg.org.za/hansards/ Just 2 points to clarify: Code4SA will be making cosmetic changes so there are two columns to show NA and NCOP (EPC will fall under NA; Joint Sittings (JS) will appear in both NA and NCOP). Also, note how we provide an agenda-title for each hansard debate. We often make changes to both agenda-title and file (as more content of a plenary debate becomes available) so one needs to have a system that recognises updating of title and file. As Code4SA needs to update this cosmetically perhaps delay this until they finalise the formatting.
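PMG's point about recognising updates to both title and file could be handled by fingerprinting each record and comparing against the previous scrape. A minimal sketch in Python, assuming each hansard has a stable identifier; the `HansardRecord` fields and the function names are hypothetical and not taken from PMG's schema or our current code:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class HansardRecord:
    # Hypothetical record shape; field names are assumptions, not PMG's schema.
    source_id: str
    title: str
    file_bytes: bytes


def fingerprint(record):
    """Hash the title and file together so a change to either is detected."""
    digest = hashlib.sha256()
    digest.update(record.title.encode("utf-8"))
    digest.update(b"\x00")  # separator between the title and the file content
    digest.update(record.file_bytes)
    return digest.hexdigest()


def classify(record, seen):
    """Return 'new', 'updated' or 'unchanged' and remember the latest fingerprint.

    `seen` maps source_id -> fingerprint recorded on the previous scrape.
    """
    current = fingerprint(record)
    previous = seen.get(record.source_id)
    seen[record.source_id] = current
    if previous is None:
        return "new"
    return "unchanged" if previous == current else "updated"
```

Anything classified as `updated` would need re-parsing, replacing the earlier import rather than creating a duplicate. In practice `seen` would live in the database instead of an in-memory dict.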
Current way they edit the hansard:
> I suspect - but don't know - that PMG receive the files in the same word format as previously published on the Parliament site
Finding out if that is true or not is likely going to be key to how big a job this will be.
Yup - we need to see the actual files that they receive in
They have sent through examples of the documents in Word document format to our central address.
Jen and I were chatting about this yesterday - the simplest solution would be to see if PMG could persuade parliament to start republishing the Hansards online
The official website hasn't been updated since February but PMG have someone who is compiling the Hansard for them so their records are more up to date. It would therefore be good to switch the scraping to the PMG hansard.