icolwell closed this issue 1 year ago
Hi @icolwell, it became too demanding to fill in that field, so it is now empty. The old spiders used to merge in data from other sources. They now only collect data from the House of Commons website.
Hi @jpmckinney, thanks for the quick reply!
Would you be able to point me to the old spiders that collected this data?
I'm interested in either reviving the personal_url field, or at least learning how to get the same data for my own purposes.
Thanks!
Hi, looking at https://github.com/opencivicdata/scrapers-ca/blob/5dc5627fd586e91d41d9d23c5acaedf846d935f1/ca/people.py, I think the change may have been on the House of Commons website's side. They used to provide links to MPs' personal websites, but that seems to no longer be the case.
Until 2020 http://politwitter.ca kept a separate database of politicians' URLs, etc.
I think you would have to scrape each party's website, going through their list of MPs to get each website.
@jpmckinney, thanks for the hints! It seems like the personal URLs are still stored somewhere on their site, since they seem available on the contact tab.
I'm not familiar with the lxml library, but maybe we just need to update that scraper to use
personal_url = mp_page.xpath('.//a[contains(@title, "Website")]/@href')
instead of
personal_url = mp_page.xpath('.//a[contains(@title, "Personal Web Site")]/@href')
It seems like the format of the page changed a bit from "Personal Web Site" to simply "Website".
I remembered that the Wayback Machine exists and found out that the personal websites were indeed missing for a while and were later added back under the new title "Website".
Here there is no website link: 2020/06/12
Here there is one: 2020/08/07
This MP has had one since 2019, so it seems like they get added at different times.
Aha, that should work. Can you open a pull request?
Done.
Thanks! Now deployed. Should appear in API within 24h.
Hmm, I'm still not seeing anything for the personal_url field.
Here's an example query: https://represent.opennorth.ca/representatives/?format=apibrowser&limit=50
Had you tested your changes locally? Only thing that occurs to me is that it fails to extract the URL.
Hmm, yeah, so the HTML code looks like (https://www.ourcommons.ca/Members/en/simon-pierre-savard-tremblay(104944)):
<h4>Website</h4>
<p>
<a href="http://www.spst.quebec">http://www.spst.quebec</a>
(in French only)
</p>
But your code does:
personal_url = mp_page.xpath('.//a[contains(@title, "Website")]/@href')
So that's not going to extract it.
Ah ok, no, I mentioned in the PR description that I didn't test it. I figured someone more familiar with the code base could test it way faster than I could :smile:
Do you have any ideas on what would work? If not, I'll try to get some of this stuff installed locally, and learn about how the lxml syntax is supposed to work.
You'd have to use an XPath selector that finds the h4, then gets the next sibling, and looks inside it for @href.
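For anyone following along, here is a minimal sketch of that kind of selector, assuming the page markup matches the snippet quoted above. The helper name extract_personal_url and the wrapping div in the sample HTML are made up for illustration and are not taken from scrapers-ca. It has only been checked against this one sample, not the live site.

```python
from lxml import html

# Sample markup copied from the snippet quoted in this thread; the outer <div>
# is an assumption so the fragment has a single root element.
SAMPLE = """
<div>
  <h4>Website</h4>
  <p>
    <a href="http://www.spst.quebec">http://www.spst.quebec</a>
    (in French only)
  </p>
</div>
"""


def extract_personal_url(page):
    """Return the first link under the "Website" heading, or '' if none.

    Hypothetical helper: finds the <h4> whose text is "Website", steps to the
    first <p> sibling that follows it, and reads the href of any <a> inside.
    """
    hrefs = page.xpath(
        './/h4[normalize-space()="Website"]/following-sibling::p[1]//a/@href'
    )
    return hrefs[0] if hrefs else ''


mp_page = html.fromstring(SAMPLE)
print(extract_personal_url(mp_page))
```

The title-based selector from the PR finds nothing on this markup because the a element has no title attribute, which would explain the empty field.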
Hi There,
I noticed the personal_url field described here always appears to be an empty string. Here's an example: https://represent.opennorth.ca/representatives/?format=apibrowser
In the past, this field used to be populated. Does anyone know what changed? Thanks!