Representative personal_url is blank

opennorth / represent-canada

Point or postcode to electoral district service for Canada, its provinces and municipalities

http://represent.opennorth.ca/

MIT License


Representative personal_url is blank #126

Closed: icolwell closed this issue 1 year ago

icolwell commented 2 years ago

Hi There,

I noticed the personal_url field described here always appears to be an empty string. Here's an example: https://represent.opennorth.ca/representatives/?format=apibrowser

In the past, this field used to be populated. Does anyone know what changed? Thanks!

jpmckinney commented 2 years ago

Hi @icolwell, it became too demanding to fill in that field, so it is now empty. The old spiders used to merge in data from other sources. They now only collect data from the House of Commons website.

icolwell commented 2 years ago

Hi @jpmckinney, thanks for the quick reply!

Would you be able to point me to the old spiders that collected this data? I'm interested in either reviving the personal_url field, or at least learning how to get the same data for my own purposes.

Thanks!

jpmckinney commented 2 years ago

Hi, looking at https://github.com/opencivicdata/scrapers-ca/blob/5dc5627fd586e91d41d9d23c5acaedf846d935f1/ca/people.py, I think maybe the change was by the House of Commons website. They used to provide links to MPs' personal websites, but that seems to no longer be the case.

Until 2020 http://politwitter.ca kept a separate database of politicians' URLs, etc.

I think you would have to scrape each party's website, going through their list of MPs to get each website.

icolwell commented 2 years ago

@jpmckinney, thanks for the hints! It seems like the personal URLs are still stored somewhere on their site, since they seem available on the contact tab.

I'm not familiar with the lxml library, but maybe we just need to update that scraper to use

personal_url = mp_page.xpath('.//a[contains(@title, "Website")]/@href')

instead of

personal_url = mp_page.xpath('.//a[contains(@title, "Personal Web Site")]/@href')

It seems like the format of the page changed a bit from "Personal Web Site" to simply "Website".

icolwell commented 1 year ago

I remembered that the Wayback Machine exists, and found that the personal website links were indeed missing for a while and were later added back under the new title "Website".

Here there is no website link: 2020/06/12

Here there is one: 2020/08/07

This guy has had one since 2019, so it seems like they get added at different times.

jpmckinney commented 1 year ago

Aha, that should work. Can you open a pull request?

icolwell commented 1 year ago

Done.

jpmckinney commented 1 year ago

Thanks! Now deployed. Should appear in API within 24h.

icolwell commented 1 year ago

Hmm, I'm still not seeing anything for the personal_url field.

Here's an example query: https://represent.opennorth.ca/representatives/?format=apibrowser&limit=50

jpmckinney commented 1 year ago

Had you tested your changes locally? Only thing that occurs to me is that it fails to extract the URL.

jpmckinney commented 1 year ago

Hmm, yeah, so the HTML looks like this (https://www.ourcommons.ca/Members/en/simon-pierre-savard-tremblay(104944)):

<h4>Website</h4>
<p>
    <a href="http://www.spst.quebec">http://www.spst.quebec</a>
    (in French only)
</p>

But your code does:

personal_url = mp_page.xpath('.//a[contains(@title, "Website")]/@href')

So that's not going to extract it.
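A minimal sketch of why that selector comes back empty, assuming lxml is installed. The `SNIPPET` string is a stand-in for the relevant part of the member page, wrapped in a `<div>` for illustration; the real page has far more markup around it.

```python
from lxml import html

# Stand-in for the relevant part of the member page (the wrapping <div> is an assumption).
SNIPPET = """
<div>
  <h4>Website</h4>
  <p>
    <a href="http://www.spst.quebec">http://www.spst.quebec</a>
    (in French only)
  </p>
</div>
"""

mp_page = html.fromstring(SNIPPET)

# The selector from the pull request: it matches <a> elements by their title attribute.
by_title = mp_page.xpath('.//a[contains(@title, "Website")]/@href')

# Collect every title attribute found on a link in the snippet.
titles = mp_page.xpath('.//a/@title')

print(by_title)  # no matches
print(titles)    # the link carries no title attribute at all
```

The word "Website" lives in the text of the `<h4>`, not in a `title` attribute on the link, so a predicate on `@title` has nothing to match.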

icolwell commented 1 year ago

Ah ok, no, I mentioned in the PR description that I didn't test it. I figured someone more familiar with the code base could test it way faster than I could :smile:

Do you have any ideas on what would work? If not, I'll try to get some of this stuff installed locally, and learn about how the lxml syntax is supposed to work.

jpmckinney commented 1 year ago

You'd have to use an XPath selector that finds the h4, steps to its next sibling, and looks inside that sibling for @href.
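One way that selector could look, as an untested-against-the-live-site sketch assuming lxml is installed. The `SNIPPET` string and its wrapping `<div>` are stand-ins for the real member page, and falling back to an empty string when nothing is found is an assumption, not necessarily what the scraper does.

```python
from lxml import html

# Stand-in for the relevant part of the member page (the wrapping <div> is an assumption).
SNIPPET = """
<div>
  <h4>Website</h4>
  <p>
    <a href="http://www.spst.quebec">http://www.spst.quebec</a>
    (in French only)
  </p>
</div>
"""

mp_page = html.fromstring(SNIPPET)

# 1. Find the <h4> whose text is "Website".
# 2. Step to its first following sibling element (the <p> in this snippet).
# 3. Look inside that sibling for links and take their href attributes.
hrefs = mp_page.xpath(
    './/h4[normalize-space()="Website"]/following-sibling::*[1]//a/@href'
)

# xpath() returns a list; fall back to an empty string when there is no link (assumed behaviour).
personal_url = hrefs[0] if hrefs else ''

print(personal_url)
```

Using `following-sibling::*[1]` keeps the match tied to the element directly after the heading, so a link further down the page under a different heading would not be picked up by mistake.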