opennorth / represent-canada-data

Digital electoral boundary files for Canada, its provinces and municipalities
http://represent.opennorth.ca/

Site Scraper not grabbing address information correctly #46

Closed: elynch303 closed this issue 1 year ago

elynch303 commented 1 year ago

It looks like the site scraper grabs the address for some representatives but not others. For example, take the endpoint https://represent.opennorth.ca/postcodes/B0A1G0/?format=apibrowser. There you will see John White: his office entry did manage to pull the phone number from the source page, https://nslegislature.ca/members/profiles/john-white, but it failed to get the office's mailing address, which is available from the same source.

        {
            "email": "Johnwhitemla@outlook.com",
            "name": "John White",
            "url": "https://nslegislature.ca/members/profiles/john-white",
            "personal_url": "",
            "party_name": "Progressive Conservative Association of Nova Scotia",
            "related": {
                "boundary_url": "/boundaries/nova-scotia-electoral-districts-2019/glace-bay-dominion/",
                "representative_set_url": "/representative-sets/nova-scotia-legislature/"
            },
            "source_url": "https://nslegislature.ca/members/profiles",
            "representative_set_name": "Nova Scotia House of Assembly",
            "last_name": "White",
            "first_name": "John",
            "gender": "",
            "extra": {},
            "offices": [
                {
                    "type": "constituency",
                    "tel": "1 902 849-8930"
                }
            ],
            "photo_url": "https://nslegislature.ca/sites/default/files/styles/photo_thumbnail/public/mla-thumbnails/john_White_I2098_0.jpg?itok=V7RcCpT2",
            "district_name": "Glace Bay-Dominion",
            "elected_office": "MLA"
        },

Whereas others, like Amanda McDougall, do have their office mailing address available:

 {
            "email": "mayor@cbrm.ns.ca",
            "name": "Amanda McDougall",
            "url": "",
            "personal_url": "",
            "party_name": "",
            "related": {
                "boundary_url": "/boundaries/census-subdivisions/1217030/",
                "representative_set_url": "/representative-sets/cape-breton-regional-council/"
            },
            "source_url": "http://www.cbrm.ns.ca/mayor",
            "representative_set_name": "Cape Breton Regional Council",
            "last_name": "McDougall",
            "first_name": "Amanda",
            "gender": "",
            "extra": {},
            "offices": [
                {
                    "type": "legislature",
                    "postal": "320 Esplanade - Suite 400",
                    "tel": "1 902 563-5000"
                }
            ],
            "photo_url": "https://www.cbrm.ns.ca/media/system/images/arrow.png",
            "district_name": "Cape Breton",
            "elected_office": "Mayor"
        },
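A quick way to spot the gap is to check each representative's offices for a "postal" key. The sketch below uses trimmed copies of the two records above. The helper name offices_missing_postal is made up for illustration and is not part of the Represent API.

```python
# Trimmed copies of the two records quoted above (only the fields needed here).
representatives = [
    {
        "name": "John White",
        "offices": [{"type": "constituency", "tel": "1 902 849-8930"}],
    },
    {
        "name": "Amanda McDougall",
        "offices": [
            {
                "type": "legislature",
                "postal": "320 Esplanade - Suite 400",
                "tel": "1 902 563-5000",
            }
        ],
    },
]


def offices_missing_postal(reps):
    """Return names of representatives with no office carrying a non-empty postal address."""
    missing = []
    for rep in reps:
        if not any(office.get("postal") for office in rep.get("offices", [])):
            missing.append(rep["name"])
    return missing


print(offices_missing_postal(representatives))
```

With the two records above, only John White is reported as missing a mailing address.
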
jpmckinney commented 1 year ago

I would be happy to merge a pull request. The relevant file is in another repository: https://github.com/opencivicdata/scrapers-ca/blob/master/ca_ns/people.py

elynch303 commented 1 year ago

I tried to add this in a new branch so I could open an MR, but it will not let me push. I'm getting a permissions error:

ERROR: Permission to opencivicdata/scrapers-ca.git denied to elynch303.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

So I could not push the new branch, but this was the update for people.py in the ca_ns directory:

import re

from utils import CanadianPerson as Person
from utils import CanadianScraper

COUNCIL_PAGE = "https://nslegislature.ca/members/profiles"

class NovaScotiaPersonScraper(CanadianScraper):
    PARTIES = {
        "Liberal": "Nova Scotia Liberal Party",
        "PC": "Progressive Conservative Association of Nova Scotia",
        "NDP": "Nova Scotia New Democratic Party",
        "Independent": "Independent",
    }

    def scrape(self):
        page = self.lxmlize(COUNCIL_PAGE)
        members = page.xpath(
            '//div[contains(@class, "view-display-id-page_mlas_current_tiles")]//div[contains(@class, "views-row-")]'
        )  # noqa
        assert len(members), "No members found"
        for member in members:
            district = member.xpath('.//div[contains(@class, "views-field-field-constituency")]/div/text()')[0]
            party = member.xpath('.//span[contains(@class, "party-name")]/text()')[0]

            if party == "Vacant":
                continue

            detail_url = member.xpath(".//@href")[0]
            detail = self.lxmlize(detail_url)

            name = detail.xpath('//div[contains(@class, "views-field-field-last-name")]/div/h1/text()')[0]
            name = re.sub(r"(Honourable |\(MLA Elect\)|\(New MLA Elect\))", "", name)
            party = self.PARTIES[party.replace("LIberal", "Liberal")]

            p = Person(primary_org="legislature", name=name, district=district, role="MLA", party=party)
            p.image = detail.xpath('//div[contains(@class, "field-content")]//img[@typeof="foaf:Image"]/@src')[0]

            contact = detail.xpath('//div[contains(@class, "mla-current-profile-contact")]')[0]

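            # Assumes the second <p> in the contact block holds the constituency
            # mailing address; skip the contact if that paragraph is missing.
            address = contact.xpath("./p[2]")
            if address:
                lines = address[0].text_content().strip().splitlines()
                lines = [line.strip() for line in lines]
                p.add_contact("address", "\n".join(lines), "constituency")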

            email = self.get_email(contact, error=False)
            if email:
                p.add_contact("email", email)
            p.add_contact("voice", self.get_phone(contact, area_codes=[902]), "constituency")

            p.add_source(COUNCIL_PAGE)
            p.add_source(detail_url)

            yield p
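The address step can be tried without lxml or pupa installed. Below is a minimal sketch of the same line clean-up on a plain string. The function name normalize_address and the sample text are invented for illustration. Unlike the scraper code above, it also drops blank lines, which is an assumption about the desired output.

```python
def normalize_address(raw_text):
    """Strip each line of a text block and drop lines that end up empty."""
    lines = (line.strip() for line in raw_text.strip().splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample text, not taken from a real profile page.
sample = "\n   123 Example Street  \n\n   Sampletown, NS  \n   B0A 0A0\n"
print(normalize_address(sample))
```
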

PS: I was also not able to test this. When I run pupa update ca_ns I keep getting:

exception "cannot import name 'Mapping' from 'collections' (/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/collections/__init__.py)" prevented loading of pupa.cli.commands.update module
usage: pupa [-h] [--debug] [--loglevel LOGLEVEL] {init,dbinit} ...
pupa: error: argument subcommand: invalid choice: 'update' (choose from 'init', 'dbinit')
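The import error above is consistent with Python 3.10 having removed the deprecated aliases such as collections.Mapping; the abstract base classes are now available only from collections.abc. A minimal sketch of the difference, independent of pupa:

```python
import collections
import collections.abc
import sys

# On Python 3.10+ the old alias is gone, so this prints False there;
# older Python 3 versions still expose it (with a deprecation warning).
has_old_alias = hasattr(collections, "Mapping")
print(sys.version_info[:2], has_old_alias)

# collections.abc.Mapping is the supported location on every Python 3 version.
from collections.abc import Mapping

print(isinstance({}, Mapping))
```

Whether the right fix is upgrading the package that performs the old import or running under an older Python depends on the installed versions, which the error output does not show.
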
elynch303 commented 1 year ago

PS: I would want to update the other MPP/MLA (provincial government member) addresses as well, so if the permissions to the repo are fixed I could add this as one larger MR.

jpmckinney commented 1 year ago

The way it works on GitHub: You need to make a fork, push to your fork, and then make the pull request.