mysociety / caps

A simple, open database of local government climate action plan documents and emissions data.
https://cape.mysociety.org

Create a Google Programmable Search Engine to find references to Scorecards project in council websites/minutes #649

Closed zarino closed 4 months ago

zarino commented 5 months ago

Being able to quickly find references to the Climate Action Scorecards in council meeting minutes would help us track real-world usage of the project by councillors / council officers.

Programmable Search Engines (previously Custom Search Engines, or CSEs) are Google’s way of creating a narrowed-down search corpus for a Google search. You can use a PSE to search across a few hundred domains in a single go, without having to list every single domain with a site: operator in your query.

I’ve created a proof-of-concept PSE that searches across 17 councils’ websites. Here it is, performing a search for "climate action scorecards":

https://cse.google.com/cse?cx=a4085f652f7144aac#gsc.tab=0&gsc.q=%22climate%20action%20scorecards%22&gsc.sort=
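For anyone wanting to link to other searches on the same engine, here is a minimal sketch of how a link like the one above could be assembled. The function name `pse_search_url` is hypothetical, and the `#gsc.tab=0&gsc.q=…&gsc.sort=` fragment layout is simply copied from that link rather than taken from any documented API, so treat it as an assumption.

```python
from urllib.parse import quote


def pse_search_url(cx, query):
    """Build a link to the hosted results page of the engine with ID `cx`."""
    # Fragment layout mirrors the link in this comment; it is an observed
    # pattern, not a documented interface, and may change.
    return (
        f"https://cse.google.com/cse?cx={cx}"
        f"#gsc.tab=0&gsc.q={quote(query)}&gsc.sort="
    )


# Quoting the phrase asks for an exact-phrase match.
url = pse_search_url("a4085f652f7144aac", '"climate action scorecards"')
print(url)
```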

The Annotations.xml file uploaded as part of the config for this PSE was:

<Annotations>
    <Annotation about="www.bcpcouncil.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.cambspboroca.org/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.folkestone-hythe.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.greatermanchester-ca.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.liverpoolcityregion-ca.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.northeastca.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.northnorthants.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.oadby-wigston.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="southyorkshire-ca.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.somersetwestandtaunton.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="teesvalley-ca.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="westmidlandscombinedauthority.org.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.westnorthants.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.westofengland-ca.org.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.westsuffolk.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.westyorks-ca.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.london.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.folkestone-hythe.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
    <Annotation about="www.somersetwestandtaunton.gov.uk/*">
        <Label name="_include_" />
    </Annotation>
</Annotations>

What we now need is a list of every domain that every council uses. Hmmm… 🤔

This won’t necessarily just be their user-facing top-level domain. Looking at the URLs of plans in CAPE, it’s clear that some councils use a third-party CMS, on a different domain to their user-facing website, e.g. baberghmidsuffolk.moderngov.co.uk/documents/… instead of www.babergh.gov.uk/…, or bbcdevwebfiles.blob.core.windows.net/webfiles/… instead of www.bedford.gov.uk/….

The LGA used to publish a list of council domains, but stopped a few years ago. Sadly CAPE holds very few domains now.

We could potentially extract domains for councils from the plan URLs and net zero commitment URLs in CAPE. The advantage here is that this will likely include all those random third-party CMS domains that councils publish their stuff under, rather than their "official" domains. The disadvantage is that it might not include their official domains 😉
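A rough sketch of that extraction, assuming the plan and commitment URLs have already been pulled out of CAPE into a plain list of strings. The function name `extract_domains` and the sample paths are hypothetical (only the hostnames come from the examples above); how the URLs are actually read from CAPE is not shown.

```python
from urllib.parse import urlsplit


def extract_domains(urls):
    """Return the sorted, de-duplicated hostnames found in `urls`."""
    domains = set()
    for url in urls:
        # urlsplit only recognises a hostname when a scheme or a leading
        # "//" is present, so add one for bare "www.example.gov.uk/..." values.
        candidate = url if "//" in url else "//" + url
        host = urlsplit(candidate).hostname  # already lowercased, or None
        if host:
            domains.add(host)
    return sorted(domains)


# Hypothetical sample: the hostnames are real, the paths are placeholders.
sample_urls = [
    "https://baberghmidsuffolk.moderngov.co.uk/documents/example.pdf",
    "https://www.babergh.gov.uk/example",
    "https://bbcdevwebfiles.blob.core.windows.net/webfiles/example.pdf",
    "https://www.babergh.gov.uk/another-example",
    "www.bedford.gov.uk/example",
]
print(extract_domains(sample_urls))
```

Keeping the full hostname, rather than trimming to a registered domain, is deliberate here: it preserves the third-party CMS hosts, which is the main advantage of this approach.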

kipparker commented 5 months ago

I’ve done some work on extracting and checking all the URLs from the scorecard data. There are around 3,000 in there, but if I filter for .gov.uk domains we get around 900. Could be worth testing with that list? And maybe add in the moderngov.co.uk addresses (another 95).
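As an illustration of that filter, here is a small sketch that keeps only hostnames ending in .gov.uk or .moderngov.co.uk. The function name `select_council_domains` and the sample list are hypothetical, and this is not necessarily how the counts above were produced.

```python
def select_council_domains(domains, suffixes=(".gov.uk", ".moderngov.co.uk")):
    """Keep only hostnames that end with one of `suffixes` (case-insensitive)."""
    # str.endswith accepts a tuple, so one call covers every suffix.
    return sorted({d.lower() for d in domains if d.lower().endswith(suffixes)})


# Hypothetical sample built from hostnames mentioned in this thread.
all_domains = [
    "www.babergh.gov.uk",
    "baberghmidsuffolk.moderngov.co.uk",
    "bbcdevwebfiles.blob.core.windows.net",
    "www.cambspboroca.org",
]
print(select_council_domains(all_domains))
```

Note that a suffix filter like this drops councils and combined authorities on other domains, such as www.cambspboroca.org in the proof-of-concept list.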

zarino commented 5 months ago

@kipparker We might actually be able to use all 3000 – the only limits I can see on Programmable Search Engines are that the Annotations.xml file has to be less than 30KB, and you can only have 5000 annotations per engine. Maybe our 3000ish annotations would fit under the 30KB? 🤔

kipparker commented 5 months ago

@zarino Definitely worth testing. I should be able to output the domains in the XML format quite easily, and I'll be able to have a look later this week.
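A minimal sketch of what that output step might look like, following the layout of the Annotations.xml posted earlier in this thread. The names `build_annotations` and `within_limits` are hypothetical, and the two limits are taken as quoted above (30KB file size, 5000 annotations), not checked against Google's documentation; 30KB is assumed to mean 30 × 1024 bytes.

```python
from xml.sax.saxutils import quoteattr

# Limits as quoted in this thread; treat both as assumptions.
MAX_BYTES = 30 * 1024
MAX_ANNOTATIONS = 5000


def build_annotations(domains):
    """Return (xml_text, annotation_count) for the de-duplicated `domains`."""
    unique = sorted(set(domains))
    lines = ["<Annotations>"]
    for domain in unique:
        # quoteattr adds the surrounding quotes and escapes &, < and >.
        lines.append(f"    <Annotation about={quoteattr(domain + '/*')}>")
        lines.append('        <Label name="_include_" />')
        lines.append("    </Annotation>")
    lines.append("</Annotations>")
    return "\n".join(lines) + "\n", len(unique)


def within_limits(xml_text, annotation_count):
    """Check the generated file against the limits quoted in the thread."""
    size = len(xml_text.encode("utf-8"))
    return size < MAX_BYTES and annotation_count <= MAX_ANNOTATIONS


text, count = build_annotations(
    ["www.bcpcouncil.gov.uk", "teesvalley-ca.gov.uk", "www.bcpcouncil.gov.uk"]
)
print(text)
print(count, len(text.encode("utf-8")), within_limits(text, count))
```

De-duplicating before writing keeps repeated domains from using up the annotation allowance, and printing the byte size gives a quick answer to the "would it fit" question before uploading.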

kipparker commented 5 months ago

@zarino It does accept the whole list of URLs. I created another search engine at https://cse.google.com/cse?cx=279b9e71f09d945f7 with the full list of domains. Results seem pretty good, e.g. "climate action plan guildford" on custom search vs the same on Google.

zarino commented 4 months ago

CE UK tells us this is no longer required.