Open mhl opened 10 years ago
Hi,
Just to let you know, I'm having a go at this. I've not written a scraper before, so it might take me a day or two. How do I submit the script?
Neil
I've had a bash at it. This is an example of the output I get at the moment:
<?xml version="1.0" encoding="utf-8"?>
<publicwhip>
<ques id="S4W-17949" speakerid="" speakername=" Jackson Carlaw, West Scotland, Scottish Conservative and Unionist Party" url="">To ask
the Scottish Government, further to the answer to question S4W-17358 by
John Swinney on 3 October 2013, how many of the Government Car Service
vehicles listed have been (a) purchased outright and (b) leased.</ques>
<reply id="S4W-17949" speakerid="" speakername=""></reply>
<ques id="S4W-17946" speakerid="" speakername=" Anne McTaggart, Glasgow, Scottish Labour" url="">To ask the Scottish
Government how many cases of female genital mutilation have been
reported by (a) midwives and (b) other health professionals, broken down
by (i) country of origin of patient and (ii) NHS board.</ques>
<reply id="S4W-17946" speakerid="" speakername=""></reply>
</publicwhip>
A few things I'm not sure about:
Do I just change the date? What about the nospeaker attribute?
This is the code I have up to now. Let me know if you'd like me to carry on with this or not. Not sure if it's any use to you.
import urllib
import urllib2
import re
from bs4 import BeautifulSoup
from xml.dom.minidom import Document
import pprint
import datetime

def scrape():
    the_page = open("scrapetest.html", "r")
    soup = BeautifulSoup(the_page, "html5lib")
    parsedQuestions = []
    for question in soup.findAll('tr', {'id' : re.compile('MAQA_Search_gvResults.*')}):
        q = Question()
        h = question.find('div', {'id' : re.compile('.*pnlQuestionHeader')})
        if h:
            q.header = h.find('span', {'id' : re.compile('MAQA_Search_gvResults.*')}).string
            if q.header:
                q.dateLodged = re.search('Date Lodged: (.*)', q.header).group(1)
                #print "date : " + q.dateLodged
                quesID = re.search('Question (.*?):', q.header)
                if quesID:
                    q.quesID = quesID.group(1)
                speakerName = re.search(':(.*), Date Lodged', q.header)
                if speakerName:
                    q.speakerName = speakerName.group(1)
        title = question.find('span', {'id' : re.compile('.*lblQuestionTitle')})
        if title:
            q.title = unicode(title.p.string)
        replyText = question.find('span', {'id' : re.compile('.*lblAnswerText')})
        if replyText:
            q.replyText = unicode(replyText.p.string)
        answeredBy = question.find('span', {'id' : re.compile('.*lblAnswerByMSP')})
        if answeredBy:
            q.answeredBy = unicode(answeredBy.string)
        answerDate = question.find('span', {'id' : re.compile('.*lblAnswerDate')})
        if answerDate:
            q.answerDate = unicode(answerDate.string)
        qStatus = question.find('span', {'id' : re.compile('.*lblQuestionStatus')})
        if qStatus:
            q.qStatus = unicode(qStatus.string)
        parsedQuestions.append(q)
    printXml(parsedQuestions)

def printXml(parsedQuestions):
    for date, qList in groupByDate(parsedQuestions):
        doc = Document()
        base = doc.createElement('publicwhip')
        doc.appendChild(base)
        for q in qList:
            #The question
            ques = doc.createElement('ques')
            ques.setAttribute("id", q.quesID)
            ques.setAttribute("speakerid", "")
            ques.setAttribute("speakername", q.speakerName)
            ques.setAttribute("url", "")
            base.appendChild(ques)
            qText = doc.createTextNode(q.title)
            ques.appendChild(qText)
            #the reply
            reply = doc.createElement('reply')
            reply.setAttribute("id", q.quesID)
            reply.setAttribute("speakerid", "")
            reply.setAttribute("speakername", q.answeredBy)
            base.appendChild(reply)
            replyText = doc.createTextNode(q.replyText)
            reply.appendChild(replyText)
        date = datetime.datetime.strptime(date, '%d/%m/%Y')
        newDate = date.strftime('%Y-%m-%d')
        f = open(newDate + '.xml', 'w')
        f.write( doc.toprettyxml(indent=" ", encoding="utf-8") )
        f.close()

def groupByDate(questions):
    groupedQs = {}
    for q in questions:
        if q.dateLodged in groupedQs:
            groupedQs[q.dateLodged].append(q)
        else:
            groupedQs[q.dateLodged] = [q]
    return groupedQs.iteritems()

class Question:
    quesID = ""
    dateLodged = ""
    header = ""
    speakerName = ""
    title = ""
    motionText = ""
    answeredBy = ""
    replyText = ""
    answerDate = ""
    qStatus = ""

if __name__ == "__main__":
    scrape()
Just to let you know, I'm making progress on this. Off to Wales now. Will try to get it finished when I get back at the start of next week.
@nrhorner: Thanks for this contribution, and sorry for my delay in replying about those points - TheyWorkForYou is a largely spare-time project for me at the moment, even though I'm working for @mysociety :smiley:
How do I get the speaker ID?
You can see an example of resolving names in pyscraper/sp/parse-official-reports-new.py. That's probably over the top for what you need for the written answers, though - essentially you can, with the current directory being pyscraper/sp/, just import memberList:
from resolvemembernames import memberList
... and get a list of possible speaker IDs in response to passing in a name and a date:
>>> memberList.match_whole_speaker('Alexander, Ms Wendy (Paisley North) (Lab)', '2010-05-12')
[u'uk.org.publicwhip/member/80281']
>>> memberList.match_whole_speaker('Wendy Alexander', '2010-05-12')
[u'uk.org.publicwhip/member/80281']
>>> memberList.match_whole_speaker('Joe Bloggs', '2013-11-04')
[]
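As a rough sketch of how that result list might be turned into a value for the speakerid attribute: the helper name pick_speaker_id below is hypothetical and not part of resolvemembernames, and it only assumes that match_whole_speaker returns a list of ID strings, as in the session above. Leaving the attribute blank for zero or several candidates is just one possible policy.

```python
def pick_speaker_id(matches):
    """Return the single matched ID, or '' if the match is missing or ambiguous.

    `matches` is assumed to be the list returned by
    memberList.match_whole_speaker(name, date).
    """
    unique = sorted(set(matches))
    if len(unique) == 1:
        return unique[0]
    # No match, or more than one candidate: leave speakerid blank
    return ""
```

It could be called as pick_speaker_id(memberList.match_whole_speaker(name, iso_date)), with name and iso_date taken from the scraped question header.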
Your next question was:
How to properly format ques and reply IDs. Do I just change the date and keep the url the same?
The url attribute should be the URL that this information was scraped from - i.e. its original source. Question IDs are of the form:
uk.org.publicwhip/spwa/2011-03-17.S3W-40266.q0
... where the ISO 8601 date is the date that the question was lodged (extracted from the scraped page), the S3W-40266 bit is the Scottish Parliament's ID for that question, and the q0 indicates that it's the first question for that ID. (Sometimes you get multiple questions rolled together and given one answer, in which case the questions' suffixes go q0, q1, etc. Replies to a question similarly have the suffix r0, r1, etc.)
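A small sketch of building IDs in that form from the values the scraper extracts. The function names are hypothetical, and the DD/MM/YYYY input format is an assumption based on the "Date Lodged" text on the search results page. It is written for Python 3, though the same calls exist in Python 2.

```python
from datetime import datetime

ID_PREFIX = "uk.org.publicwhip/spwa/"

def iso_date(date_lodged):
    """Convert a scraped 'DD/MM/YYYY' date to ISO 8601 'YYYY-MM-DD'."""
    return datetime.strptime(date_lodged, "%d/%m/%Y").strftime("%Y-%m-%d")

def question_id(date_lodged, sp_id, index=0):
    """ID for the index-th question under a Scottish Parliament ID."""
    return "%s%s.%s.q%d" % (ID_PREFIX, iso_date(date_lodged), sp_id, index)

def reply_id(date_lodged, sp_id, index=0):
    """ID for the index-th reply under a Scottish Parliament ID."""
    return "%s%s.%s.r%d" % (ID_PREFIX, iso_date(date_lodged), sp_id, index)
```

For instance, question_id("17/03/2011", "S3W-40266") gives the example ID quoted above.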
The <major-heading> element is typically for a date-specific title, e.g. "Written Answers Thursday 17 March 2011", and has an ID of the form: 2011-03-17.0.mh (the mh suffix stands for "major heading", and the number beforehand should increment for later major headings from that source). Typically there's only one, though, I think.
The <minor-heading> element is for the subject of the question or questions, and is specific to a Scottish Parliament ID, e.g. uk.org.publicwhip/spwa/2011-03-17.S3W-40456.h
That's again the ISO 8601 date that the question was lodged, followed by the Scottish Parliament ID, followed by an h suffix (for "heading").
Not sure how to format the major and minor headings as in the below snippet [...] Do I just change the date. What about the nospeaker attribute?
Hopefully this is explained above. The major and minor headings for written questions / answers should always have nospeaker="True", but the <ques> and <reply> elements should have a speakername attribute and if possible a speakerid.
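To make the attribute rules concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree. The helper names are hypothetical, the ID forms are copied from the examples in this comment, and the heading and speaker text in the usage lines are placeholders rather than real data.

```python
import xml.etree.ElementTree as ET

ID_PREFIX = "uk.org.publicwhip/spwa/"

def add_major_heading(root, iso_date, text, index=0):
    """Append a <major-heading>; headings always carry nospeaker="True"."""
    attrs = {"id": "%s.%d.mh" % (iso_date, index), "nospeaker": "True"}
    el = ET.SubElement(root, "major-heading", attrs)
    el.text = text
    return el

def add_minor_heading(root, iso_date, sp_id, text):
    """Append a <minor-heading> for one Scottish Parliament ID."""
    attrs = {"id": "%s%s.%s.h" % (ID_PREFIX, iso_date, sp_id), "nospeaker": "True"}
    el = ET.SubElement(root, "minor-heading", attrs)
    el.text = text
    return el

def add_ques(root, iso_date, sp_id, speakername, text, speakerid=""):
    """Append a <ques>; it has speakername and, if resolved, speakerid."""
    attrs = {
        "id": "%s%s.%s.q0" % (ID_PREFIX, iso_date, sp_id),
        "speakername": speakername,
        "speakerid": speakerid,
    }
    el = ET.SubElement(root, "ques", attrs)
    el.text = text
    return el

# Usage with placeholder text
root = ET.Element("publicwhip")
add_major_heading(root, "2011-03-17", "Written Answers Thursday 17 March 2011")
add_minor_heading(root, "2011-03-17", "S3W-40456", "Placeholder subject")
add_ques(root, "2011-03-17", "S3W-40456", "Placeholder MSP", "To ask ...")
print(ET.tostring(root, encoding="unicode"))
```

A <reply> element would follow the same pattern as add_ques, with an r0 suffix in place of q0.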
This is all based on the parlparse XML schema documentation here:
... see the written answers section.
I hope that's of some use - please let me know if that's unclear, it's been quite a while since I worked on the earlier version!
Great! Thanks for that. I'll get on with it this week.
@mhl: I'm making progress on this. Just a few things I need to know.
You mentioned that questions and answers sometimes get rolled-up into one. I can't find an example of this, so not sure how I would parse it.
Not sure where to get the minor heading subject text. For example here is a question:
Question S4W-01234: Willie Coffey, Kilmarnock and Irvine Valley, Scottish National Party, Date Lodged: 24/06/2011
To ask the Scottish Executive how much of the funding provided to (a) sportscotland and (b) sport governing bodies in each year since 1999 was awarded (i) without competition and (ii) following a competitive bid process.
Occasionally I get multiple possible IDs from resolvemembernames. What do I do then?
And finally, I can't get lxml.etree to write extra lines between elements as in the current XML files. Is this required? If so, I can just manually create the XML as it's a very simple format.
I found in pyscraper/sp/parse-official-reports-new.py how to deal with multiple IDs.
Hi @nrhorner - thanks for the update - I've tried to address your questions below.
It looks as though the Scottish Parliament website no longer lumps together questions, but attaches the same answer to each one. For example, here was an old example of two questions with one answer:
Now that appears as two questions in the search:
So I think you probably don't need to worry about that.
It seems as if there's no longer a useful title for the minor-heading text, so I'd just make it "Question S4W-01234".
It might be worth mentioning here some of the name and date pairings that you get multiple possible IDs for - it seems that they should really be unambiguous.
Finally, don't worry about the extra lines between elements - lxml.etree's pretty-printed output is fine.
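For what it's worth, a sketch of pulling the fields out of a header line like the one quoted earlier, and deriving the fallback minor-heading text from it. The regular expression and the parse_header name are assumptions based only on that single example header, so real pages may need something looser.

```python
import re

HEADER_RE = re.compile(
    r"Question (?P<sp_id>[^:]+): (?P<speaker>.*), "
    r"Date Lodged: (?P<date>\d{2}/\d{2}/\d{4})"
)

def parse_header(header):
    """Split a question header into its parts, or return None if it doesn't match."""
    m = HEADER_RE.search(header)
    if m is None:
        return None
    sp_id = m.group("sp_id")
    return {
        "sp_id": sp_id,
        "speaker": m.group("speaker").strip(),
        "date_lodged": m.group("date"),
        # No subject title is available, so fall back to the question ID
        "minor_heading": "Question " + sp_id,
    }
```

The speaker value still includes the constituency and party, which may or may not need trimming before name resolution.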
@mhl: Issues above resolved, thanks. I've issued a pull request for the script.
Although we now have debates in the main chamber of the Scottish Parliament up on TheyWorkForYou again, another feature that was lost in the redesign of the parliament website was the written questions and answers.
Questions in the Scottish Parliament have a unique code, which can be used directly in a URL - for example, S4W-1234 can be found here:
http://www.scottish.parliament.uk/parliamentarybusiness/28877.aspx?SearchType=Advance&ReferenceNumbers=S4W-1234
They can also be searched for by date range on this page:
http://www.scottish.parliament.uk/parliamentarybusiness/28877.aspx?SearchType=Advance
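As an illustration, the per-question URL shown above could be built like this. The question_url name is hypothetical, and only the two query parameters visible in that URL are used; the parameters for the date-range search are not shown here, so they are left out. This is Python 3 (in Python 2, urlencode lives in the urllib module instead).

```python
from urllib.parse import urlencode

SEARCH_URL = "http://www.scottish.parliament.uk/parliamentarybusiness/28877.aspx"

def question_url(reference):
    """Build the advanced-search URL for one question reference, e.g. 'S4W-1234'."""
    query = urlencode([("SearchType", "Advance"), ("ReferenceNumbers", reference)])
    return SEARCH_URL + "?" + query
```

The result is also what would go in the url attribute of the corresponding ques element, since that is the page the data was scraped from.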
For reference, the XML generated by the old written answers parser is here:
http://ukparse.kforge.net/svn/parldata/scrapedxml/sp-written/