lindsaykatz / hansard-proj

Materials for the Digitization of the Australian Parliamentary Debates (1998-2022).

Question: joining on debate topic by page number #5

Closed by nicholasamiller 5 months ago

nicholasamiller commented 7 months ago

Hi Lindsay,

Thank you for putting together this excellent project; it looks like a lot of painstaking work.

I aim to get the debate topic for each row in the hansard-daily records. It looks like you recommend joining on page number with the debate topics. Intuitively, this seems like it will not work, because a topic can span multiple pages and a single page can contain multiple topics.

Your paper says this: "Note that in some cases there were multiple page number child nodes for the same debate or sub-debate title, likely due to manual transcription error. Upon manual inspection, we found that most often, the second page number node contained the same page number as the first node, and sometimes the second node contained a repeated debate title or a timestamp."

I inspected a few XML files and did not find this to be the case, however. Is there something I am missing here?

Is there a reason that the debate and sub-debate topics in a parent XML element are not simply added as columns in the hansard-daily dataset?

Thanks, Nick

lindsaykatz commented 5 months ago

Hi Nick,

Thank you very much for your message, and my apologies for the late reply!

Regarding the cases of multiple page number nodes we mention in the paper, I have found a couple of examples to share. They are both from the 2005-05-12 Hansard transcript. You can see in the side-by-side screenshots below that there are two <page.no> nodes in each. In one case the second node contains a sub-debate topic ("Biofuels"), and in the other the second node contains a timestamp.

[Screenshots: 2005-05-12 Hansard transcript XML, side by side]
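For anyone who wants to look for these cases in other sitting days, here is a minimal sketch that lists every element with more than one <page.no> child. Only the <page.no> tag name comes from this thread. The surrounding element names and all values in SAMPLE are made up for illustration, and the helper name find_repeated_page_nodes is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration. Only the <page.no> tag is taken from the
# thread; the other element names and all values are assumptions.
SAMPLE = """
<hansard>
  <debate>
    <debateinfo>
      <title>Example debate</title>
      <page.no>1</page.no>
    </debateinfo>
    <subdebate.1>
      <subdebateinfo>
        <title>Biofuels</title>
        <page.no>2</page.no>
        <page.no>Biofuels</page.no>
      </subdebateinfo>
    </subdebate.1>
  </debate>
</hansard>
"""


def find_repeated_page_nodes(xml_text):
    """Return (parent tag, [page.no texts]) for parents with 2+ <page.no> children."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for parent in root.iter():
        # Look only at direct children so each parent is reported once.
        pages = [(child.text or "").strip()
                 for child in parent if child.tag == "page.no"]
        if len(pages) > 1:
            hits.append((parent.tag, pages))
    return hits


for tag, pages in find_repeated_page_nodes(SAMPLE):
    print(tag, pages)
```

Running the same function over a real transcript file (read into a string) should show whether a given day contains any repeated nodes, and whether the second node holds a page number, a title, or a timestamp.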

We have not incorporated this variable directly into the dataset for two reasons. First, the XML tree structure of the Hansard debates makes it difficult to accurately attribute every statement and speech to its corresponding debate topic. Second, this attribution was not part of our initial parsing approach and code. We therefore decided to provide the debate topics separately, as a supplement to our main database.
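As a rough illustration of what tree-based attribution could look like, the sketch below walks a simplified tree and carries the enclosing debate and sub-debate titles down to each speech. It is not the project's parsing code. The element names (debate, debateinfo, subdebate.1, subdebateinfo, speech, talk.text) are assumptions about the layout, the titles and speech text are invented, and real transcripts are less regular than this, which is exactly the difficulty described above.

```python
import xml.etree.ElementTree as ET

# Invented, idealised fragment. All element names and values are assumptions.
SAMPLE = """
<hansard>
  <debate>
    <debateinfo><title>Topic A</title></debateinfo>
    <speech><talk.text>First speech</talk.text></speech>
    <subdebate.1>
      <subdebateinfo><title>Subtopic A1</title></subdebateinfo>
      <speech><talk.text>Second speech</talk.text></speech>
    </subdebate.1>
  </debate>
  <debate>
    <debateinfo><title>Topic B</title></debateinfo>
    <speech><talk.text>Third speech</talk.text></speech>
  </debate>
</hansard>
"""


def speeches_with_topics(xml_text):
    """Return (debate title, sub-debate title or None, speech text) per speech."""
    root = ET.fromstring(xml_text.strip())
    rows = []

    def walk(node, debate, subdebate):
        if node.tag == "debate":
            # A new debate resets any sub-debate inherited from a sibling.
            debate = node.findtext("debateinfo/title")
            subdebate = None
        elif node.tag.startswith("subdebate."):
            subdebate = node.findtext("subdebateinfo/title")
        elif node.tag == "speech":
            text = "".join(node.itertext()).strip()
            rows.append((debate, subdebate, text))
            return
        for child in node:
            walk(child, debate, subdebate)

    walk(root, None, None)
    return rows


for row in speeches_with_topics(SAMPLE):
    print(row)
```

The topic is taken from the speech's ancestors rather than from its page number, so a speech that runs onto a later page keeps its topic. Whether this holds up on real files depends on how consistently speeches are nested under their debate nodes.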

You are correct that joining on page number with the debate topics is unfortunately not going to produce a one-to-one mapping between Hansard speeches and debate topics; we make note of this in the README. While this was outside the scope of this particular project, it is definitely a valuable area for future work!
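A toy example makes the failure mode concrete. The rows below are invented and the column names (page, topic, speech) are placeholders, not the actual column names in the datasets. The sketch does a left join on page number in plain Python.

```python
from collections import defaultdict

# Invented rows: two topics begin on page 10, and the second one runs on to
# page 11, which has no topic row of its own.
topics = [
    {"page": 10, "topic": "Topic A"},
    {"page": 10, "topic": "Topic B"},
]
speeches = [
    {"page": 10, "speech": "s1"},
    {"page": 11, "speech": "s2"},
]


def join_on_page(speeches, topics):
    """Left-join speeches to topics on page number."""
    by_page = defaultdict(list)
    for t in topics:
        by_page[t["page"]].append(t["topic"])
    joined = []
    for s in speeches:
        # A page with no topic row still yields one output row, with topic None.
        matches = by_page.get(s["page"]) or [None]
        for topic in matches:
            joined.append((s["speech"], s["page"], topic))
    return joined


for row in join_on_page(speeches, topics):
    print(row)
```

Speech s1 is duplicated because two topics share its page, and s2 gets no topic because its topic started on an earlier page. Both outcomes follow from the page-number key alone, regardless of how clean the underlying data is.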

I hope this helped to answer your questions, and please don't hesitate to reach out with any additional questions or comments.

Best, Lindsay