stephbuon / hansard-speakers

A data processing pipeline to disambiguate speakers in the 19th-century British Parliamentary debates.
MIT License

Fix instances where the debate text ends up in the speaker field #81

Open stephbuon opened 3 years ago

stephbuon commented 3 years ago

Sometimes debate text ends up in the speaker field:

mr swift macneill said he wished to call attention to the escape of two indentured chinese labourers from johannesburg to pretoria and their capture in the latter place when he referred to the matter upon a previous occasion the colonial secretary said that he had no information in regard to the imprisonment of these men naturally upon all questions affecting personal liberty irishmen sympathised with the victims of oppression and he wished to draw attention to the way the right hon gentleman had acted in the matter on friday last he asked the colonial secretary whether he had received the information which was conveyed to the public in a reuters telegram stating that two of these chinese labourers had escaped from the compounds at the mines and had managed to get from johannesburg to pretoria which was a distance of forty miles upon the point he did not think the colonial secretary had treated the house respectfully because hon members were anxious to have the fullest information in regard to the administration of the ordinance in south africa the right hon gentleman replied to him stating that he would not telegraph but he promised to communicate with lord milner by despatch he had not telegraphed and therefore it was the right hon gentlemans duty to have sent a despatch to lord milner by the mail which left for south africa at two oclock on saturday but there was something far worse connected with the mater and they had only to look at the ordinance itself in order to see the very stringent conditions under which these chinese laboured

We need to find instances like this and fix them by: a) isolating the speaker name in its own column, and b) separating the text so that each sentence has its own row (like our existing CSV file).
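A rough sketch of step (a) in base R, under a big assumption: the speaker name ends right before the first reporting verb ("said", "asked", and so on). The `split_speaker` helper and its verb list are hypothetical, not part of the pipeline, and would need checking against real rows. Step (b) is left out here because the example above has no punctuation to split sentences on.

```r
# Hypothetical helper: split a polluted speaker field into name and speech.
# ASSUMPTION: the name is followed by a reporting verb; the verb list is a guess.
split_speaker <- function(x, verbs = c("said", "asked", "moved", "rose")) {
  # Lazy first group so the name stops at the FIRST verb that appears.
  pattern <- paste0("^(.*?)\\s+(", paste(verbs, collapse = "|"), ")\\s+(.*)$")
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
  if (length(m) == 0) {
    # No verb found: leave the field alone and return no text.
    return(list(speaker = x, text = NA_character_))
  }
  # Keep the verb with the speech, since Hansard reports in the third person.
  list(speaker = m[2], text = paste(m[3], m[4]))
}

polluted <- "mr swift macneill said he wished to call attention to the escape"
fixed <- split_speaker(polluted)
clean <- split_speaker("mr gladstone")
```

Names that themselves contain one of the verbs, or speeches that open with a verb not in the list, would be split wrongly or not at all, so the output should be spot-checked by hand.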

I bet we can find these instances by checking to see if any speakers are more than, say, 30 letters long?
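A minimal sketch of that length check in base R, assuming a data frame with a `speaker` column. The toy rows and the `flag_long_speakers` helper are hypothetical, and the 30-character cutoff is only the guess from above, not a tuned value.

```r
# Hypothetical helper: flag rows whose speaker field is suspiciously long.
flag_long_speakers <- function(df, max_chars = 30) {
  # nchar() counts characters; missing speakers give NA and are not flagged.
  len <- nchar(df$speaker, type = "chars")
  df$speaker_len <- len
  df$suspect <- !is.na(len) & len > max_chars
  df
}

# Toy data for illustration only, not taken from the Hansard CSV.
toy <- data.frame(
  sentence_id = c("S1", "S2", "S3"),
  speaker = c(
    "mr gladstone",
    "mr swift macneill said he wished to call attention to the escape of two labourers",
    NA
  ),
  stringsAsFactors = FALSE
)

flagged <- flag_long_speakers(toy)
suspects <- flagged[flagged$suspect, ]
```

Looking at the distribution of `speaker_len` before picking a cutoff would show whether long but legitimate names (titles, constituencies) sit close to the threshold.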

stephbuon commented 3 years ago

Here are some debate IDs for instances where the debate text ended up as the debate title:

stephbuon commented 2 years ago

@stephbuon check Alexander's work

stephbuon commented 2 years ago

@stephbuon check out the kind of sentence ID assigned to each "returned" sentence.

stephbuon commented 2 years ago

I need to go back in and see how the sentences are being handled:

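library(data.table)
library(tidyverse)

# Load only the columns needed to inspect the speaker field.
a <- fread("~/data/hansard_c19_improved_speaker_names_2.csv") %>%
  select(sentence_id, speaker, new_speaker, text)

# Character length of each speaker value.
a <- a %>%
  mutate(len = str_length(speaker))

# Keep rows whose speaker field is long enough to be debate text, not a name.
# (Filter on len; comparing the speaker string itself to 160 was a bug.)
data3 <- filter(a, len > 160)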