Open stephbuon opened 3 years ago
Here are some debate IDs for instances where the debate text ended up as the debate title:
@stephbuon check Alexander's work
@stephbuon check out the kind of sentence ID assigned to each "returned" sentence.
I need to go back in and see how the sentences are being handled:
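library(data.table)
library(tidyverse)
# Load only the columns needed to inspect the speaker field
a <- fread("~/data/hansard_c19_improved_speaker_names_2.csv") %>%
  select(sentence_id, speaker, new_speaker, text)
# Character count of each speaker value
a <- a %>%
  mutate(len = str_length(speaker))
# Keep rows whose speaker value is implausibly long (filter on len, not on speaker itself)
data3 <- filter(a, len > 160)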
Sometimes debate text ends up in the speaker field:
We need to find instances like this and fix them by (a) isolating the speaker name in its own column and (b) splitting the text so that each sentence has its own row, as in our existing CSV file.
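Step (b) could look something like the sketch below. It is only a rough illustration on an invented toy row, not the real data: the split rule (break after `.`, `?` or `!` followed by whitespace) is an assumption and is naive, and `sentence_no` is a hypothetical counter, not our actual `sentence_id` scheme. Step (a) is left out because it depends on how the name and the text got fused in each row.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Toy row with invented values: one record whose text holds several sentences.
toy <- tibble(
  speaker = "Viscount Palmerston",
  text = "The motion was agreed to. Was any notice given? The House then adjourned."
)

one_per_row <- toy %>%
  # Split after ., ? or ! followed by whitespace.
  # Naive: this would also split after abbreviations such as "Mr."
  mutate(text = str_split(text, "(?<=[.?!])\\s+")) %>%
  unnest(text) %>%
  group_by(speaker) %>%
  # Hypothetical running counter per speaker, for illustration only
  mutate(sentence_no = row_number()) %>%
  ungroup()
```

Because Hansard text is full of abbreviations like "Mr." and "hon.", a real pass would probably need a proper sentence tokenizer or a list of exceptions rather than this regex.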
I bet we can find these instances by checking whether any speaker values are more than, say, 30 characters long?
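A quick sketch of that length check, run on invented toy rows rather than the real CSV. The column names match the ones selected earlier in this thread; `max_speaker_chars` is a made-up name, and 30 is just the guess from above, not a tested cutoff.

```r
library(dplyr)
library(stringr)

# Toy rows with invented values, standing in for the Hansard CSV.
toy <- tibble(
  sentence_id = c("S1", "S2", "S3"),
  speaker = c(
    "Mr. Gladstone",
    "Mr. Disraeli said that the measure before the House was ill considered.",
    "The Chancellor of the Exchequer"
  ),
  text = c("I beg to move.", "", "The estimates are on the table.")
)

max_speaker_chars <- 30  # assumption: threshold to tune against the real data

flagged <- toy %>%
  mutate(len = str_length(speaker)) %>%  # characters in each speaker value
  filter(len > max_speaker_chars)        # keep only suspiciously long values
```

In this toy data the check flags S2, which is the kind of row we want, but also S3: "The Chancellor of the Exchequer" is 31 characters. So a cutoff of 30 will likely catch some legitimate titles, and the flagged rows would need a manual look or a somewhat higher threshold.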