Open iangow opened 3 years ago
@yiyangw2 I would suggest making a test notebook for this one. Provide some examples where the current function fails, and some examples where it works, then these can form tests for the new function.
It seems that the overwhelming majority of detected "years" are years. It might be worth looking at some of the cases that look like bad matches (e.g., `2017357`).
For details on `~` and `regexp_matches`, see here.
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres(),
                bigint = "integer",
                check_interrupts = TRUE)
rs <- dbExecute(pg, "SET work_mem TO '3GB'")
speaker_data <- tbl(pg, sql("SELECT * FROM streetevents.speaker_data"))
# Note that PostgreSQL doesn't have \b and using R means I need to "escape"
# backslashes. But it can be faster to process regexes in the database when
# possible.
year_regex <- "((?:199|20[01])\\d\\w*)\\W"
year_data <-
  speaker_data %>%
  filter(speaker_text %~% year_regex) %>%
  mutate(years = regexp_matches(speaker_text, year_regex)) %>%
  select(file_name, last_update, context, section, speaker_number, years) %>%
  compute()

year_data %>%
  mutate(year = unnest(years)) %>%
  count(year) %>%
  filter(n > 1) %>%
  arrange(desc(n)) %>%
  print(n = 50)
#> # Source: lazy query [?? x 2]
#> # Database: postgres [iangow@/tmp:5432/crsp]
#> # Ordered by: desc(n)
#> year n
#> <chr> <int>
#> 1 2019 304159
#> 2 2018 228285
#> 3 2010 211250
#> 4 2017 206699
#> 5 2015 206663
#> 6 2016 196445
#> 7 2014 194728
#> 8 2012 194355
#> 9 2011 192004
#> 10 2008 187504
#> 11 2009 184163
#> 12 2013 179339
#> 13 2007 155638
#> 14 2006 140269
#> 15 2005 128132
#> 16 2004 113174
#> 17 2003 88523
#> 18 2002 50922
#> 19 2000 36374
#> 20 1995 33795
#> 21 2001 24760
#> 22 1999 7419
#> 23 1998 5592
#> 24 1990s 5556
#> 25 1997 4183
#> 26 2000s 3695
#> 27 1996 3413
#> 28 1990 3265
#> 29 1994 2661
#> 30 1992 2282
#> 31 1993 2197
#> 32 1991 1884
#> 33 2010s 89
#> 34 2013s 69
#> 35 2017s 53
#> 36 2014s 50
#> 37 2012s 46
#> 38 2016s 41
#> 39 2015s 41
#> 40 2019s 40
#> 41 2018s 38
#> 42 2011s 28
#> 43 20000 27
#> 44 20009 22
#> 45 2011and 17
#> 46 2009s 15
#> 47 20022 15
#> 48 2007s 14
#> 49 2017357 14
#> 50 20111 13
#> # … with more rows
Created on 2021-04-12 by the reprex package (v2.0.0)
So `2017357` is part of `CK-2017357`, which is a product name (not a year!).
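To see why the regex picks up these strings, here is a minimal Python sketch that applies the same pattern to a few phrases taken from the output in this thread. It assumes Python's `re` module treats this particular pattern the same way PostgreSQL does for such simple cases; `find_year_like` is a hypothetical helper name, not something in the repository.

```python
import re

# Same pattern as year_regex in the R code (single backslashes in a raw string).
YEAR_REGEX = re.compile(r"((?:199|20[01])\d\w*)\W")

def find_year_like(text):
    """Return every captured 'year-like' token found in text."""
    return YEAR_REGEX.findall(text)

print(find_year_like("our compound ck-2017357, or ck-357"))
print(find_year_like("going forward into 2011and beyond."))
print(find_year_like("back in the 1990s, and in 2019."))
print(find_year_like("ended in 2019"))
```

The `\w*` is what lets trailing digits and letters into the capture, and the trailing `\W` means a year at the very end of a text, with nothing after it, is not captured at all.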
You might find it easier to use `compute(name = "year_data", temporary = FALSE)` and then switch to Python (should be small enough to use `pd.read_sql` if you use `WHERE` to filter out rows).
library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(stringr)
library(tidytext)
pg <- dbConnect(RPostgres::Postgres(),
                bigint = "integer",
                check_interrupts = TRUE)
rs <- dbExecute(pg, "SET work_mem TO '3GB'")
speaker_data <- tbl(pg, sql("SELECT * FROM streetevents.speaker_data"))
# Note that PostgreSQL doesn't have \b and using R means I need to "escape"
# backslashes. But it can be faster to process regexes in the database when
# possible.
year_regex <- "((?:199|20[01])\\d\\w*)\\W"
year_data <-
  speaker_data %>%
  filter(speaker_text %~% year_regex) %>%
  mutate(years = regexp_matches(speaker_text, year_regex)) %>%
  select(file_name, last_update, context, section, speaker_number, years) %>%
  compute()

word_to_check <- "2017357"

temp <-
  year_data %>%
  mutate(year = unnest(years)) %>%
  filter(year == word_to_check) %>%
  inner_join(speaker_data) %>%
  collect()
#> Joining, by = c("file_name", "last_update", "context", "section", "speaker_number")
temp %>%
  unnest_sentences("sentence", speaker_text) %>%
  filter(str_detect(sentence, word_to_check)) %>%
  mutate(context = str_extract(sentence,
                               str_c(".{0,50}", word_to_check, ".{0,50}"))) %>%
  select(context)
#> # A tibble: 16 x 1
#> context
#> <chr>
#> 1 "age programs, including omecamtiv mecarbil and ck-2017357, which we'll refe…
#> 2 "ate from our skeletal muscle activator program ck-2017357 or ck-357; one in…
#> 3 "age programs, including omecamtiv mecarbil and ck-2017357, which we'll refe…
#> 4 "age programs, including omecamtiv mecarbil and ck-2017357, which we'll refe…
#> 5 "ur most recent developments in connection with ck-2017357, which we'll refe…
#> 6 "omere activator program and our drug candidate ck-2017357, or ck-357, which…
#> 7 "hase iia evidence of effect clinical trials of ck-2017357 or ck-357; one in…
#> 8 "and as fady has described, our compound ck-2017357, or we will just call it…
#> 9 "ctives on the phase iia clinical trial data of ck-2017357 in patients with …
#> 10 "covery and advancement of omecamtiv as well as ck-2017357."
#> 11 " the agenda; first speaking to a rationale for ck-2017357 in als, dr."
#> 12 "ities, next steps for our clinical program for ck-2017357, which we'll refe…
#> 13 "speaking, as ritu pointed out, to our compound ck-2017357."
#> 14 "carbil and tirasemtiv, formerly referred to as ck-2017357."
#> 15 "adoption of tirasemtiv as the generic name for ck-2017357."
#> 16 "l development of tirasemtiv, formerly known as ck-2017357, for the potentia…
Created on 2021-04-12 by the reprex package (v2.0.0)
For example, run this code:
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres(),
                bigint = "integer",
                check_interrupts = TRUE)
rs <- dbExecute(pg, "SET work_mem TO '3GB'")
speaker_data <- tbl(pg, sql("SELECT * FROM streetevents.speaker_data"))
# Note that PostgreSQL doesn't have \b and using R means I need to "escape"
# backslashes. But it can be faster to process regexes in the database when
# possible.
year_regex <- "((?:199|20[01])\\d\\w*)\\W"
rs <- dbExecute(pg, "DROP TABLE IF EXISTS year_data_temp")
year_data <-
  speaker_data %>%
  filter(speaker_text %~% year_regex) %>%
  mutate(years = regexp_matches(speaker_text, year_regex)) %>%
  select(file_name, last_update, context, section, speaker_number, years) %>%
  compute(name = "year_data_temp", temporary = FALSE)
Created on 2021-04-12 by the reprex package (v2.0.0)
Then, this part could be done in Python:
library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(stringr)
library(tidytext)
pg <- dbConnect(RPostgres::Postgres(),
                bigint = "integer",
                check_interrupts = TRUE)
rs <- dbExecute(pg, "SET work_mem TO '3GB'")
speaker_data <- tbl(pg, sql("SELECT * FROM streetevents.speaker_data"))
year_data <- tbl(pg, "year_data_temp")

word_to_check <- "2011and"

temp <-
  year_data %>%
  mutate(year = unnest(years)) %>%
  filter(year == word_to_check) %>%
  inner_join(speaker_data) %>%
  collect()
#> Joining, by = c("file_name", "last_update", "context", "section", "speaker_number")
temp %>%
  unnest_sentences("sentence", speaker_text) %>%
  filter(str_detect(sentence, word_to_check)) %>%
  mutate(context = str_extract(sentence,
                               str_c(".{0,50}", word_to_check, ".{0,50}"))) %>%
  select(context)
#> # A tibble: 17 x 1
#> context
#> <chr>
#> 1 we have posted and the outlook going forward into 2011and beyond.
#> 2 the book-to-bill ratios for 2011and 2012, what is our degree of confidence i…
#> 3 do you see that impact demand this year then into 2011and beyond?
#> 4 ke to do if i can is start with a quick review of 2011and then finish up bef…
#> 5 is that we could potentially see well activity in 2011and i wouldn't expect …
#> 6 p brands and products for strong growth in fiscal 2011and beyond.
#> 7 timated levels of revenue in q4 and the full year 2011and ebitda margin for …
#> 8 these statements speak only as of april 21, 2011and pnc undertakes no obliga…
#> 9 competition will be there throughout 2011and beyond.
#> 10 y, i would love to get an idea that as we go into 2011and hopefully start se…
#> 11 leases, your lease business, is up for renewal in 2011and 2012, and is it mo…
#> 12 ectivity's second-quarter results for fiscal year 2011and our updated outloo…
#> 13 we'll grow reserves in 2011and increase production over 2010 levels.
#> 14 ou for attending eastplat's conference call on q4 2011and year 2011 financia…
#> 15 l report on form 10-k for the year ended june 30, 2011and other filings, par…
#> 16 ur employees focused on delivering our results in 2011and into 2012, but obv…
#> 17 very positive results for 2011and a lot of positive momentum at national pen…
Created on 2021-04-12 by the reprex package (v2.0.0)
It seems that `2011and` is picking up years, as each case should read "2011 and". Not much one can do with these … there will be errors in a database this large.
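Since the context-checking step could be done in Python, here is a rough sketch of the `str_extract` part only. It assumes the text has already been pulled into Python as plain strings; `context_window` is a hypothetical helper, and the sentence splitting and lowercasing that `unnest_sentences` performs are not reproduced.

```python
import re

def context_window(text, word, width=50):
    """Return word with up to `width` characters on each side, or None."""
    # re.escape guards against regex metacharacters in the word being checked.
    pattern = ".{0,%d}%s.{0,%d}" % (width, re.escape(word), width)
    match = re.search(pattern, text)
    return match.group(0) if match else None

sentence = "competition will be there throughout 2011and beyond."
print(context_window(sentence, "2011and"))
print(context_window("growth in 2011and beyond", "2011and", width=5))
```

As in the R version, only the first occurrence in each string is returned.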
Thanks a lot for your explanation! That makes sense; we should take 2011and as a year (not a number). By the way, which R package provides unnest_sentences? I tried stringr, tidyverse, and lexRankr, but it did not work with any of them.
tidytext
> we should take 2011and as a year (not a number).
We shouldn't adapt the function too much for specific cases. I guess we might want to exclude any case where letters are involved from `number_count`. So for now don't add `2011and` to the "`is_year`" regex.
Thanks a lot for your explanation! I begin to see the tradeoff. Being exact has its costs. BTW, tidytext works!
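The suggestion above, to leave out any token that involves letters rather than special-casing it, could look roughly like the following. This is only a sketch: `classify_token` and its three labels are hypothetical, and the year pattern is the `(199|20[01])\d` form used at this point in the thread.

```python
import re

YEAR = re.compile(r"(?:199|20[01])\d")
NUMBER = re.compile(r"[0-9][0-9,.]*")

def classify_token(token):
    """Label a token as 'year', 'number', or 'other' (hypothetical labels)."""
    if re.search(r"[A-Za-z]", token):
        return "other"      # letters involved: neither a year nor a number
    if YEAR.fullmatch(token):
        return "year"       # exactly four digits in the accepted range
    if NUMBER.fullmatch(token):
        return "number"
    return "other"

for token in ["2019", "2011and", "1990s", "2017357", "20000"]:
    print(token, classify_token(token))
```

Using `fullmatch` rather than `match` keeps longer digit strings such as `20000` out of the year bucket.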
I find that 2020 is not a rare case. So should we also include it?
import re

def number_count(raw):
    """ Function to count the number of numbers appearing in a
        passage of text.
    """
    results = re.findall(r'\b(?<=-)?[,0-9\.]+(?=\s)', raw)
    results = [result for result in results
               if not re.match(r'(199|20[012])\d', result)
               and re.search(r'[0-9]', result)]
    return len(results)
Your code catches 2020 more frequently than 2019.
> I find that 2020 is not a rare case. So should we also include it?
Yes. Perhaps look at `r'(199|20[0123])\d'` just in case firms are talking about 2030s already.
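A quick way to see what the wider pattern changes is to run both against a few tokens. This is a small sketch; `looks_like_year` and the two constant names are made up for illustration.

```python
import re

# Pattern currently in number_count, and the wider one suggested above.
CURRENT_YEAR = re.compile(r"(199|20[012])\d")
WIDER_YEAR = re.compile(r"(199|20[0123])\d")

def looks_like_year(pattern, token):
    """Return True if token starts with a year accepted by pattern."""
    return pattern.match(token) is not None

for token in ["1989", "1995", "2020", "2035", "2040"]:
    print(token,
          looks_like_year(CURRENT_YEAR, token),
          looks_like_year(WIDER_YEAR, token))
```

Only the 2030s tokens change classification between the two patterns.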
Yes, I checked and, as you can see, there are often a decent number of them. I checked their content and I think we should use `r'(199|20[0123])\d'`.
I am also thinking about making changes to the number-count function. Currently, (1) it treats 20000 as a year (because '20000' begins with 2000, and `re.match` only checks the start of the string), and (2) it does not count a number that appears at the end of a text unless the number is followed by a blank. I made the following changes:
_Originally posted by @yiyangw2 in https://github.com/iangow/ling_features/issues/4#issuecomment-817965428_
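The two problems described above can be illustrated with a minimal sketch. This is not the poster's actual change; `number_count_sketch`, the `(?=\s|$)` lookahead, and the stripping of trailing punctuation before a full-string year check are all assumptions made for illustration. The optional lookbehind `(?<=-)?` from the original pattern is omitted here because an optional zero-width assertion does not change what matches.

```python
import re

# Candidate numbers: digits, commas and periods, followed by whitespace
# or by the end of the text (assumed fix for issue 2).
NUMBER = re.compile(r"\b[,0-9.]+(?=\s|$)")
# Year check applied to the whole token (assumed fix for issue 1).
YEAR = re.compile(r"(?:199|20[0123])\d")

def number_count_sketch(raw):
    """Count number-like tokens in raw, skipping tokens that are exactly a year."""
    count = 0
    for candidate in NUMBER.findall(raw):
        token = candidate.strip(".,")       # drop sentence punctuation
        if not re.search(r"[0-9]", token):  # e.g. a lone comma
            continue
        if YEAR.fullmatch(token):           # '2019' is a year, '20000' is not
            continue
        count += 1
    return count

print(number_count_sketch("We sold 20000 units"))
print(number_count_sketch("Revenue grew to 500"))
print(number_count_sketch("In 2019, sales rose 5.5 percent"))
```

Matching the year against the whole token is what keeps `20000` in the count, while the `$` alternative in the lookahead picks up a number that ends the text.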