iangow / ling_features

Functions for extracting commonly used linguistic features from text.
MIT License

Tweak number-count function to treat 20000 as a number #5

Open iangow opened 3 years ago

iangow commented 3 years ago

I am also thinking about changing the number-count function. Currently, (1) it treats 20000 as a year (because '20000' contains 2000), and (2) it does not count a number that appears at the end of a text unless the number is followed by a blank. I propose the following changes:

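import re
import string

def number_count(doc):
  # Drop "." and "," that sit between two digits, so "20,000" becomes "20000"
  doc = re.sub(r'(?<=[0-9])(\.|,)(?=[0-9])', '', doc)
  # Replace all remaining punctuation with spaces
  doc = doc.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))

  doc = re.findall(r'\b[-+\(]?[$€£]?[-+(]?\d+\)?\b', doc)
  # Exclude four-digit years (1990-2019); the trailing \b keeps 20000 as a number
  doc = [x for x in doc if not re.match(r'(199|20[01])\d\b', x)]
  return len(doc)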

_Originally posted by @yiyangw2 in https://github.com/iangow/ling_features/issues/4#issuecomment-817965428_

iangow commented 3 years ago

@yiyangw2 I would suggest making a test notebook for this one. Provide some examples where the current function fails, and some examples where it works, then these can form tests for the new function.
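
A minimal sketch of what such tests could look like. It assumes the proposed function from the first comment, with the lookbehind written as `(?<=[0-9])`. The example sentences and the `CASES` list are made up for illustration and are not taken from the StreetEvents data.

```python
import re
import string


def number_count(doc):
    # Proposed version from this thread (not necessarily what is in the repo).
    doc = re.sub(r'(?<=[0-9])(\.|,)(?=[0-9])', '', doc)
    doc = doc.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    tokens = re.findall(r'\b[-+\(]?[$€£]?[-+(]?\d+\)?\b', doc)
    tokens = [x for x in tokens if not re.match(r'(199|20[01])\d\b', x)]
    return len(tokens)


# Hypothetical test cases: (text, expected count)
CASES = [
    ("We sold 20000 units", 1),                  # 20000 is a number, not a year
    ("Revenue grew in 2015", 0),                 # a plain year is excluded
    ("We opened 12 stores and closed 3", 2),     # number at the end of the text
    ("Sales were 20,000 in 2011and beyond", 1),  # "2011and" is not counted
]

for text, expected in CASES:
    assert number_count(text) == expected, text
```

Cases where the current function fails can be added to the same list, so the list doubles as a regression test once the new function is merged.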

iangow commented 3 years ago

It seems that the overwhelming majority of detected "years" are years. It might be worth looking at some of the cases that look like bad matches (e.g., 2017357).

For details on the ~ operator and regexp_matches, see the PostgreSQL documentation on pattern matching.

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres(),
                bigint = "integer",
                check_interrupts = TRUE)

rs <- dbExecute(pg, "SET work_mem TO '3GB'")

speaker_data <- tbl(pg, sql("SELECT * FROM streetevents.speaker_data"))

# Note that PostgreSQL doesn't have \b and using R means I need to "escape"
# backslashes. But it can be faster to process regexes in the database when
# possible.
year_regex <- "((?:199|20[01])\\d\\w*)\\W"

year_data <-
  speaker_data %>%
  filter(speaker_text %~% year_regex) %>%
  mutate(years = regexp_matches(speaker_text, year_regex)) %>%
  select(file_name, last_update, context, section, speaker_number, years) %>%
  compute()

year_data %>% 
  mutate(year = unnest(years)) %>% 
  count(year) %>% 
  filter(n > 1) %>%
  arrange(desc(n)) %>%
  print(n = 50)
#> # Source:     lazy query [?? x 2]
#> # Database:   postgres [iangow@/tmp:5432/crsp]
#> # Ordered by: desc(n)
#>    year         n
#>    <chr>    <int>
#>  1 2019    304159
#>  2 2018    228285
#>  3 2010    211250
#>  4 2017    206699
#>  5 2015    206663
#>  6 2016    196445
#>  7 2014    194728
#>  8 2012    194355
#>  9 2011    192004
#> 10 2008    187504
#> 11 2009    184163
#> 12 2013    179339
#> 13 2007    155638
#> 14 2006    140269
#> 15 2005    128132
#> 16 2004    113174
#> 17 2003     88523
#> 18 2002     50922
#> 19 2000     36374
#> 20 1995     33795
#> 21 2001     24760
#> 22 1999      7419
#> 23 1998      5592
#> 24 1990s     5556
#> 25 1997      4183
#> 26 2000s     3695
#> 27 1996      3413
#> 28 1990      3265
#> 29 1994      2661
#> 30 1992      2282
#> 31 1993      2197
#> 32 1991      1884
#> 33 2010s       89
#> 34 2013s       69
#> 35 2017s       53
#> 36 2014s       50
#> 37 2012s       46
#> 38 2016s       41
#> 39 2015s       41
#> 40 2019s       40
#> 41 2018s       38
#> 42 2011s       28
#> 43 20000       27
#> 44 20009       22
#> 45 2011and     17
#> 46 2009s       15
#> 47 20022       15
#> 48 2007s       14
#> 49 2017357     14
#> 50 20111       13
#> # … with more rows

Created on 2021-04-12 by the reprex package (v2.0.0)
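
For anyone following along in Python, here is a small sketch of what the `year_regex` pattern above picks up. The `sample` sentence and the helper name `year_like_tokens` are made up for illustration. One difference to keep in mind: `re.findall` returns every match, whereas (as far as I know) `regexp_matches` without the `'g'` flag returns only the first match per row.

```python
import re

# Same pattern as year_regex above, written as a Python raw string.
YEAR_REGEX = re.compile(r"((?:199|20[01])\d\w*)\W")


def year_like_tokens(text):
    """Return every captured token that starts with 199x, 200x or 201x."""
    return YEAR_REGEX.findall(text)


# Hypothetical sentence mixing a real year with the odd cases from the table.
sample = "In 2019 we sold 20000 units of ck-2017357, and 2011and beyond looks good."
print(year_like_tokens(sample))
```

Because of the `\w*`, anything that merely starts like a year (20000, 2017357, 2011and) is captured along with genuine years, which is why those strings show up in the counts.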

iangow commented 3 years ago

So 2017357 is part of CK-2017357, which is a product name (not a year!).

You might find it easier to use compute(name = "year_data", temporary = FALSE) and then switch to Python (should be small enough to use pd.read_sql if you use WHERE to filter out rows).

library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(stringr)
library(tidytext)

pg <- dbConnect(RPostgres::Postgres(),
                bigint = "integer",
                check_interrupts = TRUE)

rs <- dbExecute(pg, "SET work_mem TO '3GB'")

speaker_data <- tbl(pg, sql("SELECT * FROM streetevents.speaker_data"))

# Note that PostgreSQL doesn't have \b and using R means I need to "escape"
# backslashes. But it can be faster to process regexes in the database when
# possible.
year_regex <- "((?:199|20[01])\\d\\w*)\\W"

year_data <-
  speaker_data %>%
  filter(speaker_text %~% year_regex) %>%
  mutate(years = regexp_matches(speaker_text, year_regex)) %>%
  select(file_name, last_update, context, section, speaker_number, years) %>%
  compute()

word_to_check <- "2017357"

temp <- 
  year_data %>% 
  mutate(year = unnest(years)) %>%
  filter(year == word_to_check) %>% 
  inner_join(speaker_data) %>%
  collect()
#> Joining, by = c("file_name", "last_update", "context", "section", "speaker_number")

temp %>%
  unnest_sentences("sentence", speaker_text) %>%
  filter(str_detect(sentence, word_to_check)) %>%
  mutate(context = str_extract(sentence, 
                               str_c(".{0,50}", word_to_check, ".{0,50}"))) %>%
  select(context)
#> # A tibble: 16 x 1
#>    context                                                                      
#>    <chr>                                                                        
#>  1 "age programs, including omecamtiv mecarbil and ck-2017357, which we'll refe…
#>  2 "ate from our skeletal muscle activator program ck-2017357 or ck-357; one in…
#>  3 "age programs, including omecamtiv mecarbil and ck-2017357, which we'll refe…
#>  4 "age programs, including omecamtiv mecarbil and ck-2017357, which we'll refe…
#>  5 "ur most recent developments in connection with ck-2017357, which we'll refe…
#>  6 "omere activator program and our drug candidate ck-2017357, or ck-357, which…
#>  7 "hase iia evidence of effect clinical trials of ck-2017357 or ck-357; one in…
#>  8 "and as fady has described, our compound ck-2017357, or we will just call it…
#>  9 "ctives on the phase iia clinical trial data of ck-2017357 in patients with …
#> 10 "covery and advancement of omecamtiv as well as ck-2017357."                 
#> 11 " the agenda; first speaking to a rationale for ck-2017357 in als, dr."      
#> 12 "ities, next steps for our clinical program for ck-2017357, which we'll refe…
#> 13 "speaking, as ritu pointed out, to our compound ck-2017357."                 
#> 14 "carbil and tirasemtiv, formerly referred to as ck-2017357."                 
#> 15 "adoption of tirasemtiv as the generic name for ck-2017357."                 
#> 16 "l development of tirasemtiv, formerly known as ck-2017357, for the potentia…

Created on 2021-04-12 by the reprex package (v2.0.0)

iangow commented 3 years ago

For example, run this code:

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres(),
                bigint = "integer",
                check_interrupts = TRUE)

rs <- dbExecute(pg, "SET work_mem TO '3GB'")

speaker_data <- tbl(pg, sql("SELECT * FROM streetevents.speaker_data"))

# Note that PostgreSQL doesn't have \b and using R means I need to "escape"
# backslashes. But it can be faster to process regexes in the database when
# possible.
year_regex <- "((?:199|20[01])\\d\\w*)\\W"

rs <- dbExecute(pg, "DROP TABLE IF EXISTS year_data_temp")

year_data <-
  speaker_data %>%
  filter(speaker_text %~% year_regex) %>%
  mutate(years = regexp_matches(speaker_text, year_regex)) %>%
  select(file_name, last_update, context, section, speaker_number, years) %>%
  compute(name = "year_data_temp", temporary = FALSE)

Created on 2021-04-12 by the reprex package (v2.0.0)

Then, this part could be done in Python:

library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(stringr)
library(tidytext)

pg <- dbConnect(RPostgres::Postgres(),
                bigint = "integer",
                check_interrupts = TRUE)

rs <- dbExecute(pg, "SET work_mem TO '3GB'")

speaker_data <- tbl(pg, sql("SELECT * FROM streetevents.speaker_data"))

year_data <- tbl(pg, "year_data_temp")
word_to_check <- "2011and"

temp <- 
  year_data %>% 
  mutate(year = unnest(years)) %>%
  filter(year == word_to_check) %>% 
  inner_join(speaker_data) %>%
  collect()
#> Joining, by = c("file_name", "last_update", "context", "section", "speaker_number")

temp %>%
  unnest_sentences("sentence", speaker_text) %>%
  filter(str_detect(sentence, word_to_check)) %>%
  mutate(context = str_extract(sentence, 
                               str_c(".{0,50}", word_to_check, ".{0,50}"))) %>%
  select(context)
#> # A tibble: 17 x 1
#>    context                                                                      
#>    <chr>                                                                        
#>  1 we have posted and the outlook going forward into 2011and beyond.            
#>  2 the book-to-bill ratios for 2011and 2012, what is our degree of confidence i…
#>  3 do you see that impact demand this year then into 2011and beyond?            
#>  4 ke to do if i can is start with a quick review of 2011and then finish up bef…
#>  5 is that we could potentially see well activity in 2011and i wouldn't expect …
#>  6 p brands and products for strong growth in fiscal 2011and beyond.            
#>  7 timated levels of revenue in q4 and the full year 2011and ebitda margin for …
#>  8 these statements speak only as of april 21, 2011and pnc undertakes no obliga…
#>  9 competition will be there throughout 2011and beyond.                         
#> 10 y, i would love to get an idea that as we go into 2011and hopefully start se…
#> 11 leases, your lease business, is up for renewal in 2011and 2012, and is it mo…
#> 12 ectivity's second-quarter results for fiscal year 2011and our updated outloo…
#> 13 we'll grow reserves in 2011and increase production over 2010 levels.         
#> 14 ou for attending eastplat's conference call on q4 2011and year 2011 financia…
#> 15 l report on form 10-k for the year ended june 30, 2011and other filings, par…
#> 16 ur employees focused on delivering our results in 2011and into 2012, but obv…
#> 17 very positive results for 2011and a lot of positive momentum at national pen…

Created on 2021-04-12 by the reprex package (v2.0.0)

It seems that 2011and is picking up years, as each case should read "2011 and". Not much one can do with these … there will be errors in a database this large.
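
A rough Python sketch of the context-extraction step above, assuming the text has already been pulled into Python (e.g., with pd.read_sql). The function name `context_windows` and the `width` parameter are hypothetical; unlike `str_extract`, which keeps only the first match per sentence, this returns every non-overlapping match.

```python
import re


def context_windows(text, word, width=50):
    """Return each occurrence of `word` with up to `width` characters on either side."""
    # re.escape guards against regex metacharacters in the word being checked.
    pattern = re.compile(".{0,%d}%s.{0,%d}" % (width, re.escape(word), width))
    return pattern.findall(text)


# Sentence 13 from the output above.
sentence = "we'll grow reserves in 2011and increase production over 2010 levels."
for context in context_windows(sentence, "2011and", width=20):
    print(context)
```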

yiyangw2 commented 3 years ago

Thanks a lot for your explanation! That makes sense: we should take 2011and as a year (not a number). By the way, which R package provides the unnest_sentences function? I tried stringr, tidyverse, and lexRankr, but none of them has it.

iangow commented 3 years ago

tidytext

iangow commented 3 years ago

> we should take 2011and as a year (not a number).

We shouldn't adapt the function too much for specific cases. I guess we might want to exclude any case where letters are involved from number_count. So for now don't add 2011and to the "is_year" regex.

yiyangw2 commented 3 years ago

Thanks a lot for your explanation! I begin to see the tradeoff. Being exact has its costs. BTW, tidytext works!

yiyangw2 commented 3 years ago

I find that 2020 is not a rare case. So should we also include it?

import re

def number_count(raw):
    """ Function to count the number of numbers appearing in a
        passage of text.
    """
    results = re.findall(r'\b(?<=-)?[,0-9\.]+(?=\s)', raw)
    # Keep tokens that contain a digit and do not start like a year (1990-2029)
    results = [result for result in results
               if not re.match(r'(199|20[012])\d', result)
               and re.search(r'[0-9]', result)]
    return len(results)

yiyangw2 commented 3 years ago

Your code catches 2020 more often than it catches 2019.

iangow commented 3 years ago

> I find that 2020 is not a rare case. So should we also include it?

Yes. Perhaps look at r'(199|20[0123])\d' just in case firms are talking about 2030s already.
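
One thing worth checking with the widened pattern, sketched below: `re.match` only anchors at the start of the token, so without a trailing `\b` (or `fullmatch`) strings like 20000 still look like years, which is the problem in this issue's title. The helper names `is_year_prefix` and `is_year_exact` are hypothetical and not part of the repo.

```python
import re

YEAR = re.compile(r'(199|20[0123])\d')


def is_year_prefix(token):
    """True if the token merely starts with a year (how re.match behaves)."""
    return YEAR.match(token) is not None


def is_year_exact(token):
    """True only if the whole token is a four-digit year from 1990 to 2039."""
    return YEAR.fullmatch(token) is not None


for token in ["2020", "2035", "20000", "2017357", "2011and", "1989"]:
    print(token, is_year_prefix(token), is_year_exact(token))
```

With `fullmatch`, 20000 and 2017357 stay in the number count, and tokens containing letters such as 2011and are not classed as years, in line with not adding them to the "is_year" regex.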

yiyangw2 commented 3 years ago

Yes, I checked, and as you can see, there are often a decent number of them. I checked their content, and I think we should use r'(199|20[0123])\d'. [image]