Open Yvonne-Han opened 4 years ago
Double confirmed that speaker_data doesn't have speaker_number == 0 entries, so this should be an issue in our code when creating se_features.
library(dbplyr, warn.conflicts = FALSE)
library(tidyr)
library(DBI)
library(tidyverse)
library(reprex)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, se_features")
speaker_data <- tbl(pg, "speaker_data")
word_counts <- tbl(pg, "word_counts")
tone_measure <- tbl(pg, "tone_measure")
fog_measure <- tbl(pg, "fog_measure")
word_counts %>%
  filter(speaker_number == 0) %>%
  count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#> n
#> <int64>
#> 1 7459
word_counts %>%
  filter(speaker_number == 0) %>%
  anti_join(speaker_data) %>%
  count()
#> Joining, by = c("file_name", "last_update", "speaker_name", "speaker_number", "context", "section")
#> # Source: lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#> n
#> <int64>
#> 1 7459
Created on 2020-04-19 by the reprex package (v0.3.0)
I think you may want to check what the function returns when you run it on this call. If you create a Python Notebook in the directory where the files are, then you should be able to import some_function from fog.fog or something like that.
After trying a few things, I think this chunk of code below is probably the reason why speaker_number == 0 entries are created (the same applies to other se_features tables).
So the next step would be to figure out under which circumstances we get len(speaker_data) == 0.
When creating a list of files to be processed, we select file_name and max(last_update) from streetevents.calls:
Therefore, whenever the max(last_update) in the calls table has no matching last_update in the speaker_data table (the matching step is shown below), no speaker_data is returned (i.e., len(speaker_data) == 0).
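The mismatch described above can be reproduced in miniature. What follows is only a minimal sketch: it uses in-memory SQLite tables as stand-ins for streetevents.calls and streetevents.speaker_data, and the sample rows, the helper names latest_calls_without_speakers and build_demo_db, and the use of sqlite3 instead of the PostgreSQL database are all assumptions made for illustration, not the project's code.

```python
import sqlite3


def latest_calls_without_speakers(conn):
    """Return (file_name, last_update) pairs where the latest last_update
    in calls has no matching row in speaker_data."""
    sql = """
        WITH latest AS (
            SELECT file_name, MAX(last_update) AS last_update
            FROM calls
            GROUP BY file_name
        )
        SELECT l.file_name, l.last_update
        FROM latest AS l
        LEFT JOIN speaker_data AS s
          ON s.file_name = l.file_name
         AND s.last_update = l.last_update
        WHERE s.file_name IS NULL
        ORDER BY l.file_name
    """
    return conn.execute(sql).fetchall()


def build_demo_db():
    """Build hypothetical stand-in tables (not real StreetEvents data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE calls (file_name TEXT, last_update TEXT)")
    conn.execute(
        "CREATE TABLE speaker_data "
        "(file_name TEXT, last_update TEXT, speaker_number INTEGER)"
    )
    # Call A was updated twice; speaker_data only has the older version.
    conn.executemany(
        "INSERT INTO calls VALUES (?, ?)",
        [("A", "2020-01-01"), ("A", "2020-02-01"), ("B", "2020-01-15")],
    )
    conn.executemany(
        "INSERT INTO speaker_data VALUES (?, ?, ?)",
        [("A", "2020-01-01", 1), ("B", "2020-01-15", 1), ("B", "2020-01-15", 2)],
    )
    return conn


if __name__ == "__main__":
    # Calls listed here would yield len(speaker_data) == 0 when processed.
    print(latest_calls_without_speakers(build_demo_db()))
```

In this toy data, call A is the only one whose latest last_update is missing from speaker_data, so it is the one that would come back with no speaker rows.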
@iangow I guess this issue can be solved by:
1. deleting the speaker_number == 0 entries (because if they were sitting in the word_counts table, the code would not identify them as unprocessed calls and would skip them); and
2. running the code for the word_counts table again (since the latest update is now matched in calls and speaker_data).
3. Doing the same for the other se_features tables.

Yes, I think that's the right approach. Maybe make an issue for each affected table (unless it's easy to handle all in one go). It should not take long to re-process the affected files.
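The delete-then-re-run approach discussed above could look roughly like the sketch below. It assumes that the run scripts treat a call as processed when its (file_name, latest last_update) pair already appears in the target table; that rule, the helper names, and the SQLite stand-in tables are hypothetical and are not taken from word_counts_run.py.

```python
import sqlite3

# Tables named in this thread; the name is interpolated into SQL below,
# so it is checked against this list first.
FEATURE_TABLES = {"word_counts", "tone_measure", "fog_measure"}


def delete_placeholder_rows(conn, table):
    """Delete speaker_number == 0 rows and return how many were removed."""
    if table not in FEATURE_TABLES:
        raise ValueError(f"unexpected table: {table}")
    cur = conn.execute(f"DELETE FROM {table} WHERE speaker_number = 0")
    conn.commit()
    return cur.rowcount


def unprocessed_files(conn, table):
    """List file_names whose latest version is absent from the feature table
    (assumed rule for deciding what still needs processing)."""
    if table not in FEATURE_TABLES:
        raise ValueError(f"unexpected table: {table}")
    sql = f"""
        WITH latest AS (
            SELECT file_name, MAX(last_update) AS last_update
            FROM calls
            GROUP BY file_name
        )
        SELECT l.file_name
        FROM latest AS l
        WHERE NOT EXISTS (
            SELECT 1 FROM {table} AS t
            WHERE t.file_name = l.file_name
              AND t.last_update = l.last_update
        )
        ORDER BY l.file_name
    """
    return [row[0] for row in conn.execute(sql)]


def build_demo_db():
    """Build hypothetical stand-in tables (not real StreetEvents data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE calls (file_name TEXT, last_update TEXT)")
    conn.execute(
        "CREATE TABLE word_counts "
        "(file_name TEXT, last_update TEXT, speaker_number INTEGER)"
    )
    conn.executemany(
        "INSERT INTO calls VALUES (?, ?)",
        [("A", "2020-02-01"), ("B", "2020-01-15")],
    )
    # Call A only has a speaker_number == 0 placeholder row.
    conn.executemany(
        "INSERT INTO word_counts VALUES (?, ?, ?)",
        [("A", "2020-02-01", 0), ("B", "2020-01-15", 1)],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    print(unprocessed_files(conn, "word_counts"))        # placeholder hides A
    print(delete_placeholder_rows(conn, "word_counts"))  # rows removed
    print(unprocessed_files(conn, "word_counts"))        # A is picked up again
```

The point of the ordering is that the placeholder row must be removed first; otherwise the affected call still looks processed and a re-run skips it.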
Waiting for #23 to decide whether we want to change word_counts_run.py before re-running it for these calls.
No need to wait for #23. Just run it again. We can come back to this after #23 and run again on the incremental calls (should be easy to do that then).
Sure. Then I will run it again later today (probably tonight). It should be quick.
This comment is created to keep track of the progress:
Updated 2020-04-30 00:18:35 AEST: word_counts (word_counts_run.py)
Updated 2020-04-30 21:57:10 AEST: tone_measure (tone_measure_run.py)
Updated 2020-05-01 23:54:24 AEST: fog_measure (fog_run.py)
After deleting the speaker_number == 0 entries and re-running the code, the number of speaker_number == 0 entries has decreased (from ~7300 to ~1900), but there are still quite a few.
@iangow It seems that (one of) the issue(s) here is that the records of these affected calls can be found in the streetevents.calls table, but not in the streetevents.speaker_data table. See below:
library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(reprex)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO streetevents, se_features")
word_counts <- tbl(pg, "word_counts")
speaker_data <- tbl(pg, "speaker_data")
calls <- tbl(pg, "calls")
affected_calls <- word_counts %>%
  filter(speaker_number == 0) %>%
  select(file_name, last_update)
affected_calls %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#> n
#> <int64>
#> 1 1939
affected_calls %>%
  anti_join(speaker_data, by = "file_name") %>%
  count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#> n
#> <int64>
#> 1 1898
affected_calls %>%
  anti_join(calls, by = "file_name") %>%
  count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres [yanzih1@10.101.13.99:5432/crsp]
#> n
#> <int64>
#> 1 0
Created on 2020-05-04 by the reprex package (v0.3.0)
@iangow When I was merging word_counts with speaker_data, I found that there were speaker_number == 0 entries in word_counts, which is really weird.
As you can see below, I've tested this on different se_features tables and it seems that this issue can be found in multiple tables, e.g., word_counts, tone_measure, fog_measure, etc.
Created on 2020-04-19 by the reprex package (v0.3.0)