hrbrmstr / newsflash

Tools to Work with the Internet Archive and GDELT Television Explorer in R

Error in query_tv when not using default filter_network() #7

Open pssguy opened 7 years ago

pssguy commented 7 years ago

When I run this code, slightly amended from the blog post:

library(newsflash)
library(ggalt)  
library(hrbrmisc) 
library(DT)
library(plotly)
library(tidyverse)
starts <- seq(as.Date("2015-01-01"), (as.Date("2017-01-26")-30), "30 days") # splitting into 30 day chunks 25
ends <- as.character(starts + 29)
ends[length(ends)] <- ""

pb <- progress_estimated(length(starts))  # from dplyr takes app 1min
emails <- map2(starts, ends, function(x, y) {
  pb$tick()$print()
  query_tv("clinton", "email,emails,server", timespan="custom", start_date=x, end_date=y, filter_network = "AFFNETALL") 
})

This appears in the console

|====                                                                                                        |  4% ~1 s remaining     
No results found
|========                                                                                                    |  8% ~5 s remaining     
No results found
|========================================================                                                    | 52% ~9 s remaining     
Error: lexical error: inside a string, '\' occurs before a character which it may not.
          h! `/xx tt4w`t2n`qt'' mnh! `_\8 tt4w`t2n`qt'' nz(l `-'8 tt4w
                     (right here) ------^
sessionInfo:

```
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_Canada.1252
[2] LC_CTYPE=English_Canada.1252
[3] LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_Canada.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets
[6] methods   base

other attached packages:
 [1] dplyr_0.5.0        purrr_0.2.2
 [3] readr_1.0.0        tidyr_0.6.1
 [5] tibble_1.2         tidyverse_1.1.1
 [7] plotly_4.5.6.9000  DT_0.2
 [9] hrbrmisc_0.2.0     fastmatch_1.1-0
[11] ggalt_0.4.0        ggplot2_2.2.1.9000
[13] newsflash_0.4.2

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9         lubridate_1.6.0
 [3] lattice_0.20-34     assertthat_0.1
 [5] digest_0.6.12       proj4_1.0-8
 [7] psych_1.6.12        R6_2.2.0
 [9] plyr_1.8.4          httr_1.2.1
[11] seleniumPipes_0.3.7 readxl_0.1.1
[13] lazyeval_0.2.0      curl_2.3
[15] extrafontdb_1.0     whisker_0.3-2
[17] Matrix_1.2-8        devtools_1.12.0
[19] extrafont_0.17      tidytext_0.1.2
[21] stringr_1.2.0       foreign_0.8-67
[23] htmlwidgets_0.8     munsell_0.4.3
[25] broom_0.4.2         modelr_0.1.0
[27] janeaustenr_0.1.4   base64enc_0.1-3
[29] mnormt_1.5-5        htmltools_0.3.5
[31] viridisLite_0.1.3   withr_1.0.2
[33] MASS_7.3-45         SnowballC_0.5.1
[35] grid_3.3.2          txtplot_1.0-3
[37] nlme_3.1-131        jsonlite_1.3
[39] Rttf2pt1_1.3.4      gtable_0.2.0
[41] DBI_0.6             magrittr_1.5
[43] formatR_1.4         scales_0.4.1
[45] tokenizers_0.1.4    KernSmooth_2.23-15
[47] stringi_1.1.2       reshape2_1.4.2
[49] xml2_1.1.1          ash_1.0-15
[51] RColorBrewer_1.1-2  tools_3.3.2
[53] forcats_0.2.0       hms_0.3
[55] maps_3.1.1          parallel_3.3.2
[57] colorspace_1.3-2    rvest_0.3.2
[59] memoise_1.0.0       knitr_1.15.1
[61] haven_1.0.0
```

yeedle commented 7 years ago

There's an issue with the JSON returned for your query over the timespan 2015-12-27 to 2016-01-25 (the 13th of the 25 chunks). In other words, GDELT is returning invalid JSON.
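
Until the API side is sorted out, one way to keep a single bad chunk from aborting the whole run is to wrap the call with `purrr::possibly()`. The sketch below is only an illustration: `fetch_chunk()` is a hypothetical stand-in for `query_tv()` that fails on one chunk the way a malformed response would, so the example runs without hitting GDELT.

```r
library(purrr)

# Hypothetical stand-in for query_tv(): errors on one chunk, as a
# malformed JSON response would, and returns a small list otherwise.
fetch_chunk <- function(start, end) {
  if (start == "2015-12-27") stop("lexical error: invalid JSON")
  list(start = start, end = end)
}

# possibly() wraps a function so that it returns `otherwise` instead of erroring
safe_fetch <- possibly(fetch_chunk, otherwise = NULL)

# Three 30-day chunks, kept as character strings
starts <- as.character(seq(as.Date("2015-11-27"), by = "30 days", length.out = 3))
ends   <- as.character(as.Date(starts) + 29)

results <- map2(starts, ends, safe_fetch)

# Positions of the chunks that failed, so they can be retried or dropped later
failed <- which(map_lgl(results, is.null))
```

With the real function, `possibly(query_tv, otherwise = NULL)` should slot in the same way; the failed chunks then show up as `NULL` entries instead of stopping the loop.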

hrbrmstr commented 7 years ago

Beat me to it, @Yeedle ;-) I compensated for some of this with https://github.com/hrbrmstr/newsflash/blob/master/R/newsflash.r#L131 (despite httr using similar methods, some of its post-processing was causing other data loss), but the API has issues. If you do a similar query on the web site, do you get decent JSON after downloading? If so, I'm going to be almost stumped, since this is just calling the same thing their browser clicky bits do.

pssguy commented 7 years ago

OK, I tried removing that time period with

starts <- starts[-13]
ends <- ends[-13]

pb <- progress_estimated(length(starts))  # from dplyr takes app 1min
emails <- map2(starts, ends, function(x, y) {
  pb$tick()$print()
  query_tv("clinton", "email,emails,server", timespan="custom", start_date=x, end_date=y, filter_network = "AFFNETALL")
})

|==                                                                  |  4% ~5 m remaining     
No results found
|=====                                                               |  8% ~3 m remaining     
No results found
|====================================================================|100% ~0 s remaining     
> clinton_timeline <- map_df(emails, "timeline") #4836
Error: `x` must be a vector (not a NULL)

I have not updated the package since yesterday.

yeedle commented 7 years ago

Hmm, that seems like an issue with map_df. Try clinton_timeline <- map_df(emails, ~.x[["timeline"]]) (I know it should be the same, but it worked for me this way).

yeedle commented 7 years ago

Oh, I see now. The issue is that the first two elements of emails are NULL. It seems map_df doesn't know how to deal with NULL elements when it's only provided a character as .f.
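
Another option that avoids the question entirely is to drop the NULL elements before extracting anything. This is a rough sketch, not tested against the real query results: the `emails` list below is made-up stand-in data, and the `date_start` and `value` column names are assumptions for illustration only.

```r
library(purrr)
library(dplyr)

# Made-up stand-in for `emails`: NULL where a query printed "No results found",
# otherwise a list holding a `timeline` data frame (structure assumed).
emails <- list(
  NULL,
  NULL,
  list(timeline = data.frame(date_start = as.Date("2015-03-02"), value = 10)),
  list(timeline = data.frame(date_start = as.Date("2015-04-01"), value = 25))
)

# compact() drops NULL (and zero-length) elements, so the extraction
# step only ever sees chunks that actually returned results.
clinton_timeline <- emails %>%
  compact() %>%
  map(~ .x[["timeline"]]) %>%
  bind_rows()
```

Here `clinton_timeline` ends up with one row per non-empty chunk, and the empty chunks never reach the extraction step.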

pssguy commented 7 years ago

@Yeedle Thanks for the alternative. It does work for me. I'm not that well-informed on purrr, so I'm not quite sure what "only provided a character as .f" means. Does this suggest a bug?

yeedle commented 7 years ago

@pssguy Not sure if it's a bug, but to me it's inconsistent behavior, unless there's something I missed about purrr. I filed it as an issue: https://github.com/tidyverse/purrr/issues/306