hrbrmstr / newsflash

Tools to Work with the Internet Archive and GDELT Television Explorer in R

amazing R tool! can we get intraday timestamps? #1

Open randomgambit opened 7 years ago

randomgambit commented 7 years ago

Hello @hrbrmstr, this is great!

I just wonder: is there any way to query the data at the intraday level, or to get any sort of intraday timestamps?

Thanks!

randomgambit commented 7 years ago

For instance, could you aggregate the counts at the hour level instead of the daily level? That would help match the data more precisely with data coming from other time zones.

hrbrmstr commented 7 years ago

except all I'm doing is calling anytime::anytime(date_start) (etc) and the result of that call is returning only day resolution. lemme look at the raw API values tho

hrbrmstr commented 7 years ago

20161212T000000Z and 20161212T235959Z are examples of the start/end times for the timeline structure, so you're out of luck there. But 20161221T050000Z is what comes back for show_date in the top_matches structure, and anytime is not converting that properly, so lemme see what I can do for at least that one.
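To illustrate the format issue: a compact timestamp such as 20161221T050000Z can be parsed in base R with an explicit format string. This is only a sketch, not necessarily how newsflash 0.3.1 handles it; `parse_gdelt_ts()` is a hypothetical helper name, and treating the trailing Z as UTC is an assumption.

```r
# Hypothetical helper (not part of newsflash): parse compact
# "YYYYMMDDTHHMMSSZ" strings into POSIXct, assuming the "Z" means UTC.
parse_gdelt_ts <- function(x) {
  as.POSIXct(x, format = "%Y%m%dT%H%M%SZ", tz = "UTC")
}

show_date <- parse_gdelt_ts("20161221T050000Z")
print(show_date)
```

Strings that do not match the format come back as NA rather than raising an error, so it is worth checking for NAs after parsing.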

randomgambit commented 7 years ago

thanks! if you don't find any workaround, then emailing the guy at GDELT could be a solution, I guess

hrbrmstr commented 7 years ago

show_date in top_matches should have hms resolution now in 0.3.1, which I just pushed. The other structures don't have such resolution.

dplyr::glimpse(df$top_matches)
## Observations: 1,000
## Variables: 8
## $ preview_url   <chr> "https://archive.org/details/FBC_20161223_140000_Varney__Company#start/...
## $ ia_show_id    <chr> "FBC_20161223_140000_Varney__Company", "CNNW_20161128_180000_Wolf", "FO...
## $ date          <date> 2016-12-23, 2016-11-28, 2016-12-27, 2016-12-23, 2016-12-20, 2016-11-29...
## $ station       <chr> "FOX Business", "CNN", "FOX News", "FOX Business", "FOX Business", "FOX...
## $ show          <chr> "Varney  Company", "Wolf", "FOX  Friends", "Varney  Company", "Making M...
## $ show_date     <dttm> 2016-12-23 14:00:00, 2016-11-28 18:00:00, 2016-12-27 11:00:00, 2016-12...
## $ preview_thumb <chr> "https://archive.org/download/FBC_20161223_140000_Varney__Company/FBC_2...
## $ snippet       <chr> "only at td ameritrade. the berlin terror suspect is debt. what else ha...
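With hms resolution in show_date, the hour-level aggregation asked about earlier can be approximated from top_matches. The sketch below uses a small toy data frame (made-up rows, not real API output) and assumes show_date is in UTC. Note that show_date appears to be the program's start time (it matches the time embedded in ia_show_id), so this bins matches by show start, not by the moment of the caption.

```r
# Toy stand-in for df$top_matches; rows are illustrative, not real results.
top_matches <- data.frame(
  station = c("FOX Business", "CNN", "FOX News", "FOX Business"),
  show_date = as.POSIXct(
    c("2016-12-23 14:00:00", "2016-11-28 18:00:00",
      "2016-12-27 11:00:00", "2016-12-23 14:00:00"),
    tz = "UTC"  # assumption: timestamps are UTC
  ),
  stringsAsFactors = FALSE
)

# Label each match with its hour bin, then count matches per bin.
top_matches$hour <- format(top_matches$show_date, "%Y-%m-%d %H:00", tz = "UTC")
hourly <- as.data.frame(
  table(hour = top_matches$hour),
  responseName = "n",
  stringsAsFactors = FALSE
)
print(hourly)
```

Because top_matches is capped by the API, these counts describe only the returned matches, not total coverage.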

randomgambit commented 7 years ago

amazing! I am looking at your documentation and I am not sure what top_matches returns for a given request. For instance, if I search for hrbrmstr over 2015, what is the output of top_matches? The days with the most counts?

hrbrmstr commented 7 years ago

That's a gd GDELT/Internet Archive TV search question. I'm assuming (from various testing) that it's the caption text from the top "n" matches (for large date ranges it maxes out at 1K) out of all of the other possible ones it could return. You won't get more than that from the API tho.

randomgambit commented 7 years ago

that's great. Thanks again for your help. I'll play a bit with this for a while. But the raw data has to be somewhere, right?

hrbrmstr commented 7 years ago

It depends on what GDELT & IA put in their DB. You can clone the code and return the JSON before it gets processed, and you'll see that the other structures don't have the resolution you want. Or go to the GUI web interface on their site, generate CSVs and JSON, and validate there, too.

randomgambit commented 7 years ago

@hrbrmstr coincidence? http://blog.gdeltproject.org/television-explorer-hourly-timeline-boolean-or-and-increased-json-cap/

:D

randomgambit commented 7 years ago

but as you can see, the data can only be downloaded over a 7-day period. It would be amazing if your package could take a date range as an input, break it down into 7-day slices, download the data for each week and then combine everything into a tibble.

That would allow everyone to recover the full intraday history. What do you think? Is that doable on your side?

Thanks again!

hrbrmstr commented 7 years ago

+100 for the heads-up on their API changes. #ty!!!

Step 1 was making it work with the new API changes ;-) Longer results were causing errors in httr so I had to remove it and use curl. Also, there are issues with the JSON being returned (embedded NULLs) in large result sets so I had to handle that as well.
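For readers hitting the same problem, here is one way to handle embedded NULs, assuming "embedded NULLs" means literal 0x00 bytes in the response body (an assumption; the package's actual fix is not shown in this thread). The raw vector is a made-up payload standing in for a fetched response body.

```r
# Made-up payload with a stray NUL byte in the middle; a real body would
# be the raw vector returned by the HTTP client.
raw_body <- c(charToRaw('{"a":'), as.raw(0), charToRaw('1}'))

# rawToChar() fails on embedded NULs, so drop the 0x00 bytes first.
clean_body <- raw_body[raw_body != as.raw(0)]
json_txt <- rawToChar(clean_body)
print(json_txt)
```

The cleaned string can then be handed to whichever JSON parser is in use.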

Rather than have the main function intuit caller intentions, I'll probably add a helper function to do the date breaks as suggested IF they don't change their API again soon (I'll give them some time to let the dust settle on these changes)
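A minimal sketch of what such a date-break helper could look like, in base R. `week_slices()`, `fetch_range()` and `fetch_fun` are hypothetical names, not part of newsflash; `fetch_fun` stands in for whatever function performs the query for one window and returns a data frame.

```r
# Split [start, end] into consecutive windows of at most `days` days.
week_slices <- function(start, end, days = 7L) {
  start <- as.Date(start)
  end <- as.Date(end)
  stopifnot(start <= end)
  starts <- seq(start, end, by = paste(days, "days"))
  ends <- pmin(starts + (days - 1L), end)  # clip the last window to `end`
  data.frame(start = starts, end = ends)
}

# Call `fetch_fun(start, end)` once per window and row-bind the results.
fetch_range <- function(start, end, fetch_fun) {
  slices <- week_slices(start, end)
  parts <- lapply(seq_len(nrow(slices)), function(i) {
    fetch_fun(slices$start[i], slices$end[i])
  })
  do.call(rbind, parts)
}

# Stand-in fetcher that only echoes the window it was asked for.
fake_fetch <- function(start, end) data.frame(from = start, to = end)

slices <- week_slices("2017-01-01", "2017-01-20")
combined <- fetch_range("2017-01-01", "2017-01-20", fake_fetch)
print(combined)
```

Keeping the slicing separate from the fetch keeps the main query function simple, in line with not having it intuit caller intentions. A real version would likely also want a pause between requests.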