reidpr / quac

QUAC ("quantitative analysis of chatter" or any related acronym you like) is a package for acquiring and analyzing social Internet content. Docs are online at http://reidpr.github.io/quac.
Apache License 2.0

Add CDC log transforms #110

Open gfairchild opened 8 years ago

gfairchild commented 8 years ago

We've received some (non-publicly available) CDC time series that include the number of hits per day per page per region. This directly relates to how the Wikipedia time series are provided (number of hits per hour per article per language). It therefore makes sense to transform the CDC data into the same SQLite format that the Wikipedia data get transformed into so that we can use tssearch to pull the time series efficiently.
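As a rough illustration of the transform, here is a minimal sketch that groups per-day hit counts into one vector per region/page pair and stores each vector as a blob in SQLite. The table name `series`, the `region+page` key, the row layout of the input, and the helper names are all assumptions made for this example; they are not QUAC's actual schema or the format tssearch reads.

```python
import sqlite3
from array import array
from collections import defaultdict
from datetime import date


def load_series(conn, rows, year):
    """Store one year of hits as one vector per region/page pair.

    rows is an iterable of (iso_date, region, page, hits) tuples
    (a hypothetical input layout). Returns the vector length in days.
    """
    first = date(year, 1, 1)
    length = (date(year + 1, 1, 1) - first).days  # 365 or 366
    series = defaultdict(lambda: array("d", [0.0] * length))
    for iso_date, region, page, hits in rows:
        day = date.fromisoformat(iso_date)
        if day.year != year:
            continue  # this database only covers one year
        series[f"{region}+{page}"][(day - first).days] += hits
    # Hypothetical schema: one row per series, vector stored as raw doubles.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS series "
        "(name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO series VALUES (?, ?)",
        [(name, values.tobytes()) for name, values in series.items()],
    )
    conn.commit()
    return length


def read_series(conn, name):
    """Return the stored vector for name, or None if it is absent."""
    row = conn.execute(
        "SELECT data FROM series WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return None
    values = array("d")
    values.frombytes(row[0])
    return values
```

Keeping one row per series means a lookup by name pulls the whole year in a single query, which is the access pattern the issue is after.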

gfairchild commented 8 years ago

I noticed one difference between how I'm handling the CDC logs and how the Wikipedia logs are handled. For the current year's CDC log database, I rename the file whenever new data are added. Each file contains up to a year's worth of data, and the file name includes the series length, so renaming keeps tssearch from returning a ton of blank data for dates that don't have data yet. The Wikipedia data, however, don't do this and instead display the blank data.

@reidpr, how do you feel about this decision?

gfairchild commented 8 years ago

Actually, you can ignore my last comment. After some more testing, it turns out that incrementally updating the length won't work because the data array won't be resized appropriately. I'm fixing my CDC processing code to mirror the Wikipedia processing code and just use a fixed length for each database (for the CDC data, the length will be the number of days in each year).
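For reference, the fixed-length approach could look something like the sketch below: each series is allocated once at the full number of days in its year, and new data are written into the slot for their day, so the array never needs resizing. The helper names are hypothetical, and filling not-yet-observed days with 0.0 is an assumption of this sketch rather than what the Wikipedia code necessarily does.

```python
import calendar
from array import array
from datetime import date


def days_in_year(year):
    """Fixed series length for a yearly database."""
    return 366 if calendar.isleap(year) else 365


def empty_series(year):
    """Allocate the whole year up front; unseen days stay at 0.0."""
    return array("d", [0.0] * days_in_year(year))


def add_hits(series, day, hits):
    """Add hits to the slot for day and return that slot's index."""
    index = day.timetuple().tm_yday - 1  # Jan 1 maps to index 0
    series[index] += hits
    return index
```

With this layout, a partially filled current year simply has trailing zeros, which matches the blank data that the Wikipedia databases already show.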