gfairchild opened 8 years ago
I noticed one thing I'm doing differently in the way I'm handling the CDC logs vs. how the Wikipedia logs are handled. In the current year's CDC log database, I rename the file when new data are added. Because each file contains up to a year's worth of data, and because the file names encode the length, renaming keeps `tssearch` from returning a ton of blank data for dates that don't have data yet. The Wiki data, however, aren't handled this way and instead display the blank data.
@reidpr, how do you feel about this decision?
Actually, you can ignore my last comment. It turns out, after some more testing, that incrementally updating the length won't work because the data array won't be resized appropriately. I'm fixing my CDC processing code to mirror the Wikipedia processing code and just use a fixed length for each database (in the case of the CDC data, the length will be the number of days in each year).
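For the record, here's a rough sketch of the fixed-length idea (function names are hypothetical, not the actual processing code): pre-allocate one slot per day of the year so the array never needs resizing as new data arrive, and write each day's count into its fixed position.

```python
import calendar
from datetime import date

def make_year_series(year):
    """Return a zero-filled daily count array sized for the whole year."""
    days = 366 if calendar.isleap(year) else 365
    return [0] * days

def record_hits(series, day, hits):
    """Write one day's hit count into its fixed slot (day is a datetime.date)."""
    series[day.timetuple().tm_yday - 1] = hits

series = make_year_series(2016)   # 2016 is a leap year, so 366 slots
record_hits(series, date(2016, 3, 1), 42)
```

Days without data simply stay zero, which matches how the Wikipedia databases behave.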
We've received some (non-publicly available) CDC time series that include the number of hits per day per page per region. This directly parallels how the Wikipedia time series are provided (number of hits per hour per article per language). It therefore makes sense to transform the CDC data into the same SQLite format that the Wikipedia data get transformed into, so that we can use `tssearch` to pull the time series efficiently.
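To make the idea concrete, here's a minimal sketch of what storing fixed-length daily series in SQLite could look like. This is an illustration only, not `tssearch`'s actual schema: the table name, columns, and packing format are all assumptions.

```python
import sqlite3
import struct

# Hypothetical schema: one row per (page, region) pair, with the
# fixed-length daily hit counts packed as a little-endian uint32 blob.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE timeseries (
        page   TEXT NOT NULL,
        region TEXT NOT NULL,
        counts BLOB NOT NULL,  -- 365/366 packed uint32s, one per day
        PRIMARY KEY (page, region)
    )
""")

def store(page, region, counts):
    """Pack a year's daily counts and upsert them for this page/region."""
    blob = struct.pack("<%dI" % len(counts), *counts)
    conn.execute("INSERT OR REPLACE INTO timeseries VALUES (?, ?, ?)",
                 (page, region, blob))

def load(page, region):
    """Unpack the stored blob back into a list of daily counts."""
    row = conn.execute(
        "SELECT counts FROM timeseries WHERE page = ? AND region = ?",
        (page, region)).fetchone()
    return list(struct.unpack("<%dI" % (len(row[0]) // 4), row[0]))

store("flu", "US-NM", [0] * 365)
```

Because every series for a given year has the same length, a reader can index any day directly without scanning, which is what makes the pull efficient.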