Closed seanshahkarami closed 6 years ago
We may also want to consider compressing the data as part of the process to further reduce storage space and the size of the data we serve.
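As a quick sketch of the compression step (assuming gzip, which is just one reasonable choice since most HTTP clients accept it and the files could be served as-is with `Content-Encoding: gzip`):

```python
import gzip

def compress_dataset(data: bytes) -> bytes:
    # Highest compression level trades a little CPU at build time for
    # smaller files; fine for a periodic batch rebuild.
    return gzip.compress(data, compresslevel=9)

def decompress_dataset(blob: bytes) -> bytes:
    return gzip.decompress(blob)
```

Repetitive CSV-style sensor exports tend to compress very well, so the space savings here could be substantial.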
Among other things, we may want to add an auxiliary table to Cassandra tracking the last time a dataset was updated. This could make syncing and rebuilds much more efficient.
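The sync logic that table would enable might look something like this (hypothetical sketch; plain dicts stand in for the Cassandra table and the build tracker):

```python
from datetime import datetime

def datasets_needing_rebuild(last_updated: dict, last_built: dict) -> list:
    """Return dataset IDs whose data changed since their last build.

    last_updated maps dataset ID -> time of the most recent write (what
    the proposed auxiliary table would track); last_built maps dataset
    ID -> time its static copy was last generated.
    """
    stale = []
    for dataset_id, updated in last_updated.items():
        built = last_built.get(dataset_id)
        if built is None or updated > built:
            stale.append(dataset_id)
    return stale
```

Only the stale datasets get rebuilt, so a sync pass that touches nothing new becomes nearly free.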
I had a chance to finish a prototype after work this evening. I'm pretty happy with the performance and think it's worth moving this forward if everyone's on board.
One possible improvement is to build a static version of beehive which is regenerated on a schedule. This would dramatically improve page serving performance across the board and give us some room to add sanitization to the datasets until we've cleaned up the inconsistencies.
This also has the side effect of completely eliminating direct database access for datasets from the outside world, and so could head off any security mistakes that show up. (Though this really shouldn't be a problem...)
I think this is still worth prototyping, even though we now have nginx performing caching and have moved off the development server. As an example, the build-index tool in the data-exporter generates a "friendly" summary of all the datasets to make sure things look reasonable.
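For a rough idea of the kind of "friendly" overview such a summary tool might produce (this is an illustrative format, not the build-index tool's actual output):

```python
def summarize_datasets(datasets: dict) -> str:
    """Render a short, human-readable overview of all datasets so a
    reviewer can eyeball whether things look reasonable."""
    lines = []
    for name in sorted(datasets):
        records = datasets[name]
        lines.append(f"{name}: {len(records)} records")
    return "\n".join(lines)
```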