rfaulkner / wikipedia_user_metrics

Wikimedia Foundation E3 Team Analysis Code

Hash only clean URLs - rebuild new requests off of stored data if not otherwise specified #28

Closed rfaulkner closed 11 years ago

rfaulkner commented 11 years ago

Currently the RESTful interface hashes requests on unclean URLs [1], that is, on the full URL including its query string:

e.g. http://metrics-api.wikimedia.org/metrics/ryan_test/edit_rate?date_start=20100101000000

This scheme will lead to a large number of separately stored requests, since every distinct combination of query parameters gets its own hash entry. To solve this problem, hashing will be implemented such that only clean URLs are hashed, with the data under each one organized by class of request, for example {raw requests, aggregate requests, time-series requests, cohort aggregate requests}.
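As a rough illustration of the split described above, the sketch below separates a request URL into the clean part (the candidate hash key) and its query parameters. The helper name `split_request_url` is hypothetical and not part of the project; it uses only the Python 3 standard library.

```python
from urllib.parse import urlsplit, parse_qsl


def split_request_url(url):
    """Return (clean_url, params) for a request URL.

    clean_url is the URL without its query string and is what would be hashed;
    params holds the query arguments that select data within that entry.
    Hypothetical helper, for illustration only.
    """
    parts = urlsplit(url)
    clean_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    params = dict(parse_qsl(parts.query))
    return clean_url, params


clean_url, params = split_request_url(
    "http://metrics-api.wikimedia.org/metrics/ryan_test/edit_rate"
    "?date_start=20100101000000"
)
print(clean_url)
print(params)
```

With this split, two requests that differ only in date_start would share one hash entry instead of creating two.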

Any new request will use whatever is already contained in the existing hash and build only the data it needs that is missing from the hash. Subsequently, any new data generated will be added to the hash. The refresh flag will still override existing hashed data, as before.
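A minimal sketch of that lookup behaviour, assuming the hash is a plain in-memory dict keyed by clean URL. The names `fetch`, `wanted_keys` and `compute` are hypothetical and only stand in for the real request processing.

```python
def fetch(cache, clean_url, wanted_keys, compute, refresh=False):
    """Return {key: data} for wanted_keys, computing only what the cache lacks.

    cache       -- dict mapping clean URL -> {key: data} (assumed layout)
    wanted_keys -- the pieces of data this request needs
    compute     -- callable that builds the data for one key
    refresh     -- when True, recompute even if the data is already stored
    """
    stored = cache.setdefault(clean_url, {})
    result = {}
    for key in wanted_keys:
        if refresh or key not in stored:
            # Build the missing (or refreshed) piece and add it to the hash.
            stored[key] = compute(key)
        result[key] = stored[key]
    return result


calls = []


def compute(key):
    calls.append(key)  # record which keys actually had to be built
    return key.upper()


cache = {}
fetch(cache, "/metrics/ryan_test/edit_rate", ["a", "b"], compute)
fetch(cache, "/metrics/ryan_test/edit_rate", ["b", "c"], compute)
fetch(cache, "/metrics/ryan_test/edit_rate", ["a"], compute, refresh=True)
print(calls)
```

The second call only builds "c", since "b" is already stored; the third call rebuilds "a" because refresh is set.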

This enhancement requires that there be a definition of request types that map directly to data types. Parameters of such types would include a metric and user cohort. Once a type is determined the remaining parameters specify which data to extract.

[1] http://en.wikipedia.org/wiki/Clean_URL

rfaulkner commented 11 years ago

As a start I will redefine what we store in the hash table:

{ header : header_list, cohort_expr : cohort_gen_timestamp : metric : timeseries : aggregator : date_start : date_end : [ metric_param : ]* : data }

header_str := list(str), list of header values
cohort_expr := str, cohort ID expression
cohort_gen_timestamp := str, cohort generation timestamp (earliest of all cohorts in expression)
metric := str, user metric handle
timeseries := boolean, indicates if this is a timeseries
aggregator := str, aggregator used
date_start := str, start datetime of request
date_end := str, end datetime of request
metric_param := -, optional metric parameters
data := list(tuple), set of data points
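One way to picture the key ordering above is as a record that flattens into an ordered list of keys. This is only a sketch: `RequestKey` and `key_path` are hypothetical names, and encoding each optional metric parameter as a `name=value` string is an assumption, since the definition leaves the form of metric_param open.

```python
from collections import namedtuple

# Field order follows the hash layout given above (hypothetical record type).
RequestKey = namedtuple(
    "RequestKey",
    "cohort_expr cohort_gen_timestamp metric timeseries aggregator "
    "date_start date_end",
)


def key_path(req, metric_params=()):
    """Flatten a request into the ordered list of keys leading to its data.

    metric_params is an iterable of (name, value) pairs; they are sorted so
    that the same parameters always yield the same path (an assumption).
    """
    path = list(req)
    path.extend(f"{name}={value}" for name, value in sorted(metric_params))
    return path


req = RequestKey(
    cohort_expr="ryan_test",
    cohort_gen_timestamp="20130101000000",
    metric="edit_rate",
    timeseries=False,
    aggregator="",
    date_start="20100101000000",
    date_end="20110101000000",
)
path = key_path(req, metric_params=[("t", 24)])
print(path)
```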

rfaulkner commented 11 years ago

Implemented in https://github.com/rfaulkner/E3_analysis/commit/7459888ca1efb557a0c2db3fd64e1a5d233f2c1d and https://github.com/rfaulkner/E3_analysis/commit/7465c8cfe5a71856ae22f509a47d12c420ccf25e

Data Hashing now follows a nested key:value implementation defined by get_data and set_data: https://github.com/rfaulkner/E3_analysis/commit/50301c2bb11e82e56bddb624f60906bf5c6b64f5
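For readers who do not want to open the commit, a nested key:value store of this kind can be sketched as below. This is not the project's get_data and set_data, whose signatures are not shown in this thread; `set_nested` and `get_nested` are hypothetical stand-ins operating on plain dicts.

```python
def set_nested(store, path, data):
    """Store data under the sequence of keys in path, creating levels as needed."""
    node = store
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = data


def get_nested(store, path):
    """Return the data stored under path, or None if any key along it is missing."""
    node = store
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


store = {}
set_nested(store, ["ryan_test", "edit_rate", "20100101000000"], [("user_1", 0.5)])
hit = get_nested(store, ["ryan_test", "edit_rate", "20100101000000"])
miss = get_nested(store, ["ryan_test", "bytes_added", "20100101000000"])
print(hit, miss)
```

Nesting this way lets requests for the same cohort and metric share the upper levels of the structure and differ only in the deeper keys.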

A request type has also been defined that maintains request data in a more convenient form: https://github.com/rfaulkner/E3_analysis/commit/7465c8cfe5a71856ae22f509a47d12c420ccf25e