mozilla / mozanalysis

A library for Mozilla experiments analysis
https://mozilla.github.io/mozanalysis/
Mozilla Public License 2.0

add storage API to speed up data loading #215

Closed jaredsnyder closed 4 months ago

jaredsnyder commented 4 months ago

Fixes #77

I had a hard time figuring out how best to use the API, and given the age of the story it might be out of date. As an experiment, I ran the following code locally on this branch and on main:

test_target = HistoricalTarget(
    experiment_name='test_targ', # A name for the analysis, which is only used in the table names when saving results in BigQuery
    start_date='2022-01-01', # First date for the dummy enrollment period of the analysis
    num_dates_enrollment=28, # Number of days used to check for clients that satisfy conditions in Segments
    analysis_length=7, # Number of days used to calculate metrics for each client in the study
)

df = test_target.get_single_window_data(
    bq_context=bq_context,
    metric_list=[active_hours, uri_count, search_count],
    target_list=[weekday_regular_us]
)

I used the Jupyter %%timeit magic to measure how fast the call runs. On main it ran in 37.4 s ± 8.17 s per loop (mean ± std. dev. of 7 runs, 1 loop each), while on the feature branch it ran in 31.3 s ± 1.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each). Not very scientific, especially since the standard deviation on main is larger than the roughly 6 s difference between the two means, but it looks like the improvement is fairly minor in this case.
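For anyone who wants to repeat the comparison outside a notebook, here is a minimal sketch of the same kind of measurement using only the standard library. The helper names `time_runs` and `format_timing` are made up for illustration, and the output format only approximates what %%timeit prints.

```python
import statistics
import timeit


def time_runs(func, runs=7):
    """Call `func` once per run; return (mean, population std dev) in seconds."""
    # number=1 corresponds to "1 loop each", which suits slow, multi-second calls.
    durations = timeit.repeat(func, repeat=runs, number=1)
    return statistics.mean(durations), statistics.pstdev(durations)


def format_timing(mean, std, runs):
    """Render a summary line in roughly the style %%timeit uses."""
    return (
        f"{mean:.3g} s ± {std:.3g} s per loop "
        f"(mean ± std. dev. of {runs} runs, 1 loop each)"
    )
```

To use it, wrap the call being measured in a zero-argument function, for example `time_runs(lambda: test_target.get_single_window_data(...))`, and run it once on each branch. With only 7 runs and a spread as wide as the one seen on main, more repeats would be needed before reading much into a gap of a few seconds.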

m-d-bowerman commented 4 months ago

I believe the to_dataframe method already does this under the hood, based on the create_bqstorage_client parameter that defaults to true. I guess passing the client explicitly does avoid creating that client each time the to_dataframe method is called.
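As a rough sketch of the difference described here: the snippet below creates one BigQuery Storage read client up front and passes it to every `to_dataframe` call, instead of letting each call create its own. The helper names `make_clients` and `load_dataframe` and the `project_id` argument are hypothetical and not part of mozanalysis; only `bigquery.Client`, `BigQueryReadClient`, `query`, and the `bqstorage_client` parameter come from the Google client libraries.

```python
def make_clients(project_id):
    """Create one BigQuery client and one Storage read client for reuse."""
    # Imported lazily so load_dataframe can be exercised with stand-in objects.
    from google.cloud import bigquery, bigquery_storage

    bq_client = bigquery.Client(project=project_id)
    bqstorage_client = bigquery_storage.BigQueryReadClient()
    return bq_client, bqstorage_client


def load_dataframe(bq_client, bqstorage_client, sql):
    """Run `sql` and download the result through the shared storage client."""
    query_job = bq_client.query(sql)
    # Passing the client explicitly reuses it across calls; leaving it as None
    # relies on create_bqstorage_client (default True) to make one per call.
    return query_job.to_dataframe(bqstorage_client=bqstorage_client)
```

Since both paths end up reading through the Storage API, the only saving from the explicit client would be the per-call client setup, which is consistent with the small difference measured above.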

jaredsnyder commented 4 months ago

> I believe the to_dataframe method already does this under the hood, based on the create_bqstorage_client parameter that defaults to true. I guess passing the client explicitly does avoid creating that client each time the to_dataframe method is called.

Lol think we can just close this PR and the issue then