iOS 2.5 upload audit - Githubissues

MMel099 commented 6 months ago

Morning @biblicabeebli

Hassan has asked me to look into data uploading and if there are noticeable improvements in consistency/volume of any data streams since v2.5.

To do this, I plan to look at file uploading for all the RAs Beiwe ID for one month before the update (Feb 15 - March 15)and one month after (April 15 - May 15). Would it be possible to get the json files with the full upload histories for these users?

Studies server: Yale_Fucito_Young Adult Alcohol - Live Study

Staging server: Michelle Test Study 10.3.2023

tnca5ih4

Staging server: Zhimeng Liu - Beta Test - 2.5.24

g5f4djly

Staging server: Jenny_Prince_Test_Study_11.30.23

966u1cd7

Thanks so much and this is not time sensitive!

biblicabeebli commented 6 months ago

I'm working on building these endpoints, and there's no reason I can't do uploads first.

These will be going up on staging dailyish or possibly even more frequently.

Pushing anything to production servers will be a decision we make with Hassan, see https://github.com/onnela-lab/beiwe-backend/issues/365
the issue for the api development is here: https://github.com/onnela-lab/beiwe-backend/issues/354
I'll try and ping you over here when I update that issue, but please check it too and feel free to ask me for an update here.

biblicabeebli commented 6 months ago

And now it exists!

This is a script that will work for the endpoint, you will need the python requests and orjson libraries installed via pip.

from datetime import datetime
from pprint import pprint

import orjson
import requests

# make a post request to the get-participant-upload-history/v1 endpoint, including the api key,
# secret key, and participant_id as post parameters.
t1 = datetime.now()
print("Starting request at", t1, flush=True)
response = requests.post(
    "https://staging.beiwe.org/get-participant-upload-history/v1/",
    data={
        "access_key": "your key part one",
        "secret_key": "your key part two",
        "participant_id": "some participant id",
        # "omit_keys": "true",
    },
    allow_redirects=False,
)
t2 = datetime.now()
print("Request completed at", t2, "duration:", (t2 - t1).total_seconds(), "seconds")

print("http status code:", response.status_code)
assert 200 <= response.status_code < 300, f"Why is it not a 200? {response.status_code} (if it's a 301 you may have cut off the s in https)"

print("Data should be a bytes object...")
assert isinstance(response.content, bytes), f"Why is it not a bytes? {type(response.content)}"

assert response.content != b"", "buuuuuut its empty."

print("cool, cool... is it valid json?")
imported_json_response = orjson.loads(response.content)
print("json was imported! Most of these endpoints return json lists...")

if isinstance(imported_json_response, list):
    print("it is a list with", len(imported_json_response), "entries!")
    print("\nthe first entry is:")
    pprint(imported_json_response[0])
    print("\nthe last entry is:")
    pprint(imported_json_response[-1])
else:
    print("it is not a list, it is a", type(imported_json_response), "so you will have to inspect it yourself.")

MMel099 commented 5 months ago

This looks great! Going to go ahead and give it a try. Thanks!

biblicabeebli commented 5 months ago

Speed is alright on staging, but when we get it onto production it is going to be S L O W and potentially a problem for database load.

I want to brainstorm ways to reduce the amount of data.

we could drop keys, make it a list of lists instead of a list of dicts. I might make endpoints support this and like have Mano use them.
could drop the +00:00 from the datetimes, it will always be utc. This is an orjson library thing - if there's an option for that I will make that change so just be aware that it might. (orjson is a highly accelerated json parser, it uses some crazy C++ library.)
applying real compression is ... hard, it just is.
I guess we could make optimized binary representations - yuck.
I guess we could cut unnecessary data out of the file names, they don't need to include the participant id (honestly we should be doing that in the database itself).
Am I intentionally going overboard here because I've been up for too long? Yes.

biblicabeebli commented 5 months ago

I did make some of those changes.

+00:00 is now converted to Z.
No more microseconds in the timestamp.
There is now an omit_keys parameter that returns the data as a list of lists instead of as a list of dicts, item ordering is the same as in the dict keys...... uuuuhhhh except those calls to pprint sort the keys, order is not as printed

biblicabeebli commented 5 months ago

Just want to check in on this briefly, brainstorm what you will want to look at.

MMel099 commented 5 months ago

Here's a little something on what I have so far.

I started by looking at upload histories of the three RA's on the staging server. Here is the history for one of the RA's. Another RA showed a similar pattern. The last RA had no data collected before March, making it hard to compare. You can visually see that collection is way improved in early/mid March which is definitely a positive sign!

Next, I want to shift focus to quantifying 'coverage'. Something like "Beiwe has good data collection for XX hours in a day, on average". Still brainstorming how exactly this will look like so if you have any input, let me know! An idea I had is to look at how long gaps are between upload times of consecutive files. The overwhelming majority of these gaps are seconds or milliseconds, indicating good Beiwe data collection.

Here I pulled one RA's upload history for May of 2024. If we consider any gap in consecutive upload times of at least one hour to be 'significant', these are all the gaps that are significant, with units in hours. Notice how a lot of the gaps are right around whole numbers, which I attribute to the heartbeat feature.

Originally, I was thinking that I could add all these gaps up and then divide by the total time in the time period being considered. With the example above, there's about 70 hours of gaps which is equivalent to about 10% of all of May. Therefore, we would conclude that "Beiwe has good data collection 90% of the time, on average".

However, I am not convinced that these gaps are actually a good surrogate for tracking coverage - there may be long gaps that don't actually indicate worse data collection, but just less data TO collect. Would love to hear your thoughts on this.

Sorry, this reply ended up being a lot denser than I anticipated, but hoping it's clear.

biblicabeebli commented 5 months ago

1) Wow. 2) I actually thought this was our internal tracking issue 😅 - but I am very in favor of being as outwardly open, especially with searches to find flaws in the platform.

Some comments that may be useful:

I've done some cleanup over on https://github.com/onnela-lab/beiwe-backend/issues/354 , including listing the new endpoints (this is all still just on staging and subject to change, also looking for feedback). Don't have full documentation yet but you should be able to hit any of those endpoints with the script I threw in. The ones I mention below require the participant_id parameter.
the unix timestamp in the file name is the time the file was created, it should slightly precede the first data point in the file starting on some 2.5 beta version. Prior to 2.5 the app would create their files WAY in advance.
The timestamp is the file upload time, this value may be completely unrelated to the data in the file.
get-participant-heartbeat-history/v1 may be of interest to us, it is the record of when the app checked in - its currently ~duplicated, but configured to hit every 5 minutes. The Reason I wanted to create that data stream is so we could look at things like heartbeat compared to upload times and data gathering, with upload being our ~proxy? view into historical app performance (because it's what we've got).
get-participant-version-history/v1 may be of solid utility, so can we sanity-check that it is working as expected. (Unfortunately v1 of recording this detail was totally broken, real enablement was pretty recent, I think it's on production already at the very least.)

Nothing else from me for now.

jprince127 commented 4 months ago

Hello!

I also was working on some file upload checks. I plotted the number of files being uploaded binned by the hour of the day they are created (in UTC), and found that for GPS, Gyro, and Accelerometer there are significantly more uploads.

Next, I'm working on something similar as Max, where I'm going to bin data collection by the known "sensor on" periods (like for example in GPS we know that there should be about a minute of GPS data being collected then a period of time without data) and then count the discrete number of data collection periods.

biblicabeebli commented 3 months ago

F*ck yeah.

onnela-lab / beiwe-ios

iOS 2.5 upload audit #66