onnela-lab / beiwe-ios

Beiwe is a smartphone-based digital phenotyping research platform. This is the Beiwe iOS app code. The Beiwe2 app is also available on the Apple App Store for use with open-source builds of the Beiwe backend.
https://www.beiwe.org/
BSD 3-Clause "New" or "Revised" License

iOS 2.5 upload audit #66

Open MMel099 opened 3 months ago

MMel099 commented 3 months ago

Morning @biblicabeebli

Hassan has asked me to look into data uploading and whether there are noticeable improvements in the consistency/volume of any data streams since v2.5.

To do this, I plan to look at file uploading for all the RAs' Beiwe IDs for one month before the update (Feb 15 - March 15) and one month after (April 15 - May 15). Would it be possible to get the JSON files with the full upload histories for these users?

Studies server: Yale_Fucito_Young Adult Alcohol - Live Study

Staging server: Michelle Test Study 10.3.2023

Staging server: Zhimeng Liu - Beta Test - 2.5.24

Staging server: Jenny_Prince_Test_Study_11.30.23

Thanks so much! This is not time sensitive.

biblicabeebli commented 3 months ago

I'm working on building these endpoints, and there's no reason I can't do uploads first.

These will be going up on staging dailyish or possibly even more frequently.

biblicabeebli commented 3 months ago

And now it exists!

Here is a script that will work with the endpoint; you will need the Python requests and orjson libraries installed via pip.

```python
from datetime import datetime
from pprint import pprint

import orjson
import requests

# make a post request to the get-participant-upload-history/v1 endpoint, including the api key,
# secret key, and participant_id as post parameters.
t1 = datetime.now()
print("Starting request at", t1, flush=True)
response = requests.post(
    "https://staging.beiwe.org/get-participant-upload-history/v1/",
    data={
        "access_key": "your key part one",
        "secret_key": "your key part two",
        "participant_id": "some participant id",
        # "omit_keys": "true",
    },
    allow_redirects=False,
)
t2 = datetime.now()
print("Request completed at", t2, "duration:", (t2 - t1).total_seconds(), "seconds")

print("http status code:", response.status_code)
assert 200 <= response.status_code < 300, f"Why is it not a 200? {response.status_code} (if it's a 301 you may have cut off the s in https)"

print("Data should be a bytes object...")
assert isinstance(response.content, bytes), f"Why is it not a bytes? {type(response.content)}"

assert response.content != b"", "buuuuuut it's empty."

print("cool, cool... is it valid json?")
imported_json_response = orjson.loads(response.content)
print("json was imported! Most of these endpoints return json lists...")

if isinstance(imported_json_response, list):
    print("it is a list with", len(imported_json_response), "entries!")
    print("\nthe first entry is:")
    pprint(imported_json_response[0])
    print("\nthe last entry is:")
    pprint(imported_json_response[-1])
else:
    print("it is not a list, it is a", type(imported_json_response), "so you will have to inspect it yourself.")
MMel099 commented 3 months ago

This looks great! Going to go ahead and give it a try. Thanks!

biblicabeebli commented 3 months ago

Speed is alright on staging, but when we get it onto production it is going to be S L O W and potentially a problem for database load.

I want to brainstorm ways to reduce the amount of data.

biblicabeebli commented 3 months ago

I did make some of those changes.

biblicabeebli commented 3 months ago

Just want to check in on this briefly and brainstorm what you will want to look at.

MMel099 commented 3 months ago

Here's a little something on what I have so far.

I started by looking at the upload histories of the three RAs on the staging server. Here is the history for one of the RAs; another RA showed a similar pattern. The last RA had no data collected before March, making it hard to compare. You can see visually that collection is much improved in early/mid March, which is definitely a positive sign!

[two plots of upload history]

Next, I want to shift focus to quantifying 'coverage': something like "Beiwe has good data collection for XX hours in a day, on average". I'm still brainstorming exactly what this will look like, so if you have any input, let me know! One idea I had is to look at how long the gaps are between the upload times of consecutive files. The overwhelming majority of these gaps are seconds or milliseconds, indicating good Beiwe data collection.

Here I pulled one RA's upload history for May 2024. If we consider any gap between consecutive upload times of at least one hour to be 'significant', these are all the significant gaps, with units in hours. Notice how many of the gaps are right around whole numbers, which I attribute to the heartbeat feature.
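
For concreteness, here's a minimal sketch of how those gaps can be computed, assuming the endpoint's upload timestamps have already been parsed into datetime objects (the parsing step depends on the exact JSON fields, so it's omitted):

```python
from datetime import datetime, timedelta

def significant_gaps(times: list[datetime], threshold: timedelta = timedelta(hours=1)) -> list[float]:
    """Return the gaps between consecutive upload timestamps that meet the
    threshold, converted to hours."""
    times = sorted(times)
    gaps = (b - a for a, b in zip(times, times[1:]))
    return [g.total_seconds() / 3600 for g in gaps if g >= threshold]
```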

Originally, I was thinking that I could add up all these gaps and then divide by the total time in the period being considered. In the example above, there are about 70 hours of gaps, which is equivalent to about 10% of all of May. Therefore, we would conclude that "Beiwe has good data collection 90% of the time, on average".
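
Building on the sketch above (with `times` again being the assumed parsed timestamp list), the arithmetic is just:

```python
from datetime import datetime

start, end = datetime(2024, 5, 1), datetime(2024, 6, 1)
period_hours = (end - start).total_seconds() / 3600   # 744 hours in May

gap_hours = sum(significant_gaps(times))              # roughly 70 hours in the example
coverage = 1 - gap_hours / period_hours               # roughly 0.90 -> "good collection ~90% of the time"
```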

However, I am not convinced that these gaps are actually a good surrogate for tracking coverage - there may be long gaps that don't actually indicate worse data collection, but just less data TO collect. Would love to hear your thoughts on this.

Sorry, this reply ended up being a lot denser than I anticipated, but hoping it's clear.

biblicabeebli commented 3 months ago

1) Wow. 2) I actually thought this was our internal tracking issue 😅, but I am very much in favor of being as outwardly open as possible, especially with efforts to find flaws in the platform.

Some comments that may be useful:

Nothing else from me for now.

jprince127 commented 2 months ago

Hello!

I have also been working on some file upload checks. I plotted the number of files uploaded, binned by the hour of the day they were created (in UTC), and found that for GPS, Gyro, and Accelerometer there are significantly more uploads since the update.
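
In case it's useful to anyone else, here's a minimal sketch of that binning, assuming the file creation timestamps have been parsed into timezone-aware datetime objects (the field names in the upload history JSON are not shown):

```python
from collections import Counter
from datetime import datetime, timezone

def uploads_by_hour(created_times: list[datetime]) -> Counter:
    """Count uploaded files by the UTC hour of day they were created."""
    return Counter(t.astimezone(timezone.utc).hour for t in created_times)
```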

Next, I'm working on something similar to Max: I'm going to bin data collection by the known "sensor on" periods (for GPS, for example, we know there should be about a minute of data collected, followed by a period without data) and then count the discrete number of data collection periods.
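
One way that counting could work is sketched below: treat samples that are close together as one collection period, and start a new period at a larger gap. The 90-second threshold is illustrative, based on the roughly one-minute GPS on-cycle mentioned above.

```python
from datetime import datetime, timedelta

def count_collection_periods(sample_times: list[datetime],
                             new_period_gap: timedelta = timedelta(seconds=90)) -> int:
    """Count discrete 'sensor on' periods: consecutive samples closer together
    than new_period_gap belong to the same period; a larger gap starts a new one."""
    times = sorted(sample_times)
    if not times:
        return 0
    return 1 + sum(1 for a, b in zip(times, times[1:]) if b - a >= new_period_gap)
```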

biblicabeebli commented 1 month ago

F*ck yeah.