MMel099 opened 6 months ago
I'm working on building these endpoints, and there's no reason I can't do uploads first.
These will be going up on staging dailyish or possibly even more frequently.
And now it exists!
This is a script that will work for the endpoint; you will need the Python requests and orjson libraries installed via pip.
from datetime import datetime
from pprint import pprint
import orjson
import requests
# make a post request to the get-participant-upload-history/v1 endpoint, including the api key,
# secret key, and participant_id as post parameters.
t1 = datetime.now()
print("Starting request at", t1, flush=True)
response = requests.post(
    "https://staging.beiwe.org/get-participant-upload-history/v1/",
    data={
        "access_key": "your key part one",
        "secret_key": "your key part two",
        "participant_id": "some participant id",
        # "omit_keys": "true",
    },
    allow_redirects=False,
)
t2 = datetime.now()
print("Request completed at", t2, "duration:", (t2 - t1).total_seconds(), "seconds")
print("http status code:", response.status_code)
assert 200 <= response.status_code < 300, f"Why is it not a 200? {response.status_code} (if it's a 301 you may have cut off the s in https)"
print("Data should be a bytes object...")
assert isinstance(response.content, bytes), f"Why is it not a bytes? {type(response.content)}"
assert response.content != b"", "buuuuuut its empty."
print("cool, cool... is it valid json?")
imported_json_response = orjson.loads(response.content)
print("json was imported! Most of these endpoints return json lists...")
if isinstance(imported_json_response, list):
    print("it is a list with", len(imported_json_response), "entries!")
    print("\nthe first entry is:")
    pprint(imported_json_response[0])
    print("\nthe last entry is:")
    pprint(imported_json_response[-1])
else:
    print("it is not a list, it is a", type(imported_json_response), "so you will have to inspect it yourself.")
This looks great! Going to go ahead and give it a try. Thanks!
Speed is alright on staging, but when we get it onto production it is going to be S L O W and potentially a problem for database load.
I want to brainstorm ways to reduce the amount of data.
I did make some of those changes. One is an omit_keys parameter that returns the data as a list of lists instead of as a list of dicts; item ordering is the same as in the dict keys...... uuuuhhhh, except those calls to pprint sort the keys, so the order is not as printed.

Just want to check in on this briefly, brainstorm what you will want to look at.
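(Side note on the omit_keys list-of-lists format: turning rows back into dicts might look something like the sketch below. It assumes you take the key order from an entry of a response made without omit_keys - not from pprint output, since pprint sorts the keys.)

def rows_to_dicts(keys, rows):
    # Pair each row (a list of values) with the known key order.
    return [dict(zip(keys, row)) for row in rows]

# hypothetical usage:
# keys = list(full_response_first_entry.keys())  # first entry of a response made WITHOUT omit_keys
# as_dicts = rows_to_dicts(keys, orjson.loads(response.content))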
Here's a little something on what I have so far.
I started by looking at upload histories of the three RAs on the staging server. Here is the history for one of the RAs. Another RA showed a similar pattern. The last RA had no data collected before March, making it hard to compare. You can visually see that collection is way improved in early/mid March, which is definitely a positive sign!
Next, I want to shift focus to quantifying 'coverage'. Something like "Beiwe has good data collection for XX hours in a day, on average". Still brainstorming exactly what this will look like, so if you have any input, let me know! An idea I had is to look at how long the gaps are between upload times of consecutive files. The overwhelming majority of these gaps are seconds or milliseconds, indicating good Beiwe data collection.
Here I pulled one RA's upload history for May of 2024. If we consider any gap in consecutive upload times of at least one hour to be 'significant', these are all the gaps that are significant, with units in hours. Notice how a lot of the gaps are right around whole numbers, which I attribute to the heartbeat feature.
Originally, I was thinking that I could add all these gaps up and then divide by the total time in the time period being considered. With the example above, there's about 70 hours of gaps which is equivalent to about 10% of all of May. Therefore, we would conclude that "Beiwe has good data collection 90% of the time, on average".
However, I am not convinced that these gaps are actually a good surrogate for tracking coverage - there may be long gaps that don't actually indicate worse data collection, but just less data TO collect. Would love to hear your thoughts on this.
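For concreteness, a rough sketch of that calculation (the timestamp field name and format are assumptions about what the upload-history entries contain):

from datetime import datetime, timedelta

def coverage_fraction(upload_times, start, end, threshold=timedelta(hours=1)):
    # Keep only uploads inside the window, in chronological order.
    times = sorted(t for t in upload_times if start <= t <= end)
    # Gaps between consecutive uploads that are at least `threshold` long ("significant" gaps).
    gaps = [b - a for a, b in zip(times, times[1:]) if (b - a) >= threshold]
    total_gap = sum(gaps, timedelta())
    # Fraction of the window NOT taken up by significant gaps.
    return 1 - total_gap / (end - start)

# hypothetical usage, assuming each entry has an ISO-format "timestamp" field:
# upload_times = [datetime.fromisoformat(entry["timestamp"]) for entry in imported_json_response]
# print(coverage_fraction(upload_times, datetime(2024, 5, 1), datetime(2024, 6, 1)))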
Sorry, this reply ended up being a lot denser than I anticipated, but hoping it's clear.
1) Wow. 2) I actually thought this was our internal tracking issue 😅 - but I am very much in favor of being as outwardly open as possible, especially with efforts to find flaws in the platform.
Some comments that may be useful:
participant_id parameter.
get-participant-heartbeat-history/v1 may be of interest to us, it is the record of when the app checked in - it's currently ~duplicated, but configured to hit every 5 minutes. The reason I wanted to create that data stream is so we could look at things like heartbeat compared to upload times and data gathering, with upload being our ~proxy? view into historical app performance (because it's what we've got). (The same script above should work against it with just the URL swapped - see the sketch at the end of this comment.)
get-participant-version-history/v1 may be of solid utility, so can we sanity-check that it is working as expected? (Unfortunately v1 of recording this detail was totally broken; real enablement was pretty recent. I think it's on production already at the very least.)

Nothing else from me for now.
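A hedged sketch of hitting the heartbeat history endpoint - this assumes it takes the same access_key / secret_key / participant_id POST parameters as the upload-history endpoint, which I haven't confirmed here:

import requests

# assumption: same POST parameters as get-participant-upload-history/v1, only the URL changes
response = requests.post(
    "https://staging.beiwe.org/get-participant-heartbeat-history/v1/",
    data={
        "access_key": "your key part one",
        "secret_key": "your key part two",
        "participant_id": "some participant id",
    },
    allow_redirects=False,
)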
Hello!
I was also working on some file upload checks. I plotted the number of files being uploaded, binned by the hour of the day they are created (in UTC), and found that for GPS, Gyro, and Accelerometer there are significantly more uploads.
Next, I'm working on something similar to Max, where I'm going to bin data collection by the known "sensor on" periods (for example, with GPS we know there should be about a minute of GPS data being collected, then a period of time without data) and then count the discrete number of data collection periods.
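A rough sketch of that hour-of-day binning (again, the timestamp field name and format are assumptions about the upload-history entries):

from collections import Counter
from datetime import datetime

def uploads_per_hour_of_day(upload_times):
    # Count files by the UTC hour of day at which they were created.
    return Counter(t.hour for t in upload_times)

# hypothetical usage:
# upload_times = [datetime.fromisoformat(entry["timestamp"]) for entry in imported_json_response]
# counts = uploads_per_hour_of_day(upload_times)
# for hour in range(24):
#     print(f"{hour:02d}:00 UTC -> {counts[hour]} files")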
F*ck yeah.
Morning @biblicabeebli
Hassan has asked me to look into data uploading and whether there are noticeable improvements in consistency/volume of any data streams since v2.5.
To do this, I plan to look at file uploading for all of the RAs' Beiwe IDs for one month before the update (Feb 15 - March 15) and one month after (April 15 - May 15). Would it be possible to get the json files with the full upload histories for these users?
Studies server: Yale_Fucito_Young Adult Alcohol - Live Study
Staging server: Michelle Test Study 10.3.2023
Staging server: Zhimeng Liu - Beta Test - 2.5.24
Staging server: Jenny_Prince_Test_Study_11.30.23
Thanks so much and this is not time sensitive!