The Pipeline is Being Deprecated and Replaced

biblicabeebli commented 4 years ago

This issue is for the discussion of rebuilding, improving, and speccing out changes and features of the Beiwe Data Pipeline.

It is not finished, there are definitely still details to define.

Current state of the pipeline: The Beiwe Analysis repo has a branch,Pipeline, that currently integrates with Beiwe. When the Beiwe Pipeline is set up with this branch it will generate the Beiwe Summary Statistics. However, only GPS, calls, powerstate, and texts are currently functional at all.

There is a bug in GPS (missing date column in generated data). I've received code from the original developer, who says it should fix this. I've gone ahead and added this to the Analysis repo on the pipeline-dev branch, however I'm unsure whether the affected code is directly used anymore.

Pipeline Spec proposal:

1) Move all pipeline deployment code into cluster management.

Fully automate deploying a new pipeline.
Fully automate deletion.
By creating certain files in a folder options for setting up a custom deploy can be implemented.
- (the beiwe-analysis repository will be present as committed, reference configuration files in the repository)
- a list of libraries and programs to install (probably via apt in ubuntu)
- a bash script for arbitrary configuration within the new environment
  - for example (if python) pip installs
- a json settings file containing several key values:
  - repository address to clone.
  - branch name to swap to before executing pipeline code.
  - credentials, if any, for the repository. (Uhg. Would prefer to avoid this. Serializing passwords with special characters allowed is just asking for problems, but we are talking about an arbitrary code runtime environment with access to personally identifiable information.)
  - Some to-be-determined storage configuration detail.
  - AWS server class to run on (c5, m5, etc.).
  - A name for the pipeline for display on the backend.
  - Some indicator of who to contact if there are runtime errors???
- a json settings file listing default custom environment variables for the default varient of this pipeline. Other variants can be created on the web page ui below after the default is deployed.

2) Build new website content for the pipeline By default there is no pipeline configured, and the website will state that when it is true. When pipelines are configured it will have the following details on a panel allowing site admins to edit/view (maybe study admins too):

A pipeline name, set initially by running the setup for a new pipeline.
A panel of options (checkboxes) allowing configuring automatic runs. Options are hourly, daily, weekly... maybe monthly? (Always requires a time of day.)
a link to view individual historical jobs and download their logs (see #121)
a link to live status of pipeline jobs, and jobs waiting. (see #121)
an interface for setting up variants of a codebase that will be present on the

3) Propesed spec for repo structure and runtime environment: The beiwe-analysis repo will be restructured as a reference for how to set up a pipeline repository.

A pipeline must have a git repository.
The repository's code will run inside an Ubuntu environment. (right now we use a 16.04 environment for the beiwe-analysis repo. I'm going to try to make this 18.04.)
In the root of the repository there must be a run_pipeline.sh bash script. This script is executed after the most recent version of the repo is pulled. All output will be directed into a log file, and that log file will be uploaded to the beiwe server when the script exits.
Any custom runtime details are passed as environment variables accessible to the running code.
Credentials to the Beiwe Backend server will be available in specific environment variables.

usanchez commented 4 years ago

Regarding the runtime of scripts I think that it would be awesome if we could:

Choose which endpoints we want to compute (choose if we want to run the Distance travelled statistic or not, if we want the Time spent at home endpoint or not, etc.)
Choose the frequency of analysis for each endpoint: most of the endpoints refer to daily statistics, but maybe we could also add a variable that states the time interval for the analysis e.g. hourly Distance travelled, daily distance travelled, distance travelled every 6 hours, etc.
Choose the frequency of execution: we are already doing this somehow, but I think it would be very good to have the option to choose when each endpoint is computed. It may be important to have some variables almost in real-time (for example, answers to suicidal ideation or thoughts questions) and others we can compute them weekly (e.g. participant recruitment statistics).

What do you think? Thank you very much!

biblicabeebli commented 4 years ago

Choose which endpoints we want to compute (choose if we want to run the Distance travelled statistic or not, if we want the Time spent at home endpoint or not, etc.)

Choose the frequency of analysis for each endpoint: most of the endpoints refer to daily statistics, but maybe we could also add a variable that states the time interval for the analysis e.g. hourly Distance travelled, daily distance travelled, distance travelled every 6 hours, etc.

I'm fairly certain that this is not possible without changing the analysis repo code. I am able to do that because I am not proficient in R. The repo will be modified in a limited way to rework it's structure to make that part of it conform with the proposed spec. My thought is that in order to enable different pipeline run-time that behavior must be custom on a per-pipeline-repository basis. For this reason I am proposing the concept of a pipeline variant, which is simply a run of an existing pipeline customized by the site/study admin with custom environment variables. This is what I meant in section 1:

a json settings file listing default custom environment variables for the default varient of this pipeline. Other variants can be created on the web page ui below after the default is deployed.

Choose the frequency of execution: we are already doing this somehow, but I think it would be very good to have the option to choose when each endpoint is computed. It may be important to have some variables almost in real-time (for example, answers to suicidal ideation or thoughts questions) and others we can compute them weekly (e.g. participant recruitment statistics).

Is this not addressed under section 2, item 2?

usanchez commented 4 years ago

Yeah, that's what I meant, that JSON file can contain the analysis config for each study.
What I understood in 2.2 is that there will be a panel of checkboxes that state the run frequency for the whole study. What I meant is, to have a panel of checkbox for each of the outcomes. Not for the study itself.

JulioV commented 4 years ago

Hi @biblicabeebli @usanchez, we, aka MoSHI's team, are building RAPIDS (source code). We are aiming to unify data cleaning, mining, and analysis for digital phenotyping projects.

It's open source, reproducible, modular, relies on virtual environments for Python and R dependencies, and can run in multiple cores/clusters out of the box. Since each step in the processing chain is independent they can be implemented on most programming languages (we have R and Python scripts for example) and intermediate data processing steps can be easily debugged/audited.

We are focusing on data collected with Aware and Fitbit as that's what we and most of our collaborators use but it should be straight forward to add support for data coming from Beiwe (basically getting the raw data from a Beiwe database and solving any inconsistencies between that and Aware's). Plus, it seems that as things are RAPIDS could be run as a pipeline variant with a web configuration panel like you were discussing.

Potentially, you can take advantage of RAPIDS' structure and design as well as all the features that we have implemented (and their unit testing which is in progress) and we all could put our efforts into adding more features and analysis methods (some anomaly detection algorithms are going to be added soon too). This way there's a single place where the community needs to turn to analyze their mobile data.

Let me know if this sounds like something you and the Beiwe team would be interested in Julio

biblicabeebli commented 4 years ago

@JulioV Hello! Could you provide some high-level overview of any project details of RAPIDS that you think might be particularly relevant? Links to good detailed descriptions would also be helpful, and a direct project link too?

I'm more than a bit cramped for time at this moment, so anything to help me fast-forward on research is very welcome. My priority here is to make the ease of setup for executing custom analysis code and then hooking it into a Beiwe frontend for download/availability as trivial as possible.

(I am sprinting on some feature dev, need to stay focussed on it, but will be returning to this and other todos and research for Beiwe.)

JulioV commented 4 years ago

Sure,

Our docs are the main project site link here. There is more information there about how to install it and what's the overall structure of the project.

Details that might be relevant (for Beiwe integration I assume?):

We use Snakemake under the hood. This means that in RAPIDS you can build pipelines where each step is atomic (e.g. download data, taking care of timezones, etc.) meaning that if something fails or you update a subset of your data/code only the relevant parts will be re-executed. It is also possible to spread the execution of all these steps over multiple cores (and AWS clusters I believe)
There is a single configuration point see file here. It's straightforward to modify it through a web interface.
A RAPIDS pipeline can be cloned from a git repo and deployed automatically. Our system wide dependencies are MySQL, R, Python, and Pandoc see installation instructions. In theory, it should be possible to deploy it using docker as well.

In short, once the code to download and convert Beiwe data to a format compatible with RAPIDS is in place, you could deploy it automatically to a backend and expose all configuration parameters (for participants and features) using a frontend, and then package the results for download.

biblicabeebli commented 4 years ago

Thanks for the discussion on this thread, it was informative. We have worked with Onnela Lab and there is going to be a substantial change to the way any pipeline work is run.

Henceforth, the Pipeline is deprecated. All development on running arbitrary data analysis code will be removed from the Beiwe Backend repository. The technical / administrative complexity was too high, the code was substantially out of date, and it was effectively impossible to make security guarantees.

The Pipeline is to be replaced entirely by a new codebase called Forest. This library is maintained by the Onnela lab development team, and direct integration will be added to the beiwe backend. This data processing will occur on the existing data processing server architecture that is already present as part of the beiwe backend. Updates will be as simple as running the terminate processing server command and then redeploying the most recent version using the launch script. This will guarantee a first class level of support for all data analysis development coming out of Onnela Lab while substantially cutting down on complexity and time to deployment of new features.

Some Critical details: 1) The existing Pipeline code that generates participant summaries will be part of Forest. According to Onnela Lab porting these over is largely accomplished or finished, and there have been several additions and improvements to those data points. This statement is from over a month ago.

2) There is an in-development new data visualization approach using Tableau; no decision has been made on how to, whether to, and for whom these visualizations will be made available, nor what kind of platform additions or changes will be required.

3) Integration of custom 3rd party analysis code will be taking a hiatus until the Forest library is actively in use and supported by the beiwe backend codebase. My general recommendation will be to work in collaboration with Onnela Lab on opening up and adding additional data analysis.

4) The Forest codebase is still limited to the the Onnela Lab team. So, if you have or anticipate an issue due to these changes please post new issues here, and if necessary I will move any issues reported here to the Forest repository at the appropriate time.

If you have a specific question feel free to respond here.

(edit: grammar...)

aware-ravi-bhankharia commented 4 years ago

Hi Eli,

Is it possible for me to beta test Forest? I'm currently attempting to set up the Data Analysis Pipeline since I didn't want to re-do all the feature extraction that was already implemented, but would much prefer to use the up and coming library.

biblicabeebli commented 4 years ago

In principle, sure, that would be great. FYI there is no direct integration to test out yet, and I don't know anything about whether or what you could have access to at this time.

Send me an email intro that I can forward to the Onnela Lab team, they will need to make the decision right now.

Once we are working on the integration component there will be public code to make the integration work, so it will effectively be an open beta that anyone can opt in to just by deploying a particular branch*.

(*And that code will be of beta quality.)

biblicabeebli commented 3 years ago

Update: this work has started, please contact me via email, eli@zagaran.com, if you would like to be involved in testing the new Forest related data analysis. There is no timeline yet, we are just starting.

onnela-lab / beiwe-backend

The Pipeline is Being Deprecated and Replaced #124