This project quantifies the risk of each API endpoint based on security and data sovereignty markers. The repository includes well-annotated Python scripts for (1) data preprocessing and feature engineering and (2) the machine learning pipeline, both located in the `src` folder.
This project is part of the UBC MDS 2022 capstone program, in which the contributors collaborated with TeejLab.
Our proposal can be found here.
Our final report can be found here.
For a high-level summary of the project and the reasoning behind our decisions, please refer to the technical report. Links to the relevant scripts and notebooks are included for easy navigation within the repository.
```
.
├── data                  # Data files
│   ├── preprocessed      # Preprocessed Dataset
│   └── raw               # Raw Dataset
├── docs                  # Final Proposal and Report
│   ├── proposal_book     # Final Proposal
│   └── report_book       # Final Report
├── model                 # Model file
├── notebooks             # Jupyter Notebook files
│   ├── eda               # Notebooks for EDA
│   └── ml                # Notebooks for ML Models
├── src                   # Source files
│   ├── utils             # Utility Functions
│   └── test              # Automated tests
├── reference_material    # Reference Materials
├── LICENSE               # LICENSE File
├── requirements.txt      # Dependencies for CI/CD Workflow
├── Makefile              # Automated Script
├── CODE_OF_CONDUCT.md    # Code of Conduct File
├── CONTRIBUTING.md       # Contributing File
├── env.yml               # Conda Environment File
└── README.md             # README
```
The raw data files belong in the `data/raw` folder.

You can install all the dependencies you need using conda:
```
# Create and activate the environment
conda env create -f env.yml
conda activate api-risk-env
```
Please note that this process will take approximately 10-15 minutes.
The input data must follow a specified format for the script to run without errors. All 18 column headers must be present in the input file; the fields themselves may be left empty. In other words, every column header is required, but not every cell in each row needs to be filled.
A sample of the data format, descriptions of the input columns, and information about mandatory fields are provided in the `raw_data_format.xlsx` file inside the `data/raw` folder.
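As an illustration, the header check described above could be sketched as follows, assuming a CSV export of the data. The function name and the example column names are hypothetical; the actual 18 required headers are listed in `raw_data_format.xlsx`:

```python
import csv

def check_required_headers(csv_path, required_columns):
    """Return the list of required column headers missing from a CSV file.

    Empty cells are acceptable; only the header row is checked.
    """
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader, [])
    return [col for col in required_columns if col not in headers]

# Hypothetical usage with made-up column names:
# missing = check_required_headers("data/raw/input.csv", ["api_endpoint", "country"])
# if missing:
#     raise ValueError(f"Input file is missing required columns: {missing}")
```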
To run the Jupyter notebooks for the analysis, please make sure you have installed all the dependencies from the Environment set up section, then run the following command:

```
jupyter lab
```
Once the Jupyter server is running, go to http://localhost:8888/ and navigate to the `/notebooks` directory.
The Makefile automates the process of training and testing the models with predefined parameters, as well as generating the final report.
To run all scripts:

```
make all
```

To run preprocessing only:

```
make data/processed/preprocessed_train.xlsx
```

To create the model only:

```
make models/model.joblib
```

To predict only:

```
make data/processed/df_predicted.xlsx
```

To generate the final report only:

```
make book.html
```
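The target names above hint at a dependency chain: preprocessing produces the training data, the training data produces the model, the model produces the predictions, and the predictions feed the report. A minimal sketch of how such a Makefile might wire these steps together (the recipes and flag values here are assumptions for illustration, not the project's actual rules; see the real Makefile for details):

```makefile
# Hypothetical sketch of the dependency chain
all: book.html

data/processed/preprocessed_train.xlsx: src/preprocessing.py
	python src/preprocessing.py --output_path=$@ --split_data=True

models/model.joblib: src/create_model.py data/processed/preprocessed_train.xlsx
	python src/create_model.py --train_path=data/processed/preprocessed_train.xlsx --save_path=$@

data/processed/df_predicted.xlsx: src/predict.py models/model.joblib
	python src/predict.py --model_path=models/model.joblib --save_path=$@

book.html: data/processed/df_predicted.xlsx
	jupyter-book build docs/report_book/
```

Because each target depends on the previous step's output, `make all` rebuilds only the stages whose inputs have changed.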
To run preprocessing:

```
python src/preprocessing.py --endpoint_path=<path_to_endpoint> --country_path=<path_to_country> --risk_rules_path=<path_to_risk_rules> --output_path=<path_to_output> --split_data=<bool>
```

To create the model:

```
python src/create_model.py --train_path=<path_to_train_file> --save_path=<path_to_save_file>
```

To predict:

```
python src/predict.py --model_path=<path_to_model> --predict_path=<path_to_predict_file> --save_path=<path_to_save>
```
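The flags above suggest a command-line interface along these lines. This is a hypothetical sketch using `argparse`; the real scripts may parse their options differently:

```python
import argparse

def build_predict_parser():
    """Build an argument parser matching the predict.py flags shown above."""
    parser = argparse.ArgumentParser(
        description="Score API endpoints with a trained model."
    )
    parser.add_argument("--model_path", required=True,
                        help="Path to the saved model (.joblib)")
    parser.add_argument("--predict_path", required=True,
                        help="Path to the input file to score")
    parser.add_argument("--save_path", required=True,
                        help="Where to write the predictions")
    return parser

# Example invocation (paths are illustrative):
# args = build_predict_parser().parse_args([
#     "--model_path=model/model.joblib",
#     "--predict_path=data/preprocessed/test.xlsx",
#     "--save_path=data/processed/df_predicted.xlsx",
# ])
```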
To generate the proposal:

```
jupyter-book build docs/proposal_book/ --builder pdfhtml
```

To generate the final report:

```
jupyter-book build docs/report_book/ --builder pdfhtml
```
To build the GitHub Pages site, first run:

```
jupyter-book build docs/report_book/
```

Then navigate to the `docs/report_book/` folder and run the following command:

```
ghp-import -n -p -f _build/html
```
| Contributors | GitHub |
|---|---|
| Anupriya Srivastava | @Anupriya-Sri |
| Harry Chan | @harryyikhchan |
| Jacqueline Chong | @Jacq4nn |
| Son Chau | @SonQBChau |