johannespischinger / senti_anal

MIT License

Sentiment Analysis

Final project for the MLOps course at DTU.


Read the docs

Project Description

Overall goal of the project:

Build and run a sentiment analysis model using the pretrained model "DistilBERT" from the huggingface/transformers framework on the amazon_polarity dataset. The dataset contains about 35 million Amazon reviews collected up to March 2013 (roughly 18 years of reviews). As a result of the project, the model should analyse new Amazon reviews and classify them as either positive or negative. The overall goal is to learn to work with the huggingface/transformers library and to apply the tools and frameworks taught in SkafteNicki/dtu_mlops to set up a proper ML operations project.
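
To make this concrete, here is a minimal, hedged sketch of the two building blocks described above: the amazon_polarity dataset and a pretrained DistilBERT tokenizer, both pulled from the Hugging Face hub. The checkpoint name `distilbert-base-uncased` is an assumption, not necessarily the one used in this repository.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# load a small slice of the amazon_polarity dataset for illustration
dataset = load_dataset("amazon_polarity", split="train[:1%]")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# tokenize one review; label 0 = negative, 1 = positive
sample = dataset[0]
encoded = tokenizer(sample["content"], truncation=True, return_tensors="pt")
print(sample["label"], encoded["input_ids"].shape)
```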

As already mentioned above, we use the Transformers framework to access the pretrained DistilBERT embeddings and its preprocessing tools (e.g. the tokenizer) for the sentiment analysis. The dataset is loaded directly from the Hugging Face hub. Initially, we used the frozen embeddings of BERT and added a final classification layer, as proposed in this jupyter notebook. However, since training took too long with BERT, we switched to DistilBERT, which has only 66 million parameters compared with BERT's 340 million.
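
The frozen-encoder idea can be sketched roughly as below. This is a simplified illustration with assumed names (class and checkpoint), not the actual module implemented in `opensentiment/models/bert_model_pl.py`.

```python
import torch
from torch import nn
from transformers import AutoModel

class FrozenDistilBertClassifier(nn.Module):
    """Frozen DistilBERT encoder with a trainable classification head (illustrative)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("distilbert-base-uncased")
        for param in self.encoder.parameters():
            param.requires_grad = False  # keep the pretrained weights frozen
        self.classifier = nn.Linear(self.encoder.config.dim, num_classes)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor = None):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden[:, 0])  # classify on the first ([CLS]) token
```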

Tools planned (or already implemented) to be used in the project (a short logging sketch follows the table):

| Tools / Frameworks / Configurations / Packages | Purpose |
| --- | --- |
| Conda environment | Closed environment to facilitate package handling |
| Wandb | Experiment logging |
| Hydra | Management of config files for training |
| Cookiecutter | Setting up the project environment |
| black, flake8 | Coding style |
| isort | Sorting of imports |
| dvc | Data versioning |
| Google Cloud | File storage, training, deployment |
| docker | Building train and prediction containers for deployment |
| FastAPI | Project API for prediction interface |
| Huggingface | Pretrained model, datamodule |
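
As an example of how two of these tools fit together, below is a hedged sketch of attaching Wandb experiment logging to a PyTorch Lightning trainer (the train config mentions `pl.Trainer`); the Wandb project name is a placeholder.

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# log metrics and hyperparameters to Wandb; "senti_anal" is a placeholder project name
wandb_logger = WandbLogger(project="senti_anal")
trainer = pl.Trainer(max_epochs=1, logger=wandb_logger)
# trainer.fit(model, datamodule)  # model and datamodule are defined in opensentiment
```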

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external            <- Data from third party sources.
│   ├── interim             <- Intermediate data that has been transformed.
│   ├── processed           <- The final, canonical data sets for modeling.
│   └── raw                 <- The original, immutable data dump.
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
├── models             <- Trained and serialized models, model predictions; also used to store pretrained models locally
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── config             <- Hydra config files for the project (see the sketch below the tree)
│   ├── data                <- Config files defining the datamodule 
│   ├── hydra               <- Config files defining hydra setup
│   ├── logging             <- Config files defining logging in gcp, wandb
│   ├── model               <- Config files defining used model
│   ├── optim               <- Config files defining model optimizer
│   └── train               <- Config files defining train setup (pl.Trainer, metric, early stopping)
├── opensentiment      <- Source code for use in this project.
│   ├── __init__.py         <- Makes opensentiment a Python module
│   ├── data                <- Scripts to download or generate data
│   │   └── make_dataset_pl.py
│   ├── gcp                 <- Scripts to define settings for Google Cloud handling
│   │   └── build_features.py
│   ├── models              <- Scripts to define and train the model and then use trained models to make
│   │   │                      predictions
│   │   ├── bert_model_pl.py
│   │   ├── predict_model_pl.py
│   │   └── train_model_pl.py
│   └── api                 <- Scripts to create the FastAPI prediction interface
├── setup               <- Files to set up docker (.sh, .yaml, .dockerfile) and pip requirements for cpu and gpu use
│   ├── docker              <- Folder containing all files to build docker images
│   └── pip                 <- Folder containing all files for correct pip setup depending on cpu or gpu
├── requirements.txt    <- General requirements file for the project
├── requirements_gpu.txt <- Additional requirements file for gpu handling

└── tox.ini             <- tox file with settings for running tox; see tox.readthedocs.io
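
The `config` folder above holds the Hydra config groups (data, hydra, logging, model, optim, train). The sketch below shows how such groups are typically composed in a training entry point; the `config_name` is a placeholder, not necessarily the file used in this repository.

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="config", config_name="default")  # "default" is a placeholder
def train(cfg: DictConfig) -> None:
    # the composed config exposes the groups above, e.g. cfg.model, cfg.optim, cfg.train
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    train()
```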

Minimal Installation

Default configuration (Conda 5.10 / Ubuntu 20.04):

conda create -y --name py39senti python=3.9 pip
conda activate py39senti

# default (CPU) installation
pip install -r requirements.txt
# for GPU (CUDA 11.3) use instead:
# pip install -r requirements_gpu.txt

# git hooks
pre-commit install
# get data
dvc pull
# verify everything is working
coverage run -m --source=./opensentiment pytest tests
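
Training can then be launched; a hedged example, assuming the entry point is the `train_model_pl.py` script listed in the project layout (the exact invocation and Hydra overrides may differ):

# train the model
python opensentiment/models/train_model_pl.py
# print a summary of the coverage collected by the test run above
coverage report -m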