TextNN is a collection of Python code snippets solving different text mining tasks (on varying datasets) using deep learning.
Before using the code, please install the necessary software dependencies.
Set up the `textnn` conda environment:
```sh
conda env create -f environment.yml; \
conda activate textnn
```
Update the conda environment (from an old version):
```sh
conda env update -f environment.yml; \
conda activate textnn
```
Running the code in a Docker container can be achieved by building the image:
```sh
docker build --target=env-and-code --tag textnn .
```
and running the image in interactive mode (the conda environment is loaded automatically):
```sh
docker run --rm -it textnn
```
To reflect current code changes inside the container, bind the current directory as a code volume:
```sh
docker run --rm -v "${PWD}:/code" -it textnn
```
Please note: changes made in the container are reflected in the code directory of the host system.
To enable GPU support, build with:
```sh
docker build --target=gpu-env-and-code --tag textnn .
```
and run:
```sh
docker run --rm --runtime=nvidia -it textnn
```
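A quick sanity check can confirm that the GPU is actually visible inside the container. This is a minimal sketch assuming the NVIDIA container runtime is configured on the host and `nvidia-smi` is available in the image (both assumptions about the setup, not guarantees of this repository):
```sh
# Hypothetical sanity check: list GPUs visible inside the container.
# Assumes the nvidia runtime is configured and nvidia-smi exists in the image.
docker run --rm --runtime=nvidia -it textnn nvidia-smi
```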
The recommended EC2 setup (e.g., `g3s.xlarge`) is based on the Deep Learning AMI (Ubuntu) Version 21.2 (ami-0e9085a8d461c2d01) with an increased volume of 120 GB or more. It is recommended to execute the code via Docker, by setting up the project and creating an image:
```sh
git clone https://github.com/tongr/TextNN && cd TextNN && \
docker build --target=gpu-env-and-code -t textnn .
```
and running the experiments inside the container:
```sh
docker run --rm --runtime=nvidia -v "${PWD}:/code" -it textnn
```
To build and push the current version (also tagged as `latest`), run:
```sh
DATE="$(date -u +'%Y-%m-%dT%H:%M:%SZ')" && \
NAME="registry.gitlab.com/tongr/textnn" && \
VERSION="$(git describe --always)" && \
COMMIT="$(git rev-parse HEAD)" && \
docker build --target=env-and-code --build-arg "BUILD_DATE=${DATE}" --build-arg "BUILD_NAME=${NAME}" \
    --build-arg "BUILD_VERSION=${VERSION}" --build-arg "VCS_REF=${COMMIT}" \
    --tag ${NAME}:${VERSION} --tag ${NAME}/cpu:${VERSION} . && \
docker tag ${NAME}:${VERSION} ${NAME}:latest && \
docker tag ${NAME}/cpu:${VERSION} ${NAME}/cpu:latest && \
docker push ${NAME}:${VERSION} && docker push ${NAME}/cpu:${VERSION} && \
docker push ${NAME}:latest && docker push ${NAME}/cpu:latest && \
docker build --target=gpu-env-and-code --build-arg "BUILD_DATE=${DATE}" --build-arg "BUILD_NAME=${NAME}" \
    --build-arg "BUILD_VERSION=${VERSION}" --build-arg "VCS_REF=${COMMIT}" --tag ${NAME}/gpu:${VERSION} . && \
docker tag ${NAME}/gpu:${VERSION} ${NAME}/gpu:latest && \
docker push ${NAME}/gpu:${VERSION} && docker push ${NAME}/gpu:latest
```
Run tests:
```sh
docker run --rm -it registry.gitlab.com/tongr/textnn:latest pytest --cov -vv
```
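For quicker iteration, the same test suite can presumably be run directly in the local conda environment; this assumes `pytest` and `pytest-cov` are part of the dependencies installed via `environment.yml`:
```sh
# Run the test suite locally instead of inside Docker
# (assumes pytest/pytest-cov are included in the textnn environment).
conda activate textnn && pytest --cov -vv
```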
The individual datasets have a specific `DATASET` indicator; the parameters for the following experiments are equivalent across all datasets.
To run training and evaluation of an LSTM model to predict positive/negative reviews, run:
```sh
python ./run_experiment.py [DATASET] [OPT_ARGS] train-and-test [--validation-split VALIDATION_HOLD_OUT_RATIO]
```
where the optional `VALIDATION_HOLD_OUT_RATIO` (default `0.05`) specifies how much data will be held back for epoch validation during training.
Further optional arguments `OPT_ARGS` influence the following areas:

- `--vocabulary-size VOCABULARY_SIZE`, `--max-text-length MAX_TEXT_LENGTH`, `--pad-beginning [True|False]` (whether to add padding at the start or at the end of a sequence), and `--use-start-end-indicators [True|False]` (whether to use the reserved indicator tokens `<START>` and `<END>`)
- `--embeddings [EMBEDDING_SIZE|PRETRAINED_EMBEDDINGS_FILE]` (`--update-embeddings [True|False]`)
- `--layer-definitions [LAYER_DEFINITIONS]` (layer definitions separated by a pipe, e.g., `--layer-definitions 'LSTM(16)|Dense(8)'`)
- `--batch-size BATCH_SIZE`, `--num-epochs NUM_EPOCHS`, `--learning-rate LEARNING_RATE`, `--learning-decay LEARNING_DECAY`, `--shuffle-training-data [True|False|RANDOM_SEED]` (`RANDOM_SEED` refers to an `int` value used as the seed for the random number generator)
- `--log-config [True|False]` (default: `True`)

To debug the selected encoding model, run:
```sh
python ./run_experiment.py [DATASET] [OPT_ARGS] \
    test-encoding "This is a test sentence" "This sentence contains the unknown word klcuvhacnjbduskxuscj" \
    [--show-padding [True|False]] [--show-start-end [True|False]]
```
This command will create representations for the two example sentences. The parameter `--show-padding` forces the output of `<PAD>` indicators in the re-decoded text, and `--show-start-end` en-/disables the `<START>` and `<END>` indicators. The aforementioned optional arguments `OPT_ARGS` still apply.
To execute k-fold cross-validation, run:
```sh
python ./run_experiment.py [DATASET] [OPT_ARGS] \
    cross-validation [--k NUMBER_OF_FOLDS]
```
`NUMBER_OF_FOLDS` indicates the number of folds/splits to use for cross-validation. The aforementioned optional arguments `OPT_ARGS` still apply.
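As a concrete illustration of how these options combine, the following sketch trains a small LSTM on the IMDb dataset (described below); all parameter values are arbitrary examples, not recommended defaults:
```sh
# Illustrative invocation combining several OPT_ARGS; values are arbitrary examples.
python ./run_experiment.py imdb --data-folder ./aclImdb/ \
    --vocabulary-size 20000 --max-text-length 500 \
    --embeddings 128 --layer-definitions 'LSTM(16)|Dense(8)' \
    --batch-size 64 --num-epochs 5 --learning-rate 0.001 \
    train-and-test --validation-split 0.05
```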
The ACL IMDb dataset consists of 25,000 highly polar movie reviews for training and 25,000 for testing; it is available at http://ai.stanford.edu/~amaas/data/sentiment/.
Preparation: download the dataset and extract it into the `aclImdb` subfolder:
```sh
curl http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz | tar -xz
```
In the following examples, the indicator `IMDB_DATA_FOLDER` refers to the base folder of the ACL IMDb dataset:
```sh
IMDB_DATA_FOLDER=${PWD}/aclImdb/
```
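After extraction, the folder should contain the standard ACL IMDb layout with separate train/test splits and pos/neg subfolders; a quick listing can confirm this:
```sh
# Expected standard ACL IMDb layout after extraction:
ls "${IMDB_DATA_FOLDER}"        # train/  test/  imdb.vocab  README  ...
ls "${IMDB_DATA_FOLDER}/train"  # pos/  neg/  unsup/  ...
```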
Run experiments:
To run training and evaluation of an LSTM model to predict positive/negative reviews, run:
```sh
python ./run_experiment.py imdb --data-folder [IMDB_DATA_FOLDER] [OPT_ARGS] \
    train-and-test [--validation-split VALIDATION_HOLD_OUT_RATIO]
```
where `IMDB_DATA_FOLDER` refers to the base folder of the ACL IMDb dataset; the aforementioned optional arguments `VALIDATION_HOLD_OUT_RATIO` and `OPT_ARGS` still apply.
To debug the selected encoding model, run:
```sh
python ./run_experiment.py imdb --data-folder [IMDB_DATA_FOLDER] [OPT_ARGS] \
    test-encoding "This is a test sentence" "This sentence contains the unknown word klcuvhacnjbduskxuscj" \
    [--show-padding [True|False]] [--show-start-end [True|False]]
```
The aforementioned optional arguments `--show-padding [...]`, `--show-start-end [...]`, and `OPT_ARGS` still apply.
To execute k-fold cross-validation based only on the training data set, run:
```sh
python ./run_experiment.py imdb --data-folder [IMDB_DATA_FOLDER] [OPT_ARGS] \
    cross-validation [--k NUMBER_OF_FOLDS]
```
The aforementioned optional arguments `NUMBER_OF_FOLDS` and `OPT_ARGS` still apply.
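For example, a reproducible cross-validation run might look like the following sketch; the fold count and random seed are arbitrary illustrations:
```sh
# Illustrative 10-fold cross-validation with a fixed shuffling seed.
python ./run_experiment.py imdb --data-folder "${IMDB_DATA_FOLDER}" \
    --shuffle-training-data 42 \
    cross-validation --k 10
```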
The Amazon reviews dataset consists of more than a hundred million reviews by millions of Amazon customers, collected over two decades. The reviews express opinions and describe customer experiences regarding products on the Amazon.com website. The different review subsets are listed here: https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
Preparation: download a dataset (e.g., the Amazon Video reviews `amazon_reviews_us_Video_v1_00.tsv.gz`):
```sh
wget https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_v1_00.tsv.gz -P amazon
```
In the following examples, the indicator `AMAZON_DATA_FILE` refers to the downloaded data file of the Amazon dataset:
```sh
AMAZON_DATA_FILE=${PWD}/amazon/amazon_reviews_us_Video_v1_00.tsv.gz
```
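The public Amazon reviews files are tab-separated with a header row (including columns such as `star_rating` and `review_body`); inspecting the first lines is a quick way to confirm the download:
```sh
# Peek at the header row and first record of the gzipped TSV.
zcat amazon/amazon_reviews_us_Video_v1_00.tsv.gz | head -n 2
```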
Run experiments:
To run training and evaluation of an LSTM model to predict positive/negative reviews, run:
```sh
python ./run_experiment.py amazon --data-file [AMAZON_DATA_FILE] [OPT_ARGS] \
    train-and-test [--validation-split VALIDATION_HOLD_OUT_RATIO]
```
where `AMAZON_DATA_FILE` refers to the Amazon dataset file; the aforementioned optional arguments `VALIDATION_HOLD_OUT_RATIO` and `OPT_ARGS` still apply.
To debug the selected encoding model, run:
```sh
python ./run_experiment.py amazon --data-file [AMAZON_DATA_FILE] [OPT_ARGS] \
    test-encoding "This is a test sentence" "This sentence contains the unknown word klcuvhacnjbduskxuscj" \
    [--show-padding [True|False]] [--show-start-end [True|False]]
```
The aforementioned optional arguments `--show-padding [...]`, `--show-start-end [...]`, and `OPT_ARGS` still apply.
To execute k-fold cross-validation based only on the training data set, run:
```sh
python ./run_experiment.py amazon --data-file [AMAZON_DATA_FILE] [OPT_ARGS] \
    cross-validation [--k NUMBER_OF_FOLDS]
```
The aforementioned optional arguments `NUMBER_OF_FOLDS` and `OPT_ARGS` still apply.
The Yelp reviews dataset consists of approx. 6 million reviews for 200k businesses. The reviews express opinions and describe customer experiences collected on www.yelp.com.
Preparation: download the dataset and extract the `review.json` file. In the following examples, the indicator `YELP_DATA_FILE` refers to the extracted `review.json` file.
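The Yelp download ships as an archive; the following sketch of the preparation step assumes the archive is named `yelp_dataset.tar.gz` (the actual archive and member names may differ by release):
```sh
# Hypothetical extraction; archive and member names depend on the downloaded release.
tar -xzf yelp_dataset.tar.gz review.json
YELP_DATA_FILE=${PWD}/review.json
```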
Run experiments:
To run training and evaluation of an LSTM model to predict positive/negative reviews, run:
```sh
python ./run_experiment.py yelp --data-file [YELP_DATA_FILE] [OPT_ARGS] \
    train-and-test [--validation-split VALIDATION_HOLD_OUT_RATIO]
```
where `YELP_DATA_FILE` refers to the Yelp dataset file; the aforementioned optional arguments `VALIDATION_HOLD_OUT_RATIO` and `OPT_ARGS` still apply.
To debug the selected encoding model, run:
```sh
python ./run_experiment.py yelp --data-file [YELP_DATA_FILE] [OPT_ARGS] \
    test-encoding "This is a test sentence" "This sentence contains the unknown word klcuvhacnjbduskxuscj" \
    [--show-padding [True|False]] [--show-start-end [True|False]]
```
The aforementioned optional arguments `--show-padding [...]`, `--show-start-end [...]`, and `OPT_ARGS` still apply.
To execute k-fold cross-validation based only on the training data set, run:
```sh
python ./run_experiment.py yelp --data-file [YELP_DATA_FILE] [OPT_ARGS] \
    cross-validation [--k NUMBER_OF_FOLDS]
```
The aforementioned optional arguments `NUMBER_OF_FOLDS` and `OPT_ARGS` still apply.
TODO: add description ...
Pretrained word embeddings can be used by loading the provided `.vec` files, for instance, fastText aligned word vectors (alternatively, other word vectors).
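A sketch of how a pretrained `.vec` file might be plugged in via the `--embeddings` option described above; the URL points at the English fastText aligned vectors, and the flag values are illustrative choices:
```sh
# Download the English fastText aligned vectors (a large plain-text .vec file)
# and train with frozen pretrained embeddings; values are illustrative.
wget https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.en.align.vec
python ./run_experiment.py imdb --data-folder "${IMDB_DATA_FOLDER}" \
    --embeddings wiki.en.align.vec --update-embeddings False \
    train-and-test
```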