Official repository for the ECCV 2024 paper "ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection".
Python requirements are specified in requirements.txt.
Note: We use a few functions from TensorFlow Addons (TFA) for data augmentation. TFA is no longer maintained and does not work with the latest TensorFlow versions (it works with the version specified in requirements.txt). Maintained replacements for these functions seem to exist in the TensorFlow Models (TFM) library. Feel free to use these instead if you want to use the latest TensorFlow versions. We might update this repo to use TFM instead of TFA at some point.
Optional: the Dockerfile defines a working Docker environment for running this code.
Make sure the prosub directory is in your Python path when running this code:
REPODIR=/path/to/this/repo
export PYTHONPATH=$PYTHONPATH:$REPODIR
Set bash variables specifying where to store data and training results:
DATADIR="/directory/for/storing/data"
TRAINDIR="/directory/for/storing/checkpoints/and/results"
Create the data directory if it does not already exist:
mkdir -p $DATADIR
Optional: set the TensorFlow logging level to 1 to disable info messages:
export TF_CPP_MIN_LOG_LEVEL=1
(0 prints all messages, 1 disables info messages, 2 disables info & warning messages, 3 disables all messages)
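If training is launched from your own Python entry point instead of the shell, the same setting can be applied through `os.environ`. The sketch below is a minimal illustration and not part of this repo; the helper name `set_tf_log_level` is hypothetical. The variable is generally read when TensorFlow is first loaded, so it should be set before `import tensorflow`.

```python
import os


def set_tf_log_level(level: int) -> None:
    """Set TF_CPP_MIN_LOG_LEVEL (hypothetical helper, not part of this repo)."""
    if level not in (0, 1, 2, 3):
        raise ValueError("level must be 0, 1, 2, or 3")
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = str(level)


# Call this before importing TensorFlow so the setting takes effect.
set_tf_log_level(1)
# import tensorflow as tf  # import TensorFlow only after the call above
```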
For CIFAR-10, CIFAR-100, and TinyImageNet, we use tfrecord files. For ImageNet30 and ImageNet100 we read the data from image files. This section describes how to download and prepare the datasets.
Download the tar files following the links named "ImageNet-30-train" and "ImageNet-30-test" in the CSI repo.
Extract the files:
mkdir $DATADIR/imagenet30
tar -xvf one_class_train.tar -C $DATADIR/imagenet30
tar -xvf one_class_test.tar -C $DATADIR/imagenet30
For ImageNet100, download the zip file from the Kaggle page.
Extract the files and rename the directories:
unzip imagenet100.zip -d $DATADIR/imagenet100
cd $DATADIR/imagenet100
mkdir train
mv train.X1/* train.X2/* train.X3/* train.X4/* train/
mv val.X val
rm -r train.X*
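After merging the shards, a quick check of the directory layout can catch an incomplete move before the dataset script is run. The sketch below is not part of this repo: the function names are hypothetical, and it assumes that `train` and `val` each contain one subdirectory per class, 100 in total.

```python
from pathlib import Path


def count_class_dirs(split_dir: str) -> int:
    """Count the immediate subdirectories of split_dir (assumed: one per class)."""
    root = Path(split_dir)
    if not root.is_dir():
        return 0
    return sum(1 for entry in root.iterdir() if entry.is_dir())


def check_imagenet100_layout(root: str, expected_classes: int = 100) -> bool:
    """Return True if train/ and val/ both hold the expected number of class folders."""
    return all(
        count_class_dirs(str(Path(root) / split)) == expected_classes
        for split in ("train", "val")
    )
```

For example, `check_imagenet100_layout("/directory/for/storing/data/imagenet100")` should return `True` if the assumed layout holds.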
The following script downloads the data and prepares the tfrecord files for CIFAR-10, CIFAR-100, and TinyImageNet. It also checks if the ImageNet datasets are correctly installed.
python3 $REPODIR/scripts/create_datasets.py \
--datadir=$DATADIR \
--repodir=$REPODIR
To create the labeled subsets, run, for example:
python3 $REPODIR/scripts/create_split.py \
--seed=1 \
--size=5000 \
$DATADIR/SSL2/tinyimagenet-id \
$DATADIR/tinyimagenet-id-train.tfrecord
which creates a labeled subset of 5,000 samples from the tinyimagenet-id-train records using random seed 1. It generates the file tinyimagenet-id.1@5000-label.tfrecord in the directory $DATADIR/SSL2.
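The output name appears to combine the target prefix, the seed, and the subset size. The sketch below composes the expected path so that it can be checked after the script has run. The pattern is inferred from the single example above and is an assumption; the helper name `label_split_path` is hypothetical.

```python
def label_split_path(target_prefix: str, seed: int, size: int) -> str:
    """Compose the expected output path of create_split.py.

    The pattern <prefix>.<seed>@<size>-label.tfrecord is inferred from the
    example in the text; it is an assumption, not read from the script.
    """
    return f"{target_prefix}.{seed}@{size}-label.tfrecord"


# Prints /data/SSL2/tinyimagenet-id.1@5000-label.tfrecord
print(label_split_path("/data/SSL2/tinyimagenet-id", seed=1, size=5000))
```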
The corresponding command for CIFAR-10 is, for example:
python3 $REPODIR/scripts/create_split.py \
--seed=1 \
--size=4000 \
$DATADIR/SSL2/cifar10 \
$DATADIR/cifar10-train.tfrecord
We do not need to run scripts/create_split.py for ImageNet30 and ImageNet100. The labeled subsets for these datasets are defined in the text files $REPODIR/data-files/imagenet100-id.0@5000-label.txt and $REPODIR/data-files/imagenet30-id.0@2600-label.txt.
Here are examples of how to run ProSub for the different datasets with the configurations used for the results in the paper.
CIFAR-10
python3 $REPODIR/prosub_ossl.py \
--datadir=$DATADIR \
--traindir=$TRAINDIR \
--pretrainsteps=50000 \
--trainsteps=$((2**19)) \
--dataset=cifar10 \
--datasetood=cifar100 \
--datasetunseen=cifar100 \
--nlabeled=4000 \
--ws=10.0 \
--arch=WRN-28-2 \
--seed=1
CIFAR-100
DECAYFACTOR=$(bc <<< "scale=4; 5/8")
python3 $REPODIR/prosub_ossl.py \
--datadir=$DATADIR \
--traindir=$TRAINDIR \
--pretrainsteps=50000 \
--trainsteps=$((2**19)) \
--dataset=cifar100 \
--datasetood=cifar10 \
--datasetunseen=cifar10 \
--nlabeled=2500 \
--decayfactor=$DECAYFACTOR \
--ws=15.0 \
--wd=0.001 \
--arch=WRN-28-8 \
--seed=1
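Two values in the commands above are computed by the shell: `$((2**19))` is bash integer arithmetic, and `bc` with `scale=4` evaluates 5/8 to four decimal places (it prints `.6250`). The short sketch below reproduces both values for reference; the variable names are illustrative only.

```python
from fractions import Fraction

# $((2**19)) in bash is integer exponentiation.
train_steps = 2 ** 19

# bc <<< "scale=4; 5/8" evaluates the same ratio to four decimal places.
decay_factor = Fraction(5, 8)

print(train_steps)          # 524288
print(float(decay_factor))  # 0.625
```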
TinyImageNet
DECAYFACTOR=$(bc <<< "scale=4; 5/8")
python3 $REPODIR/prosub_ossl.py \
--datadir=$DATADIR \
--traindir=$TRAINDIR \
--pretrainsteps=50000 \
--trainsteps=$((2**19)) \
--dataset=tinyimagenet-id \
--datasetood=tinyimagenet-ood \
--datasetunseen=tinyimagenet-ood \
--nlabeled=5000 \
--decayfactor=$DECAYFACTOR \
--ws=50.0 \
--wd=0.001 \
--arch=WRN-28-4 \
--seed=1
ImageNet30
python3 $REPODIR/prosub_ossl.py \
--datadir=$DATADIR \
--traindir=$TRAINDIR \
--pretrainsteps=30000 \
--trainsteps=100000 \
--dataset=imagenet30-id \
--datasetood=imagenet30-ood \
--datasetunseen=imagenet30-ood \
--nlabeled=2600 \
--ws=20.0 \
--pi=0.66 \
--arch=ResNet18 \
--seed=0
ImageNet100
python3 $REPODIR/prosub_ossl.py \
--datadir=$DATADIR \
--traindir=$TRAINDIR \
--pretrainsteps=30000 \
--trainsteps=100000 \
--dataset=imagenet100-id \
--datasetood=imagenet100-ood \
--datasetunseen=imagenet100-ood \
--nlabeled=5000 \
--ws=40.0 \
--arch=ResNet18 \
--seed=0
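The five configurations above differ only in a handful of flags, so they can be kept in one table and expanded into commands. The sketch below is not part of this repo: the `CONFIGS` dictionary and `build_command` are hypothetical names, and the flag values are copied from the commands above. It only builds the argument list; passing it to `subprocess.run` would start training.

```python
# Flag values copied from the example commands; the layout is illustrative.
WRN_STEPS = dict(pretrainsteps=50000, trainsteps=2 ** 19)
RESNET_STEPS = dict(pretrainsteps=30000, trainsteps=100000)

CONFIGS = {
    "cifar10": dict(WRN_STEPS, dataset="cifar10", datasetood="cifar100",
                    datasetunseen="cifar100", nlabeled=4000, ws=10.0,
                    arch="WRN-28-2", seed=1),
    "cifar100": dict(WRN_STEPS, dataset="cifar100", datasetood="cifar10",
                     datasetunseen="cifar10", nlabeled=2500, decayfactor=5 / 8,
                     ws=15.0, wd=0.001, arch="WRN-28-8", seed=1),
    "tinyimagenet": dict(WRN_STEPS, dataset="tinyimagenet-id",
                         datasetood="tinyimagenet-ood",
                         datasetunseen="tinyimagenet-ood", nlabeled=5000,
                         decayfactor=5 / 8, ws=50.0, wd=0.001,
                         arch="WRN-28-4", seed=1),
    "imagenet30": dict(RESNET_STEPS, dataset="imagenet30-id",
                       datasetood="imagenet30-ood",
                       datasetunseen="imagenet30-ood", nlabeled=2600,
                       ws=20.0, pi=0.66, arch="ResNet18", seed=0),
    "imagenet100": dict(RESNET_STEPS, dataset="imagenet100-id",
                        datasetood="imagenet100-ood",
                        datasetunseen="imagenet100-ood", nlabeled=5000,
                        ws=40.0, arch="ResNet18", seed=0),
}


def build_command(name: str, repodir: str, datadir: str, traindir: str) -> list:
    """Return the argument list for one configuration (hypothetical helper)."""
    flags = {"datadir": datadir, "traindir": traindir, **CONFIGS[name]}
    script = f"{repodir}/prosub_ossl.py"
    return ["python3", script] + [f"--{key}={value}" for key, value in flags.items()]


print(" ".join(build_command("cifar10", "/path/to/this/repo", "/data", "/train")))
```

Keeping the values in one place makes it easier to see which flags change between datasets, for example that `--decayfactor` and `--wd` are only set for CIFAR-100 and TinyImageNet.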
Note:
--datasetunseen does not affect training. It can be used to run evaluations on a third dataset that is unseen during training.
--seed only affects the selection of labeled data. It does not seed other random sources during training. It needs to be set to 0 for ImageNet runs because we only use single predefined labeled subsets for these runs.
Results are stored in summary files in the training directory. View results with TensorBoard using
tensorboard --logdir $TRAINDIR