This is the official codebase for the KDD 2021 paper **Generalized Zero-Shot Extreme Multi-Label Learning**

Nilesh Gupta, Sakina Bohra, Yashoteja Prabhu, Saurabh Purohit, Manik Varma
Extreme Multi-label Learning (XML) involves assigning the subset of most relevant labels to a data point from an extremely large set of label choices. An unaddressed challenge in XML is predicting unseen labels, which have no training points. Generalized Zero-shot XML (GZXML) is a paradigm where the task is to tag a data point with the most relevant labels from a large universe of both seen and unseen labels.
```shell
# Build
make

# Download GZ-Eurlex-4.3K dataset
mkdir GZXML-Datasets
cd GZXML-Datasets
pip install gdown
gdown "https://drive.google.com/uc?id=1j27bQZol6gOQ7AATawShcF4jXJr3Venb"
tar -xvzf GZ-Eurlex-4.3K.tar.gz
cd -

# Train and predict ZestXML on the GZ-Eurlex-4.3K dataset
./run_eurlex.sh train
./run_eurlex.sh predict

# Install dependencies of metrics.py
pip install -r requirements.txt

# Install pyxclib for evaluation
git clone https://github.com/kunaldahiya/pyxclib.git
cd pyxclib
python3 setup.py install --user
cd -

# Print evaluation metrics
python metrics.py GZ-Eurlex-4.3K
```
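`metrics.py` relies on pyxclib to compute the ranking metrics reported in the paper. For intuition, precision@k over a dense score matrix can be sketched in plain numpy (an illustrative sketch only, not the repository's implementation; `precision_at_k` and the toy arrays below are made up for this example):

```python
import numpy as np

def precision_at_k(scores, relevance, k):
    """Mean precision@k: the fraction of the top-k scored labels per
    point that are relevant. `scores` and `relevance` are dense
    (num_points x num_labels) arrays; `relevance` is binary."""
    topk = np.argsort(-scores, axis=1)[:, :k]           # top-k label indices per point
    hits = np.take_along_axis(relevance, topk, axis=1)  # 1 where a top-k label is relevant
    return hits.sum(axis=1).mean() / k

# toy example: 2 points, 4 labels
scores = np.array([[0.9, 0.1, 0.8, 0.2],
                   [0.3, 0.7, 0.2, 0.6]])
relevance = np.array([[1, 0, 1, 0],
                      [0, 1, 0, 0]])
print(precision_at_k(scores, relevance, k=2))  # -> 0.75
```

In practice the relevance and score matrices for these datasets are far too large to densify; the repository's evaluation works on sparse matrices via pyxclib.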
The following datasets were used in the paper for benchmarking GZXML algorithms (all datasets can be downloaded from here).

Some statistics of these datasets:
| Dataset | Train Points | Test Points | Seen Labels | Unseen Labels | Point Features | Label Features |
|---|---|---|---|---|---|---|
| GZ-Eurlex-4.3K | 45,000 | 6,000 | 4,108 | 163 | 100,000 | 24,316 |
| GZ-Amazon-1M | 914,179 | 1,465,767 | 476,381 | 483,725 | 1,000,000 | 1,476,381 |
| GZ-Wikipedia-1M | 2,271,533 | 2,705,425 | 495,107 | 776,612 | 1,000,000 | 1,438,196 |
All sparse matrices are stored in the text sparse matrix format; please refer to the text sparse matrix format subsection for more details. The following files are required:

- `Xf.txt`: all features used in the tf-idf representation of documents (`(trn/tst/val)_X_Xf`); the i-th line denotes the i-th feature of the tf-idf representation. For the datasets used in the paper these are stemmed unigram and bigram features of the documents, but you can choose any set of features depending on your application.
- `Yf.txt`: similar to `Xf.txt`, but representing the features of all labels. In addition to unigrams and bigrams, we add a unique feature specific to each label (represented by `__label__<i>__<label-i-text>`; this feature is present only in the i-th label's features). This allows the model to have label-specific parameters and helps it do well on many-shot labels. Features containing `__parent__` are specific to the `GZ-Eurlex-4.3K` dataset, whose raw labels carry additional information about each label's parent concepts; you can safely ignore these features for any other/new dataset.
- `(trn/tst/val)_X_Xf.txt`: sparse matrix (documents x document-features) representing the tf-idf feature matrix of the (trn/tst/val) input documents.
- `Y_Yf.txt`: similar to `(trn/tst/val)_X_Xf.txt` but for labels; the sparse matrix (labels x label-features) representing the tf-idf feature matrix of the labels.
- `trn_Y_Yf.txt`: similar to `Y_Yf.txt` but contains features for only the seen labels (can be interpreted as `Y_Yf[seen-labels]`).
- `(trn/tst/val)_X_Y.txt`: sparse matrix (documents x labels) representing the (trn/tst/val) document-label relevance matrix.

## Text sparse matrix format

This is a plain-text, row-major representation of a sparse matrix:

- The first line contains two space-separated integers: `num_row num_column`.
- `num_row` lines follow; each line represents one sparse row vector as space-separated `<index>:<value>` pairs. For example, the vector `[0, 0, 0.5, 0.4, 0, 0.2]` is written as `2:0.5 3:0.4 5:0.2` (NOTE: the indexing starts from 0).

See `GZ-Eurlex-4.3K/trn_X_Xf.txt` for a sample sparse matrix in this format.
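A file in this format can be parsed into a `scipy.sparse.csr_matrix` with a short helper (a minimal sketch; `read_sparse_mat` is not part of this repository):

```python
import numpy as np
from scipy.sparse import csr_matrix

def read_sparse_mat(path):
    """Parse the plain-text sparse matrix format: the first line is
    'num_row num_column'; each of the following num_row lines holds
    space-separated <index>:<value> pairs with 0-based indices."""
    with open(path) as f:
        num_row, num_col = map(int, f.readline().split())
        indptr, indices, data = [0], [], []
        for _ in range(num_row):
            for pair in f.readline().split():
                idx, val = pair.rsplit(":", 1)
                indices.append(int(idx))
                data.append(float(val))
            indptr.append(len(indices))  # row boundary in CSR layout
    return csr_matrix((data, indices, indptr), shape=(num_row, num_col))

# round-trip check on the example row from above
import os, tempfile
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("1 6\n2:0.5 3:0.4 5:0.2\n")
    tmp = f.name
mat = read_sparse_mat(tmp)
os.unlink(tmp)
print(mat.shape, mat[0, 2])  # -> (1, 6) 0.5
```

Empty lines are treated as all-zero rows, which the format permits for documents or labels with no active features.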
## Cite

```bib
@InProceedings{Gupta21,
    author = "Gupta, N. and Bohra, S. and Prabhu, Y. and Purohit, S. and Varma, M.",
    title = "Generalized Zero-Shot Extreme Multi-label Learning",
    booktitle = "Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining",
    month = "August",
    year = "2021"
}
```