This is the official codebase for the KDD 2021 paper **Generalized Zero-Shot Extreme Multi-Label Learning**

Nilesh Gupta, Sakina Bohra, Yashoteja Prabhu, Saurabh Purohit, Manik Varma
Extreme Multi-label Learning (XML) involves assigning the subset of most relevant labels to a data point from an extremely large set of label choices. An unaddressed challenge in XML is predicting unseen labels, which have no training points. Generalized Zero-shot XML (GZXML) is a paradigm where the task is to tag a data point with the most relevant labels from a large universe of both seen and unseen labels.
```shell
# Build
make

# Download GZ-Eurlex-4.3K dataset
mkdir GZXML-Datasets
cd GZXML-Datasets
pip install gdown
gdown "https://drive.google.com/uc?id=1j27bQZol6gOQ7AATawShcF4jXJr3Venb"
tar -xvzf GZ-Eurlex-4.3K.tar.gz
cd -

# Train and predict ZestXML on the GZ-Eurlex-4.3K dataset
./run_eurlex.sh train
./run_eurlex.sh predict

# Install dependencies of metrics.py
pip install -r requirements.txt

# Install pyxclib for evaluation
git clone https://github.com/kunaldahiya/pyxclib.git
cd pyxclib
python3 setup.py install --user
cd -

# Print evaluation metrics
python metrics.py GZ-Eurlex-4.3K
```
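`metrics.py` relies on pyxclib to compute the ranking metrics reported in the paper. For intuition, precision@k over a dense score matrix can be sketched in plain numpy (an illustrative sketch only, not the repository's implementation; `precision_at_k` and the toy arrays below are made up for this example):

```python
import numpy as np

def precision_at_k(scores, relevance, k):
    """Mean precision@k: the fraction of the top-k scored labels per
    point that are relevant. `scores` and `relevance` are dense
    (num_points x num_labels) arrays; `relevance` is binary."""
    topk = np.argsort(-scores, axis=1)[:, :k]           # top-k label indices per point
    hits = np.take_along_axis(relevance, topk, axis=1)  # 1 where a top-k label is relevant
    return hits.sum(axis=1).mean() / k

# toy example: 2 points, 4 labels
scores = np.array([[0.9, 0.1, 0.8, 0.2],
                   [0.3, 0.7, 0.2, 0.6]])
relevance = np.array([[1, 0, 1, 0],
                      [0, 1, 0, 0]])
print(precision_at_k(scores, relevance, k=2))  # -> 0.75
```

In practice the relevance and score matrices for these datasets are far too large to densify; the repository's evaluation works on sparse matrices via pyxclib.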
The following datasets were used in the paper for benchmarking GZXML algorithms (all datasets can be downloaded from here).

Some statistics of these datasets:
| Dataset | Train Points | Test Points | Seen Labels | Unseen Labels | Point Features | Label Features |
|---|---|---|---|---|---|---|
| GZ-Eurlex-4.3K | 45,000 | 6,000 | 4,108 | 163 | 100,000 | 24,316 |
| GZ-Amazon-1M | 914,179 | 1,465,767 | 476,381 | 483,725 | 1,000,000 | 1,476,381 |
| GZ-Wikipedia-1M | 2,271,533 | 2,705,425 | 495,107 | 776,612 | 1,000,000 | 1,438,196 |
All sparse matrices are stored in the text sparse matrix format; please refer to the text sparse matrix format subsection for more details. The following files are required:

- `Xf.txt`: all features used in the tf-idf representation of documents (`(trn/tst/val)_X_Xf`); the i-th line denotes the i-th feature of the tf-idf representation. For the datasets used in the paper these are stemmed unigram and bigram features of the documents, but you can choose any set of features depending on your application.
- `Yf.txt`: similar to `Xf.txt`, but representing the features of all labels. In addition to unigrams and bigrams, we add a unique feature specific to each label (represented by `__label__<i>__<label-i-text>`; this feature is present only in the i-th label's features). This allows the model to have label-specific parameters and helps it do well on many-shot labels. Features containing `__parent__` are specific to the `GZ-Eurlex-4.3K` dataset, whose raw labels carry additional information about each label's parent concepts; you can safely ignore these features for any other/new dataset.
- `(trn/tst/val)_X_Xf.txt`: sparse matrix (documents x document-features) representing the tf-idf feature matrix of the (trn/tst/val) input documents.
- `Y_Yf.txt`: similar to `(trn/tst/val)_X_Xf.txt` but for labels; the sparse matrix (labels x label-features) representing the tf-idf feature matrix of the labels.
- `trn_Y_Yf.txt`: similar to `Y_Yf.txt` but contains features for only the seen labels (can be interpreted as `Y_Yf[seen-labels]`).
- `(trn/tst/val)_X_Y.txt`: sparse matrix (documents x labels) representing the (trn/tst/val) document-label relevance matrix.

## Text sparse matrix format

This is a plain-text, row-major representation of a sparse matrix:

- The first line contains two space-separated integers: `num_row num_column`.
- `num_row` lines follow; each line represents one sparse row vector as space-separated `<index>:<value>` pairs. For example, the vector `[0, 0, 0.5, 0.4, 0, 0.2]` is written as `2:0.5 3:0.4 5:0.2` (NOTE: the indexing starts from 0).

See `GZ-Eurlex-4.3K/trn_X_Xf.txt` for a sample sparse matrix in this format.
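A file in this format can be parsed into a `scipy.sparse.csr_matrix` with a short helper (a minimal sketch; `read_sparse_mat` is not part of this repository):

```python
import numpy as np
from scipy.sparse import csr_matrix

def read_sparse_mat(path):
    """Parse the plain-text sparse matrix format: the first line is
    'num_row num_column'; each of the following num_row lines holds
    space-separated <index>:<value> pairs with 0-based indices."""
    with open(path) as f:
        num_row, num_col = map(int, f.readline().split())
        indptr, indices, data = [0], [], []
        for _ in range(num_row):
            for pair in f.readline().split():
                idx, val = pair.rsplit(":", 1)
                indices.append(int(idx))
                data.append(float(val))
            indptr.append(len(indices))  # row boundary in CSR layout
    return csr_matrix((data, indices, indptr), shape=(num_row, num_col))

# round-trip check on the example row from above
import os, tempfile
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("1 6\n2:0.5 3:0.4 5:0.2\n")
    tmp = f.name
mat = read_sparse_mat(tmp)
os.unlink(tmp)
print(mat.shape, mat[0, 2])  # -> (1, 6) 0.5
```

Empty lines are treated as all-zero rows, which the format permits for documents or labels with no active features.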
## Cite

```bib
@InProceedings{Gupta21,
    author = "Gupta, N. and Bohra, S. and Prabhu, Y. and Purohit, S. and Varma, M.",
    title = "Generalized Zero-Shot Extreme Multi-label Learning",
    booktitle = "Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining",
    month = "August",
    year = "2021"
}
```