nilesh2797 / zestxml

This is the official codebase for the KDD 2021 paper Generalized Zero-Shot Extreme Multi-Label Learning
BSD 3-Clause "New" or "Revised" License

Generalized Zero-Shot Extreme Multi-Label Learning

Nilesh Gupta, Sakina Bohra, Yashoteja Prabhu, Saurabh Purohit, Manik Varma

Overview

Extreme Multi-label Learning (XML) involves assigning the subset of most relevant labels to a data point from an extremely large set of label choices. An unaddressed challenge in XML is that of predicting unseen labels, i.e. labels with no training points.

Generalized Zero-shot XML (GZXML) is a paradigm where the task is to tag a data point with the most relevant labels from a large universe of both seen and unseen labels.

Running the Code

# Build
make

# Download GZ-Eurlex-4.3K dataset
mkdir GZXML-Datasets
cd GZXML-Datasets
pip install gdown
gdown "https://drive.google.com/uc?id=1j27bQZol6gOQ7AATawShcF4jXJr3Venb"
tar -xvzf GZ-Eurlex-4.3K.tar.gz
cd -

# Train and predict ZestXML on GZ-Eurlex-4.3K dataset
./run_eurlex.sh train
./run_eurlex.sh predict

# Install dependencies of metrics.py
pip install -r requirements.txt
# Install pyxclib for evaluation
git clone https://github.com/kunaldahiya/pyxclib.git
cd pyxclib
python3 setup.py install --user
cd -

# Prints evaluation metrics
python metrics.py GZ-Eurlex-4.3K

Public Datasets

The following datasets were used in the paper to benchmark GZXML algorithms (all datasets can be downloaded from here)

Some statistics of these datasets:

| Dataset | Train Points | Test Points | Seen Labels | Unseen Labels | Point Features | Label Features |
|---|---|---|---|---|---|---|
| GZ-Eurlex-4.3K | 45,000 | 6,000 | 4,108 | 163 | 100,000 | 24,316 |
| GZ-Amazon-1M | 914,179 | 1,465,767 | 476,381 | 483,725 | 1,000,000 | 1,476,381 |
| GZ-Wikipedia-1M | 2,271,533 | 2,705,425 | 495,107 | 776,612 | 1,000,000 | 1,438,196 |
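
To get a feel for how much of each label space is zero-shot, the short sketch below computes the share of unseen labels from the seen and unseen label counts listed above. The helper name `unseen_share` is made up for illustration and is not part of this codebase.

```python
# Seen / unseen label counts copied from the dataset statistics above.
label_counts = {
    "GZ-Eurlex-4.3K": (4108, 163),
    "GZ-Amazon-1M": (476381, 483725),
    "GZ-Wikipedia-1M": (495107, 776612),
}


def unseen_share(seen: int, unseen: int) -> float:
    """Fraction of the full label set that has no training points."""
    return unseen / (seen + unseen)


for name, (seen, unseen) in label_counts.items():
    total = seen + unseen
    print(f"{name}: {total:,} labels, {unseen_share(seen, unseen):.1%} unseen")
```

On these counts the unseen share ranges from a few percent on GZ-Eurlex-4.3K to more than half of all labels on the two larger datasets.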

Data Format

All sparse matrices are stored in text sparse matrix format; refer to the Text sparse matrix format subsection for more details. The required files are as follows:

Text sparse matrix format

This is a plain-text, row-major representation of a sparse matrix. The details of the format are as follows:
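
As a rough illustration, a reader for a plain-text row-major sparse matrix might look like the sketch below. It assumes a layout that is common in extreme-classification tooling: a header line holding the number of rows and columns, followed by one line per row of space-separated `col:value` pairs with zero-based column indices. Verify these assumptions against the actual dataset files before relying on them. The function name `parse_text_sparse_matrix` is hypothetical and not part of this codebase.

```python
from typing import Dict, List, Tuple


def parse_text_sparse_matrix(text: str) -> Tuple[int, int, List[Dict[int, float]]]:
    """Parse an ASSUMED layout: a 'num_rows num_cols' header, then one line per row."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty input: missing 'num_rows num_cols' header")
    num_rows, num_cols = (int(tok) for tok in lines[0].split())

    # Row-major: the i-th line after the header describes row i.
    body = lines[1:1 + num_rows]
    if len(body) != num_rows:
        raise ValueError(f"expected {num_rows} rows, found {len(body)}")

    rows: List[Dict[int, float]] = []
    for line in body:
        row: Dict[int, float] = {}
        for token in line.split():
            col_str, val_str = token.split(":", 1)
            col = int(col_str)
            # Assumption: column indices are zero-based.
            if not 0 <= col < num_cols:
                raise ValueError(f"column index {col} out of range")
            row[col] = float(val_str)
        rows.append(row)  # a blank line yields an all-zero row
    return num_rows, num_cols, rows


# A 3 x 5 matrix whose middle row is empty.
example = "3 5\n0:1.0 3:0.5\n\n4:2.0\n"
n_rows, n_cols, rows = parse_text_sparse_matrix(example)
print(n_rows, n_cols, rows)
```

Each row is kept as a dictionary here to keep the sketch dependency-free; for the larger datasets a compressed sparse row structure such as `scipy.sparse.csr_matrix` would be a more practical target.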

Cite

@InProceedings{Gupta21,
  author    = "Gupta, N. and Bohra, S. and Prabhu, Y. and Purohit, S. and Varma, M.",
  title     = "Generalized Zero-Shot Extreme Multi-label Learning",
  booktitle = "Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining",
  month     = "August",
  year      = "2021"
}