Preparation of custom dataset for training

b-analyst commented 1 year ago

I'm trying to experiment with NapkinXC with a custom XMLC dataset, but I'm unsure how to prepare the input text and labels, and how to decode the output. Currently, I have the following code to prepare text embeddings and one-hot encoded labels:

Label preparation

from sklearn.preprocessing import MultiLabelBinarizer
import ast
from tqdm.auto import tqdm
y = MultiLabelBinarizer()
subclasses = df['subclass_id'].to_list()
subclasses = [ast.literal_eval(subclass) for subclass in tqdm(subclasses)]
labels = y.fit_transform(subclasses)

Text embeddings

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-miniLM-l6-v2', device='cuda')
model.max_seq_length = 256
print("Max Sequence Length:", model.max_seq_length)
sentence_embeddings = model.encode(
    df['patent_text'].values,
    batch_size=512,
    show_progress_bar=True,
    convert_to_numpy=True,
    device='cuda',
)

Then I proceed to convert these vectors to csr_matrices via:

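from scipy.sparse import csr_matrix
import numpy as np
# X_train, y_train, X_test, y_test are assumed to come from a train/test split
# of sentence_embeddings and labels (split not shown here)
X = csr_matrix(X_train.astype(np.float32))
Y = csr_matrix(y_train.astype(np.float32))
X_test = csr_matrix(X_test.astype(np.float32))
Y_test = csr_matrix(y_test.astype(np.float32))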

Training

I follow the quickstart like so:

from napkinxc.models import PLT
from napkinxc.measures import precision_at_k
plt = PLT("USPC-model")
plt.fit(X, Y)
Y_pred = plt.predict(X_test, top_k=10)
print(precision_at_k(Y_test, Y_pred, k=10))

This code runs, but I'm not sure how to interpret the result. Y_pred returns a list of lists containing integers (e.g. [[2316, 1056, 1691, 1690, 2322, 1064, 2315, 2301, 1714, 2302]]) and I'm unable to decode this to the original labels. Am I doing the data preparation correctly? How should I go about decoding the output labels? Thank you.

mwydmuch commented 1 year ago

Hi @b-analyst! Thanks for opening the issue.

The predict method returns a list of lists, one inner list per row (example) of X_test. Y_pred[0] holds the k labels with the highest probabilities for X_test[0], given as label indices (the column indices of Y_train) in order of descending probability, so Y_pred[0][0] is the index of the most probable label for X_test[0].
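To get back to the original label names, the predicted indices can be looked up in the classes_ attribute of the fitted MultiLabelBinarizer (the object named y in the question), since column j of its output corresponds to classes_[j]. A minimal sketch, assuming the label matrix passed to fit keeps the binarizer's column order; the toy label names and the hard-coded Y_pred are made up for illustration and stand in for real model output:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label sets standing in for the parsed `subclasses` list (hypothetical values).
subclasses = [["A01", "B02"], ["B02", "C03"], ["D04"]]

mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(subclasses)  # column j corresponds to mlb.classes_[j]

# Hypothetical stand-in for the result of predict(X_test, top_k=2):
# one list of label (column) indices per example.
Y_pred = [[1, 0], [3, 2]]

# Map every predicted column index back to its original label name.
classes = np.asarray(mlb.classes_)
decoded = [classes[np.asarray(row, dtype=int)].tolist() for row in Y_pred]
print(decoded)
```

The order inside each decoded list follows the order of the predictions, so the first name is the most probable label for that example.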

You can also use the predict_proba method, which returns a list of lists of two-element tuples: Y_pred[0][0] = (<index>, <probability>). Then Y_pred[0][0][0] is the index of the most probable label for X_test[0], and Y_pred[0][0][1] is its estimated probability.

You can convert both of these outputs to csr_matrix using:

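import numpy as np
from napkinxc.datasets import to_csr_matrix

Y_pred = plt.predict_proba(X_test, top_k=10)
# shape expects a (rows, columns) tuple, so pass Y_test.shape rather than Y_test itself
Y_pred = to_csr_matrix(Y_pred, shape=Y_test.shape, sort_indices=True, dtype=np.float32)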
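Once the predictions are in matrix form with the same columns as the training labels, they can also be decoded with the binarizer's inverse_transform. The sketch below builds the sparse matrix by hand with SciPy, only to stay self-contained; it is a stand-in for the conversion step, not napkinXC's implementation. The toy labels, the hard-coded predict_proba-style output, and the 0.5 threshold are all assumptions for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label sets (hypothetical values) used to fit the binarizer.
subclasses = [["A01", "B02"], ["B02", "C03"], ["D04"]]
mlb = MultiLabelBinarizer()
mlb.fit(subclasses)
n_labels = len(mlb.classes_)

# Hypothetical stand-in for predict_proba output: (label index, probability) pairs.
Y_pred = [[(1, 0.9), (0, 0.2)], [(3, 0.7), (2, 0.6)]]

# Collect coordinates and values, then build an (examples x labels) sparse matrix.
rows, cols, vals = [], [], []
for i, row in enumerate(Y_pred):
    for idx, prob in row:
        rows.append(i)
        cols.append(idx)
        vals.append(prob)
Y_pred_csr = csr_matrix(
    (vals, (rows, cols)), shape=(len(Y_pred), n_labels), dtype=np.float32
)

# Keep labels whose probability reaches an arbitrary threshold, then decode.
threshold = 0.5
Y_bin = (Y_pred_csr.toarray() >= threshold).astype(int)
decoded = mlb.inverse_transform(Y_bin)
print(decoded)
```

Note that inverse_transform returns the labels in class order, not in order of probability, so this route suits thresholded label sets better than ranked top-k lists. Densifying with toarray() is fine for a toy example but may be too memory-hungry for a very large label space.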

You can also pass both X and Y to napkinXC directly as dense NumPy arrays; you don't need to convert them to csr_matrix. Since X in your example consists of dense embeddings, it's fine to keep it in that format.

b-analyst commented 1 year ago

Thank you!
