Open b-analyst opened 1 year ago
Hi @b-analyst! Thanks for opening the issue.
The output of the predict method is a list of lists, where each inner list corresponds to one row (example) in X_test.
Y_pred[0] is a list of the k labels (indices of labels as in Y_train) with the highest probabilities for X_test[0], in order of descending probability, so Y_pred[0][0] is the index of the most probable label for example X_test[0].
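To get from those indices back to the original labels, you need whatever mapping you used when building Y_train. A minimal sketch, assuming you kept a list in which position i holds the label for column i of Y_train (label_names and decode_predictions are hypothetical names, not part of napkinXC):

```python
# Hypothetical label vocabulary: position i is the original label
# that column i of Y_train stands for.
label_names = ["sports", "politics", "science", "music"]

# Same shape as the predict output: one list of label indices per row
# of X_test, ordered from most to least probable.
Y_pred = [[2, 0], [3, 1]]


def decode_predictions(pred_rows, names):
    """Replace each predicted column index with its original label."""
    return [[names[i] for i in row] for row in pred_rows]


decoded = decode_predictions(Y_pred, label_names)
print(decoded[0][0])  # most probable label for the first example
```

If the labels were one-hot encoded with a library encoder, the list of classes stored on that encoder would play the role of label_names here.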
You can also use the predict_proba method, which outputs a list of lists of two-element tuples: Y_pred[0][0] = (<index>, <probability>). Then Y_pred[0][0][0] gives the index of the most probable label for X_test[0], and Y_pred[0][0][1] gives its estimated probability.
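The tuple output can be decoded the same way while keeping the probabilities. This is only a sketch with made-up values; label_names and decode_with_probabilities are hypothetical names, and the list is assumed to follow the column order of Y_train:

```python
# Hypothetical label vocabulary, in the column order of Y_train.
label_names = ["sports", "politics", "science", "music"]

# Same shape as the predict_proba output: one list of
# (label index, probability) tuples per row of X_test.
Y_pred = [[(2, 0.91), (0, 0.05)], [(3, 0.60), (1, 0.30)]]


def decode_with_probabilities(pred_rows, names):
    """Swap each label index for its original label, keeping the probability."""
    return [[(names[i], prob) for i, prob in row] for row in pred_rows]


decoded = decode_with_probabilities(Y_pred, label_names)
top_label, top_prob = decoded[0][0]
print(top_label, top_prob)
```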
You can convert both of these outputs to csr_matrix using:
import numpy as np
from napkinxc.datasets import to_csr_matrix

# plt is the trained model object, as named in the quickstart
Y_pred = plt.predict_proba(X_test, top_k=10)
Y_pred = to_csr_matrix(Y_pred, shape=Y_test.shape, sort_indices=True, dtype=np.float32)
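For intuition about what the sparse result holds, here is a pure-Python sketch of the standard CSR layout (data, indices, indptr) built from predict_proba-style output. It only illustrates the format and is not napkinXC's implementation; to_csr_arrays is a hypothetical helper:

```python
def to_csr_arrays(pred_rows, sort_indices=True):
    """Build the three CSR arrays from rows of (column index, value) tuples."""
    data, indices, indptr = [], [], [0]
    for row in pred_rows:
        # Optionally order the entries of each row by column index.
        pairs = sorted(row) if sort_indices else list(row)
        for col, value in pairs:
            indices.append(col)
            data.append(value)
        # indptr[r + 1] marks where row r ends in data and indices.
        indptr.append(len(indices))
    return data, indices, indptr


# Made-up predictions for two examples.
Y_pred = [[(2, 0.9), (0, 0.4)], [(1, 0.7)]]
data, indices, indptr = to_csr_arrays(Y_pred)
print(data, indices, indptr)
```

Row r of the matrix is stored in data[indptr[r]:indptr[r + 1]], with the matching column indices in the same slice of indices.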
You can also pass both X and Y to napkinXC directly as dense numpy arrays; you don't need to convert them to csr_matrix. Since X in your example consists of dense embeddings, it's fine to keep it in that format.
Thank you!
I'm trying to experiment with NapkinXC with a custom XMLC dataset, but I'm unsure how to prepare the input text and labels, and how to decode the output. Currently, I have the following code to prepare text embeddings and one-hot encoded labels:
Label preparation
Text embeddings
Then I proceed to convert these vectors to csr_matrices via:
Training
I follow the quickstart like so:
This code runs, but I'm not sure how to interpret the result. Y_pred returns a list of lists containing integers (e.g. [[2316, 1056, 1691, 1690, 2322, 1064, 2315, 2301, 1714, 2302]]) and I'm unable to decode this to the original labels. Am I doing the data preparation correctly? How should I go about decoding the output labels? Thank you.