Data files format - Githubissues

trpstra commented 2 years ago

Hi,

Thanks for publishing this, it looks very interesting.

I would like to try it out on a different data set, but I could not figure out the format and contents of the various data files. Could you please explain the data format a bit? The contents of most files I can deduce from the header line, but the contents of the new data files like Xf.txt, Yf.txt, Y_Yf.txt and such are not immediately obvious. Also, what are the __label__ and __parent__ entries in Yf.txt?

Thanks in advance!

mfakaehler commented 2 years ago

I wondered the same thing. Here is what I've been able to figure out by comparing the files and Nilesh's paper on ZestXML:

Xf is an index of all bigrams appearing in the entire text corpus. Dimensionality C (cf. the paper for notation).
(trn/tst/val)_X_Xf is a sparse tf-idf representation of the text corpus. Rows correspond to documents, columns refer to the bigram index Xf, and entries have the form col-position:tf-idf-weight. Dimensionality is |trn| x C, |tst| x C and |val| x C, respectively.
Yf is an index of labels, bigrams of labels and whatever __parent__ means. Dimensionality D.
Y_Yf is a sparse matrix holding the tf-idf featurization of the labels, again as col-position:tf-idf-weight. It contains the vector z_l for each label l. Rows correspond to labels, column positions refer to the index file Yf. Dimensionality is L x D.
trn_Y_Yf is similar to Y_Yf, but would contain only the feature vectors for labels seen during training.
(trn/tst/val)_X_Y is a sparse matrix representation of the document-label assignments (ground truth), each with L columns and one row per document in the trn/tst/val split.

Still, I wasn't able to figure out the meaning of __parent__ in Yf.
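
For anyone who wants to inspect these files, here is a minimal Python sketch of a reader for the sparse layout described above. It rests on assumptions the thread does not confirm: that each matrix file starts with a header line "<num_rows> <num_cols>", that every following line is one row of space-separated col:weight pairs, and that column indices are 0-based. The function name parse_sparse_matrix, the toy feature names and the "_" bigram joiner are made up for illustration.

```python
from typing import Dict, List, Tuple


def parse_sparse_matrix(lines: List[str]) -> Tuple[int, int, List[Dict[int, float]]]:
    """Parse a sparse matrix given as text lines.

    Assumed layout (not confirmed by the repo): a header line
    "<num_rows> <num_cols>", then one line per row holding
    space-separated "col:weight" pairs with 0-based column indices.
    """
    num_rows, num_cols = (int(tok) for tok in lines[0].split())
    rows: List[Dict[int, float]] = []
    for line in lines[1:]:
        row: Dict[int, float] = {}
        for pair in line.split():
            col, weight = pair.split(":")
            row[int(col)] = float(weight)
        rows.append(row)
    if len(rows) != num_rows:
        raise ValueError(f"header announces {num_rows} rows, found {len(rows)}")
    return num_rows, num_cols, rows


# Toy stand-ins for Xf.txt (one feature per line) and trn_X_Xf.txt.
xf = ["data", "file", "data_file", "format"]
text = "2 4\n0:0.5 2:0.7\n1:1.0 3:0.25"

n_rows, n_cols, rows = parse_sparse_matrix(text.splitlines())

# Map column positions back to feature names via the Xf index.
named = [{xf[col]: weight for col, weight in row.items()} for row in rows]
print(named)
```

If the real files use 1-based columns or a different header, only the parsing lines need to change; the lookup through the index file stays the same.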

nilesh2797 commented 2 years ago

Apologies for the delayed response, and also thanks for your interest in the paper. Most of @mfakaehler's observations are correct but I'll add my response here as well:

Xf.txt: all features used in the tf-idf representation of documents ((trn/tst/val)_X_Xf); the ith line is the ith feature in the tf-idf representation. For the datasets used in the paper these are the stemmed unigram and bigram features of the documents, but you can choose any set of features depending on your application.
Yf.txt: similar to Xf.txt, but it lists the features of all labels. In addition to unigrams and bigrams, we also add a unique feature specific to each label (represented by __label__<i>__<label-i-text>; this feature is present only in the ith label's features). This allows the model to have label-specific parameters and helps it do well on many-shot labels. Features with __parent__ in them are specific to the GZ-EURLex-4.3K dataset, because the raw labels in this dataset carry some additional information about the parent concepts of each label; you can safely ignore these features for any other/new dataset.
Y_Yf.txt: similar to (trn/tst/val)_X_Xf.txt but for labels; this is the sparse tf-idf feature matrix of the labels.

I hope this clears your doubt, I'll try to add details of the data format and a data processing script in the repo. Let me know if there are any more questions.
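
Until a processing script is available, the label featurization described above could be sketched roughly as follows. This is not the repo's code: the whitespace tokenization, the lowercasing, the "_" bigram joiner, the 0-based label index and the missing stemming step are all assumptions, and label_features is a hypothetical helper. Only the __label__<i>__<label-i-text> pattern comes from the description above.

```python
from typing import List


def label_features(index: int, text: str) -> List[str]:
    """Return unigram, bigram and one label-specific feature for a label.

    Assumptions: whitespace tokens, lowercased, no stemming, bigrams
    joined by "_", 0-based label index.
    """
    tokens = text.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    # Unique feature that occurs only in this label's feature list.
    unique = f"__label__{index}__{text}"
    return tokens + bigrams + [unique]


feats_0 = label_features(0, "data protection")
feats_1 = label_features(1, "data processing")
print(feats_0)
print(feats_1)

# Features shared by both labels; the __label__ features never overlap.
shared = set(feats_0) & set(feats_1)
print(shared)
```

The two labels share the unigram "data", so a model can transfer weight between them, while each __label__ feature gives the model a parameter tied to a single label.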

mfakaehler commented 2 years ago

Thank you for your reply! Interesting how you added the label hierarchy to the tf-idf feature vector. I think there are plenty of other datasets that might benefit from a similar treatment. Have you published any details on that somewhere?

trpstra commented 2 years ago

Thanks @nilesh2797 and @mfakaehler, things make more sense to me now. Cheers!

trpstra commented 2 years ago

Maybe one more question, if I may. Could you please also explain what concepts the hyper-parameters bs_count, bs_alpha, bs_direct_wt, score_alpha correspond to in the paper?

Thanks a lot.

nilesh2797 commented 2 years ago

> Thank you for your reply! Interesting how you added the label hierarchy to the tf-idf feature vector. I think there are plenty of other datasets that might benefit from a similar treatment. Have you published any details on that somewhere?

I don't think it's published anywhere, since not many XML datasets have this information and this way of adding hierarchy information is a bit ad hoc. I would be interested to know if you're aware of any more such datasets with a given label hierarchy.

nilesh2797 commented 2 years ago

> Maybe one more question, if I may. Could you please also explain what concepts the hyper-parameters bs_count, bs_alpha, bs_direct_wt, score_alpha correspond to in the paper?
>
> Thanks a lot.

bs_count is the row-wise model sparsity parameter (K in the paper). Increasing K generally improves model performance, but at the cost of increased training and prediction time (40 is a good default).
bs_alpha and bs_direct_wt are hyperparameters used to get the initial approximation of the w matrix. Both lie between 0 and 1; you can try tuning bs_direct_wt for your dataset, but the default bs_alpha should give good results.
score_alpha is a prediction-time hyperparameter that determines how much weight is given to model scores vs. exact token match scores (on some datasets it helps to give additional weight to exact input-output token matches). It's usually kept between 0.8 and 0.9.
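
To make two of these knobs concrete, here is a small illustrative Python sketch. It is not the repo's implementation. It assumes that row-wise sparsity means keeping the K largest-magnitude entries of each row, and that score_alpha blends the two scores as a simple convex combination; the actual code may select and combine differently. The names keep_top_k and blend_scores are hypothetical.

```python
from typing import Dict


def keep_top_k(row: Dict[int, float], k: int) -> Dict[int, float]:
    """Keep the k largest-magnitude entries of one sparse row.

    Illustrates what a row-wise sparsity budget K (bs_count) means;
    how the real model picks its K entries is an assumption here.
    """
    top = sorted(row.items(), key=lambda item: abs(item[1]), reverse=True)[:k]
    return dict(top)


def blend_scores(model_score: float, match_score: float, score_alpha: float) -> float:
    """Assumed convex combination of model and exact-match scores."""
    if not 0.0 <= score_alpha <= 1.0:
        raise ValueError("score_alpha is expected to lie in [0, 1]")
    return score_alpha * model_score + (1.0 - score_alpha) * match_score


sparse_row = keep_top_k({0: 0.1, 1: -0.9, 2: 0.5}, k=2)
print(sparse_row)

# With score_alpha = 0.5 both scores count equally.
blended = blend_scores(model_score=0.5, match_score=1.0, score_alpha=0.5)
print(blended)
```

Under this reading, a score_alpha of 0.8 to 0.9 means the model score dominates and the exact token match acts as a small correction.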

mfakaehler commented 2 years ago

> > Thank you for your reply! Interesting how you added the label hierarchy to the tf-idf feature vector. I think there are plenty of other datasets that might benefit from a similar treatment. Have you published any details on that somewhere?
>
> I don't think it's published anywhere since not many XML datasets have this information and this way of adding hierarchy information is a bit ad-hoc. I would be interested to know if you're aware of any more such datasets with given label-hierarchy

This article [1] has some details on modeling datasets with a label hierarchy. Of course, EURLEX57K was already part of your study. MIMIC-III with the international ICD-9 hierarchy is probably interesting. But these ICD-Codes probably wouldn't benefit from your approach employing label features.

[1] Chalkidis, I., Fergadiotis, M., Kotitsas, S., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels. 7503–7515. https://doi.org/10.18653/v1/2020.emnlp-main.607

nilesh2797 / zestxml

Data files format #1