trpstra closed this issue 2 years ago
I wondered the same thing. Here is what I've been able to figure out by comparing the files and Nilesh's paper on ZestXML:
__parent__
means. Dimensionality D
Still, I wasn't able to figure out the meaning of __parent__ in Yf.
Apologies for the delayed response, and also thanks for your interest in the paper. Most of @mfakaehler's observations are correct but I'll add my response here as well:
Xf.txt: all features used in the tf-idf representation of documents ((trn/tst/val)_X_Xf); the ith line denotes the ith feature in the tf-idf representation. In particular, for the datasets used in the paper, these are the stemmed unigram and bigram features of the documents, but you can choose any set of features depending on your application.
Yf.txt: similar to Xf.txt, it represents the features of all labels. In addition to unigrams and bigrams, we also add a unique feature specific to each label (represented by __label__<i>__<label-i-text>; this feature is only present in the ith label's features). This allows the model to have label-specific parameters and helps it do well on many-shot labels. Features with __parent__ in them are specific to the GZ-EURLex-4.3K dataset, because the raw labels in this dataset carry some additional information about the parent concepts of each label; you can safely ignore these features for any other/new dataset.
Y_Yf.txt: similar to (trn/tst/val)_X_Xf.txt but for labels; this is the sparse tf-idf feature matrix of the labels.
I hope this clears up your doubts. I'll try to add details of the data format and a data processing script to the repo. Let me know if there are any more questions.
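For anyone preparing these files for a new dataset, here is a minimal sketch of how a Yf.txt-style feature list could be assembled from label texts. It is not the preprocessing used for the paper: plain whitespace tokenization stands in for stemming, the raw label text is appended to the __label__ feature as-is (the exact formatting in the released files may differ), tf-idf weighting is left out, and the function names ngrams and build_label_features are made up for illustration.

```python
def ngrams(text):
    """Lowercased unigrams plus bigrams (assumption: whitespace tokens, no stemming)."""
    tokens = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def build_label_features(label_texts):
    """Return (feature_names, rows).

    feature_names[j] is what would go on line j of a Yf.txt-like file.
    rows[i] lists the feature indices that are active for label i.
    """
    vocab = {}  # feature name -> index; dicts keep insertion order
    rows = []
    for i, text in enumerate(label_texts):
        feats = ngrams(text)
        # Unique per-label feature, so each label gets its own parameters.
        feats.append(f"__label__{i}__{text}")
        row = {vocab.setdefault(f, len(vocab)) for f in feats}
        rows.append(sorted(row))
    return list(vocab), rows


labels = ["air pollution", "water pollution"]
feature_names, label_rows = build_label_features(labels)
for name in feature_names:
    print(name)
```

The shared token "pollution" ends up in both rows, while each __label__ feature appears in exactly one row. A real pipeline would still need to turn the rows into tf-idf weights and write them in whatever sparse format the repo expects.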
Thank you for your reply! Interesting how you added the label hierarchy to the tf-idf feature vector. I think there are plenty of other datasets that might benefit from a similar treatment. Have you published any details on that somewhere?
Thanks @nilesh2797 and @mfakaehler, things make more sense to me now. Cheers!
Maybe one more question, if I may. Could you please also explain what concepts the hyper-parameters bs_count, bs_alpha, bs_direct_wt, score_alpha correspond to in the paper?
Thanks a lot.
> Thank you for your reply! Interesting how you added the label hierarchy to the tf-idf feature vector. I think there are plenty of other datasets that might benefit from a similar treatment. Have you published any details on that somewhere?
I don't think it's published anywhere, since not many XML datasets have this information and this way of adding hierarchy information is a bit ad hoc. I would be interested to know if you're aware of any more such datasets with a given label hierarchy.
> Maybe one more question, if I may. Could you please also explain what concepts the hyper-parameters bs_count, bs_alpha, bs_direct_wt, score_alpha correspond to in the paper? Thanks a lot.
bs_count is the row-wise model sparsity parameter (K in the paper). Increasing K generally improves model performance, but at the cost of increased training and prediction time (40 is a good default).
bs_alpha and bs_direct_wt are hyperparameters used to get the initial approximation for the w matrix (both are between 0 and 1; you can try tuning bs_direct_wt for your dataset, but the default bs_alpha should give good results).
score_alpha is a prediction-time hyperparameter that determines how much weight is given to model scores vs. exact token match scores (on some datasets it helps to give additional weight to exact input-output token matches). It's usually kept between 0.8 and 0.9.
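To make two of these knobs concrete, here is a rough sketch of what row-wise sparsity and score blending could look like. Both functions are illustrative assumptions, not code from the repo: keep_top_k assumes that the K entries kept per row are the largest in magnitude, and blend_scores assumes that score_alpha enters as a simple convex combination of the two scores. The actual implementation may differ on both points.

```python
def keep_top_k(row, k):
    """Keep the k largest-magnitude entries of a sparse row given as {column: weight}.

    Assumption: magnitude is the selection criterion; k plays the role of bs_count.
    """
    top = sorted(row.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]
    return dict(top)


def blend_scores(model_score, match_score, score_alpha=0.85):
    """Weighted mix of a model score and an exact token match score.

    Assumption: a convex combination, with score_alpha weighting the model score.
    """
    return score_alpha * model_score + (1.0 - score_alpha) * match_score


row = {0: 0.1, 1: -0.9, 2: 0.5}
sparse_row = keep_top_k(row, k=2)
final_score = blend_scores(model_score=0.6, match_score=1.0, score_alpha=0.8)
print(sparse_row, final_score)
```

With a larger k, more entries per row survive, which matches the trade-off described above: more parameters to store and multiply, in exchange for a closer fit.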
> I don't think it's published anywhere since not many XML datasets have this information and this way of adding hierarchy information is a bit ad-hoc. I would be interested to know if you're aware of any more such datasets with given label-hierarchy
This article [1] has some details on modeling datasets with a label hierarchy. Of course, EURLEX57K was already part of your study. MIMIC-III with the international ICD-9 hierarchy is probably interesting. But these ICD codes probably wouldn't benefit from your approach employing label features.
[1] Chalkidis, I., Fergadiotis, M., Kotitsas, S., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels. 7503–7515. https://doi.org/10.18653/v1/2020.emnlp-main.607
Hi,
Thanks for publishing this, it looks very interesting.
I would like to try it out on a different data set, but I could not figure out the format and contents of the various data files. Could you please explain the data format a bit? The contents of most files I can deduce from the header line, but the contents of the new data files like Xf.txt, Yf.txt, Y_Yf.txt and such are not immediately obvious. Also, what are the __label__ and __parent__ entries in Yf.txt?
Thanks in advance!