nilesh2797 / zestxml

This is the official codebase for the KDD 2021 paper "Generalized Zero-Shot Extreme Multi-Label Learning"
BSD 3-Clause "New" or "Revised" License

Data files format #1

Closed: trpstra closed this issue 2 years ago

trpstra commented 2 years ago

Hi,

Thanks for publishing this, it looks very interesting.

I would like to try it out on a different data set, but I could not figure out the format and contents of the various data files. Could you please explain the data format a bit? The contents of most files I can deduce from the header line, but the contents of the new data files like Xf.txt, Yf.txt, Y_Yf.txt and such are not immediately obvious. Also, what are the __label__ and __parent__ entries in Yf.txt?

Thanks in advance!
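For readers who want to inspect such files themselves, here is a minimal sketch of a reader for a sparse text matrix with a header line. The layout is an assumption, not something confirmed in this thread: the first line is taken to hold `<num_rows> <num_cols>`, and each following line to hold space-separated `<col_index>:<value>` pairs for one row. The function name `read_sparse_file` is hypothetical and is not part of the zestxml repository.

```python
from typing import Dict, List, Tuple


def read_sparse_file(path: str) -> Tuple[int, int, List[Dict[int, float]]]:
    """Read a sparse matrix stored as text.

    Assumed layout (not confirmed by the repository):
      line 1:      "<num_rows> <num_cols>"
      other lines: space-separated "<col_index>:<value>" pairs, one row per line
    """
    with open(path, encoding="utf-8") as f:
        num_rows, num_cols = map(int, f.readline().split())
        rows: List[Dict[int, float]] = []
        for line in f:
            row: Dict[int, float] = {}
            for pair in line.split():
                col, val = pair.split(":")
                if not 0 <= int(col) < num_cols:
                    raise ValueError(f"column index {col} outside 0..{num_cols - 1}")
                row[int(col)] = float(val)
            rows.append(row)  # an empty line is a row with no non-zero entries
    if len(rows) != num_rows:
        raise ValueError(f"header says {num_rows} rows, found {len(rows)}")
    return num_rows, num_cols, rows
```

Checking the header against the actual row count and column indices is a quick way to tell whether a given file really follows this layout before feeding it to anything else.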

mfakaehler commented 2 years ago

I wondered the same thing. Here is what I've been able to figure out by comparing the files and Nilesh's paper on ZestXML:

Still, I wasn't able to figure out the meaning of __parent__ in Yf.

nilesh2797 commented 2 years ago

Apologies for the delayed response, and also thanks for your interest in the paper. Most of @mfakaehler's observations are correct but I'll add my response here as well:

I hope this clears up your doubts. I'll try to add details of the data format and a data processing script to the repo. Let me know if there are any more questions.

mfakaehler commented 2 years ago

Thank you for your reply! Interesting how you added the label hierarchy to the tf-idf feature vector. I think there are plenty of other datasets that might benefit from a similar treatment. Have you published any details on that somewhere?
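As a rough illustration of the idea mentioned here, the sketch below adds hierarchy information to label features by appending marker tokens to each label's bag of words before tf-idf weighting. Everything in it is an assumption made for illustration: the token naming (`__label__<id>`, `__parent__<id>`), the smoothed idf formula, the toy labels, and the function names are hypothetical and are not taken from the zestxml preprocessing.

```python
import math
from collections import Counter
from typing import Dict, List


def label_token_bags(label_text: Dict[str, str],
                     parents: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Build one token list per label: text tokens plus hierarchy markers.

    The marker naming is a hypothetical choice for this sketch.
    """
    bags: Dict[str, List[str]] = {}
    for label, text in label_text.items():
        tokens = text.lower().split()
        tokens.append(f"__label__{label}")  # identity token for the label itself
        tokens.extend(f"__parent__{p}" for p in parents.get(label, []))
        bags[label] = tokens
    return bags


def tfidf(bags: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """L2-normalised tf-idf vectors with a smoothed idf (assumed weighting)."""
    n = len(bags)
    df = Counter(tok for toks in bags.values() for tok in set(toks))
    vectors: Dict[str, Dict[str, float]] = {}
    for label, toks in bags.items():
        tf = Counter(toks)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors[label] = {t: v / norm for t, v in vec.items()}
    return vectors


# Toy hierarchy: "a" and "b" are both children of "root".
label_text = {"root": "law", "a": "contract law", "b": "tax policy"}
parents = {"a": ["root"], "b": ["root"]}
vectors = tfidf(label_token_bags(label_text, parents))
```

In this toy setup, labels "a" and "b" have no text tokens in common, yet their vectors overlap on the shared `__parent__root` feature, which is the effect one would hope for from encoding the hierarchy this way.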

trpstra commented 2 years ago

Thanks @nilesh2797 and @mfakaehler, things make more sense to me now. Cheers!

trpstra commented 2 years ago

Maybe one more question, if I may. Could you please also explain what concepts the hyper-parameters bs_count, bs_alpha, bs_direct_wt, score_alpha correspond to in the paper?

Thanks a lot.

nilesh2797 commented 2 years ago

> Thank you for your reply! Interesting how you added the label hierarchy to the tf-idf feature vector. I think there are plenty of other datasets that might benefit from a similar treatment. Have you published any details on that somewhere?

I don't think it's published anywhere, since not many XML datasets have this information and this way of adding hierarchy information is a bit ad hoc. I would be interested to know if you're aware of any more such datasets with a given label hierarchy.

nilesh2797 commented 2 years ago

> Maybe one more question, if I may. Could you please also explain what concepts the hyper-parameters bs_count, bs_alpha, bs_direct_wt, score_alpha correspond to in the paper?
>
> Thanks a lot.

mfakaehler commented 2 years ago

>> Thank you for your reply! Interesting how you added the label hierarchy to the tf-idf feature vector. I think there are plenty of other datasets that might benefit from a similar treatment. Have you published any details on that somewhere?

> I don't think it's published anywhere, since not many XML datasets have this information and this way of adding hierarchy information is a bit ad hoc. I would be interested to know if you're aware of any more such datasets with a given label hierarchy.

This article [1] has some details on modeling datasets with a label hierarchy. Of course, EURLEX57K was already part of your study. MIMIC-III with the international ICD-9 hierarchy is probably interesting, but these ICD codes probably wouldn't benefit from your approach, which relies on label features.

[1] Chalkidis, I., Fergadiotis, M., Kotitsas, S., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels. In Proceedings of EMNLP 2020, 7503–7515. https://doi.org/10.18653/v1/2020.emnlp-main.607