Open shashankg7 opened 4 years ago
Hi @shashankg7, the dataset format is the fastText data format with a few extensions:
__label__<label 1 name> __label__<label 2 name> __label__<label 3 name...> <word 1> <word2> <word3...>
It is possible to add a weight for each word by adding the -wordsWeights option and using the following format:
__label__<label 1 name> __label__<label 2 name> __label__<label 3 name...> <word 1>:<word 1 weight> <word2>:<word 2 weight> <word3...>:<word 3 weight...>
See the xml_experiments directory for some examples. run_EURLex-4K.sh is the smallest of all the datasets.
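For anyone converting their own data, here is a minimal sketch of how lines in the format described above could be generated. The helper name `format_line` is hypothetical and not part of the tool, and it assumes labels and words contain no whitespace and that weights are written as plain numbers; how the tool parses weights exactly is not confirmed here.

```python
def format_line(labels, words, weights=None):
    """Build one line: __label__ prefixed labels first, then the words.

    If weights are given, each word is written as word:weight
    (the layout used together with the -wordsWeights option).
    """
    label_part = [f"__label__{label}" for label in labels]
    if weights is None:
        word_part = list(words)
    else:
        if len(weights) != len(words):
            raise ValueError("need exactly one weight per word")
        word_part = [f"{word}:{weight}" for word, weight in zip(words, weights)]
    return " ".join(label_part + word_part)


# Multi-label example without weights
print(format_line(["sports", "news"], ["football", "match", "report"]))
# Same idea with per-word weights (assumed to be plain floats)
print(format_line(["sports"], ["football", "match"], [0.7, 0.3]))
```

Writing one such line per example to a text file should give a file in the layout described above.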
Thanks a lot @mwydmuch for your reply.
I am able to run the code with the format you have described. Thanks!
I have one question. I am trying out your model on a custom multi-label short text classification task (average word length of ~4). The number of labels is on the order of 3.5K.
I am using the 'plt' loss function with the number of dimensions in [200, 300, 500]. I have tried different numbers of epochs and have also varied the char n-gram sizes.
But I am not able to get good results compared to fastText.
Any suggestions on where I might be going wrong, or what else I could try?
Thanks
Hi,
What is the dataset format expected for multi-label classification?