mwydmuch / extremeText

Library for fast text representation and extreme classification.

Dataset format expected #9

Open shashankg7 opened 4 years ago

shashankg7 commented 4 years ago

Hi,

What is the dataset format expected for multi-label classification?

mwydmuch commented 4 years ago

Hi @shashankg7, the dataset format is the fastText data format with a few extensions:

__label__<label 1 name> __label__<label 2 name> __label__<label 3 name...> <word 1> <word2> <word3...>
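As a rough illustration of the line format above, the following Python sketch builds one training line from a list of label names and a list of words. The helper name `format_example` is hypothetical and not part of extremeText, and the sketch assumes that labels and words must not contain whitespace.

```python
def format_example(labels, words):
    """Build one line: __label__<name> ... <word> ... (hypothetical helper)."""
    labels = list(labels)
    words = list(words)
    # Assumption: tokens are separated by whitespace, so none may contain it.
    for token in labels + words:
        if not token or any(ch.isspace() for ch in token):
            raise ValueError(f"token must be non-empty and whitespace-free: {token!r}")
    label_part = [f"__label__{name}" for name in labels]
    return " ".join(label_part + words)


if __name__ == "__main__":
    # Prints: __label__sports __label__football world cup final
    print(format_example(["sports", "football"], ["world", "cup", "final"]))
```

Writing one such line per example to a text file should give a file in the format described above.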

It is possible to add a weight for each word by adding the -wordsWeights option and using the following format:

__label__<label 1 name> __label__<label 2 name> __label__<label 3 name...> <word 1>:<word 1 weight> <word2>:<word 2 weight> <word3...>:<word 3 weight...>
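A minimal Python sketch of producing a line in this weighted format follows. The helper name `format_weighted_example` is hypothetical, not an extremeText function, and the sketch assumes that words contain neither whitespace nor `:` so that each `<word>:<weight>` pair stays unambiguous.

```python
def format_weighted_example(labels, weighted_words):
    """Build one weighted line from labels and (word, weight) pairs.

    Hypothetical helper; assumes words contain no whitespace and no ':'.
    """
    label_part = [f"__label__{name}" for name in labels]
    word_part = []
    for word, weight in weighted_words:
        if ":" in word or any(ch.isspace() for ch in word):
            raise ValueError(f"word must not contain ':' or whitespace: {word!r}")
        word_part.append(f"{word}:{weight}")
    return " ".join(label_part + word_part)


if __name__ == "__main__":
    # Prints: __label__sports world:0.5 cup:2
    print(format_weighted_example(["sports"], [("world", 0.5), ("cup", 2)]))
```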

See the xml_experiments directory for some examples. run_EURLex-4K.sh covers the smallest of the datasets.

shashankg7 commented 4 years ago

Thanks a lot @mwydmuch for your reply.

I am able to run the code with the format you have described. Thanks!

I have one question. I am trying out your model on a custom multi-label short-text classification task (average word length of ~4). The number of labels is on the order of 3.5K.

I am using the 'plt' loss function with the number of dimensions in [200, 300, 500]. I have tried different numbers of epochs and also varied the char n-gram sizes.

But I am not able to get good results compared to fastText.

Any suggestions as to where I might be going wrong, or what else I could try?

Thanks