ilimugur / short-text-classification

A command line tool for training deep network models for short text classification
MIT License

error training model #2

Open connormeaton opened 4 years ago

connormeaton commented 4 years ago

First off, thanks for the code; this is really great work.

I am having trouble training the model with the command example you provided. I am using this command to train the model:

$ python core.py --model Lee-Dernoncourt --dataset SwDA --embedding GloVe

I am replacing the paths to reflect where I unzipped and stored the SwDA dataset and the GloVe and word2vec embeddings, which looks like this:

$ python core.py --model Lee-Dernoncourt --dataset SwDA swda/data --embedding word_embedding/word2vec/GoogleNews-vectors-negative300.bin

swda/data contains subdirectories 'sw00utt', 'sw01utt', and so on. Running this command yields the following error:

KeyError: '/word_embedding/word2vec/GoogleNews-vectors-negative300.bin'

If I change the command to:

$ python core.py --model Lee-Dernoncourt --dataset SwDA swda/data --embedding word2vec word_embedding/word2vec/GoogleNews-vectors-negative300.bin

Then I get this error:

core.py: error: unrecognized arguments: /word_embedding/word2vec/GoogleNews-vectors-negative300.bin

If you have any ideas on how to proceed, please advise. Thank you very much,

Best, Connor

keyan commented 4 years ago

Per the docs and your comment, the execution command should be:

$ python core.py --model Lee-Dernoncourt \
    --dataset SwDA <path_to_SwDA_dataset_directory> \
    --embedding GloVe <path_to_GloVe_embedding_file>

So, instead of:

--embedding word_embedding/word2vec/GoogleNews-vectors-negative300.bin

You should be using:

--embedding GloVe word_embedding/word2vec/GoogleNews-vectors-negative300.bin

Otherwise you will cause the script to barf here when indexing into the embeddings dict: https://github.com/ilimugur/short-text-classification/blob/77306e3900c1beaf093ed0b515bbf0cc68232861/core.py#L133

That said, you'll notice the embeddings dict has all but 'FastText' commented out: https://github.com/ilimugur/short-text-classification/blob/77306e3900c1beaf093ed0b515bbf0cc68232861/core.py#L17-L21
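To make the failure mode concrete, here is a minimal sketch, not the actual core.py. It assumes the script uses the value given to --embedding as a key into a dict of embedding names. The dict contents and the `resolve_embedding` helper are hypothetical and only illustrate why a file path in that position raises a KeyError.

```python
# Hypothetical stand-in for the embeddings dict in core.py. The real values
# are not reproduced here; only 'FastText' is active, as in the linked lines.
embeddings = {
    # (other entries are commented out in the linked revision)
    'FastText': 'fasttext',
}


def resolve_embedding(name):
    """Look up an embedding by name and fail with a readable message."""
    try:
        return embeddings[name]
    except KeyError:
        available = ', '.join(sorted(embeddings))
        raise ValueError(
            f"unknown embedding {name!r}; available: {available}"
        ) from None


if __name__ == '__main__':
    print(resolve_embedding('FastText'))
    try:
        # A file path where a name is expected triggers the failed lookup.
        resolve_embedding('word_embedding/word2vec/GoogleNews-vectors-negative300.bin')
    except ValueError as err:
        print(err)
```

A guard like this would turn the bare KeyError into a message that lists the names the script currently accepts.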

It's unclear whether there is an issue with the other embeddings or whether they were commented out by accident, but I'd suggest starting with the one embedding that is still enabled, just in case:

$ python core.py \
    --model Lee-Dernoncourt \
    --dataset SwDA swda/data \
    --embedding FastText word_embedding/word2vec/GoogleNews-vectors-negative300.bin

boghrati commented 4 years ago

I have the same issue. It seems the code only supports FastText, so I downloaded the FastText word embeddings and tried the following command:

$ python core.py --model Lee-Dernoncourt --dataset SwDA swda/ --embedding FastText

That gives an error saying --source-language must be provided. I then tried:

!python core.py --model Lee-Dernoncourt --dataset SwDA swda/ --embedding FastText --source-language en

But I get the following error:

error: argument --source-language: expected 3 arguments

I don't know how to handle this at the moment. Let me know if you have any updates!

boghrati commented 4 years ago

@cmeaton here's a quick update: I was able to run the model without errors using the following command. However, the accuracy is extremely low, so it's probably not worth investing in figuring out the code.

$ python core.py --model Lee-Dernoncourt --dataset SwDA swda/swda/ --embedding FastText --source-language en cc.en.300.vec None
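For anyone else landing here, the argument shapes implied by this thread can be sketched with argparse. This is a reconstruction under assumptions, not the parser in core.py: it assumes --dataset takes a name and a path, --embedding takes a single name, and --source-language is required and takes exactly three values. The `build_parser` helper and the `WORKING` constant are hypothetical names, and the thread does not say what the second and third --source-language values mean.

```python
import argparse


def build_parser():
    # Hypothetical parser that mirrors the argument counts seen in this thread.
    parser = argparse.ArgumentParser(prog='core.py')
    parser.add_argument('--model', required=True)
    parser.add_argument('--dataset', nargs=2, metavar=('NAME', 'PATH'),
                        required=True)
    parser.add_argument('--embedding', required=True)
    parser.add_argument('--source-language', nargs=3, required=True)
    return parser


# The command reported above as running without errors, split into tokens.
WORKING = ('--model Lee-Dernoncourt --dataset SwDA swda/swda/ '
           '--embedding FastText '
           '--source-language en cc.en.300.vec None').split()

if __name__ == '__main__':
    args = build_parser().parse_args(WORKING)
    print(args.embedding, args.source_language)
    # Passing only `en` to --source-language makes argparse exit with
    # "argument --source-language: expected 3 arguments".
```

Note that argparse hands over `None` as the literal string 'None'; how core.py interprets that value is not something this sketch can show.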

connormeaton commented 4 years ago

@boqrat Thanks for your comments! That's great you were able to get it running, but bummer on low accuracy, thanks for the heads up. I've since moved on to other things, but if I take another crack at this I'll let you know my progress.