Text Classification in Deep Detect

jolibrain / deepdetect

Deep Learning API and Server in C++14, with support for PyTorch, TensorRT, Dlib, NCNN, TensorFlow, XGBoost and TSNE

https://www.deepdetect.com/

Text Classification in Deep Detect #31

Closed by vasants 9 years ago

vasants commented 9 years ago

Hi,

Great job on making Caffe available as a service. I wanted to find out a few more details about the text classification capabilities in Deep Detect. Are you using an embedding layer (word2vec) or bag of words? Can you point me to the relevant Caffe layers and any article references?

Thanks! V

beniz commented 9 years ago

@vasants thanks.

BOW is built into the 'txt' connector; a tutorial for training from text is available here: http://www.deepdetect.com/tutorials/txt-training/
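For readers unfamiliar with bag of words, here is a minimal sketch of the general technique in plain Python. It is not the 'txt' connector's actual code; the function names `build_vocab` and `bow_vector` and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter


def build_vocab(docs, min_count=1):
    """Map each word seen at least min_count times to a column index."""
    counts = Counter(w for d in docs for w in d.lower().split())
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(words)}


def bow_vector(doc, vocab):
    """Count word occurrences; words outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for w in doc.lower().split():
        idx = vocab.get(w)
        if idx is not None:
            vec[idx] += 1
    return vec


docs = ["the cat sat", "the dog sat down"]
vocab = build_vocab(docs)  # hypothetical toy corpus
vectors = [bow_vector(d, vocab) for d in docs]
```

Each document becomes a vector with one dimension per vocabulary word, which is why the input dimensionality grows with the size of the vocabulary.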

W2V is not yet built-in but can be used as well, though a bit less easily. Here is an application to real data: https://github.com/beniz/quick_cdiscount

In practice my experience is that W2V accuracy is often below that of BOW, and this is corroborated by http://arxiv.org/abs/1509.01626.

However, W2V is very useful in some settings, typically when the dimensionality of BOW is too high to be optimized easily, so I plan to include it in the text connector at some point. Let me know if built-in W2V is a feature of interest.
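To illustrate the dimensionality point, here is a hedged sketch of one common way to use word vectors for classification: averaging them into a fixed-size document vector. The embedding table `toy_embeddings` and the function `average_embedding` are hypothetical; real W2V vectors would be trained separately, and this is not necessarily how a built-in connector would do it.

```python
def average_embedding(doc, embeddings, dim):
    """Average the vectors of known words; return zeros if none are known."""
    vecs = [embeddings[w] for w in doc.lower().split() if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "sat": [0.0, 0.0, 1.0],
}

doc_vec = average_embedding("the cat sat", toy_embeddings, dim=3)
```

The resulting vector has `dim` entries regardless of vocabulary size, whereas a BOW vector needs one entry per vocabulary word.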

vasants commented 9 years ago

@beniz great!

Yep! I would be interested in W2V (at least for comparison purposes), but I will check out the less easy way first and see if there are any benefits. If I do build W2V out, I will send you a pull request.

Regarding Caffe layers: I see you have a custom Caffe version running. Do you use any special layers for processing text? Do you have info on the accuracies you see with standard datasets (e.g. test results for the news20 dataset)?

Just trying to get a feel for the type of convnet implemented and any info regarding that would be helpful.

beniz commented 9 years ago

No special layers yet. Conv1d at character level is coming up for my own purposes, and I'll report on it somewhere.

Built-in text support is all word-based, no n-grams. TF-IDF is implemented, but results are poor, largely due to the lack of rescaling. If you want to test the latter, the rescaling code of the CSV connector could be imported (or even better, reused).
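As a rough illustration of what rescaling TF-IDF features could look like, here is a minimal sketch in plain Python. It assumes min-max scaling of each feature column to [0, 1]; whether that matches the CSV connector's rescaling code is an assumption, and the names `tfidf` and `minmax_rescale` are hypothetical.

```python
import math


def tfidf(docs):
    """Return (vocab, rows) with term frequency times log(N / document frequency)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(tokenized)
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    rows = []
    for toks in tokenized:
        total = len(toks) or 1  # guard against empty documents
        rows.append([toks.count(w) / total * math.log(n / df[w]) for w in vocab])
    return vocab, rows


def minmax_rescale(rows):
    """Scale every column to [0, 1]; constant columns become 0.0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


docs = ["the cat sat", "the dog sat down"]  # hypothetical toy corpus
vocab, rows = tfidf(docs)
scaled = minmax_rescale(rows)
```

Raw TF-IDF values tend to be small and unevenly distributed across features, which can make neural network training harder; bringing every feature into a common range is one standard remedy.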

Results with BOW are on par with random forests and naive Bayes (NB) on a range of mid-size datasets I have built up over the year.

beniz commented 9 years ago

Also, beware that the W2V C++ implementation I pointed you to is GPL-licensed and can't really be linked in as is, unfortunately.

vasants commented 9 years ago

Yup! Thanks!