tomerm opened this issue 5 years ago
I am a bit lost with all the feedback I am getting via various channels, so I decided to summarize it here:
Maturity of model - this is usually estimated based on evaluation metrics (which will of course be calculated). Not sure if you had anything else in mind by "maturity".
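As a concrete illustration of what "evaluation metrics" means here, a minimal sketch using scikit-learn's `classification_report` (the labels and predictions below are invented for the example):

```python
from sklearn.metrics import classification_report

# Hypothetical gold labels and model predictions for a 3-category task
y_true = ["sports", "finance", "sports", "politics", "finance"]
y_pred = ["sports", "finance", "politics", "politics", "finance"]

# Per-category precision / recall / F1, plus overall accuracy
print(classification_report(y_true, y_pred))
```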
Runtime - as we have discussed more than once, we will provide:
The way the training algorithm supports the addition of a new category - since we are talking about supervised learning, each time you train a model (from scratch) you have a well-defined list of categories. If this changes and the new categories are relevant to materials that already participated in the training, there is no way around training a new model from scratch (moreover, you need to revise your training materials and tag the appropriate documents with the new category where applicable). If the new categories are orthogonal to all materials used so far for training, it should not be a problem to continue training the already available model with new materials.
The way the model can undergo additional training. You will see an example with a word embedding model: you will get a word embedding model trained on Wikipedia, which you will be able to train further on your own data sets.
How much time it takes to train a model for a data set of size X - again, this is empirical data which we will share from our side. We can't speculate in general or provide any general guidance here; it simply does not make much sense.
How is it possible to improve the model? What methods can be applied for each algorithm (especially in cases where there is not enough data to perform supervised training)?
If there is not sufficient data, I would suggest considering unsupervised methods as well (even though we know for sure that those are less accurate than supervised ones in this context) - https://towardsdatascience.com/comparing-the-performance-of-non-supervised-vs-supervised-learning-methods-for-nlp-text-805a9c019b82
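For reference, a minimal unsupervised sketch along those lines, clustering unlabeled documents with TF-IDF + k-means in scikit-learn (the corpus is invented, and mapping clusters back to business categories would still be a manual step):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus with no labels (hypothetical)
docs = [
    "the bank raised interest rates",
    "central bank monetary policy",
    "the team won the football match",
    "the striker scored two goals",
]

# Vectorize, then group documents into clusters without any supervision
tfidf = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
print(labels)
```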
There are various approaches to improving the model. For example, using word embeddings usually improves performance by 1-2%. Using a better tokenizer (instead of the open source ones) can also improve results, as can taking additional linguistic features (e.g. POS) into account.
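To illustrate the word-embedding point: one common way to feed embeddings into a classifier is to average a document's word vectors into a single feature vector. A minimal sketch with an invented toy embedding table:

```python
import numpy as np

# Toy embedding table standing in for a pretrained word-embedding model
# (the 2-dimensional vectors are hypothetical)
embeddings = {
    "good": np.array([0.9, 0.1]),
    "bad": np.array([0.1, 0.9]),
    "movie": np.array([0.5, 0.5]),
}

def doc_vector(tokens, emb, dim=2):
    """Average the embeddings of known tokens into one document feature vector."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Mean of [0.9, 0.1] and [0.5, 0.5] -> [0.7, 0.3]
features = doc_vector(["good", "movie"], embeddings)
print(features)
```

The resulting vectors can then be fed to any standard classifier in place of (or alongside) bag-of-words features.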
The articles are very promising. The code / models (for English) are accessible via:
This can definitely be one of the directions we explore for improvement (@semion1956 FYI). However, please note that multi-label classification is not explicitly called out as one of the NLP tasks that will benefit from this approach for sure. Thus there is no guarantee that it will bring improvement.
Regarding SOTA results for some specific NLP tasks, I would suggest looking at:
Uploading an updated version of the review: OverviewDraft_28Oct2018.zip
I deliberately don't elaborate on the algorithms, since it is not clear which ones we will implement / use. Instead, I provide more details first and foremost on the points which are definitely relevant.
Uploading a slightly updated version of the review: OverviewDraft_10Nov2018.zip
Specifically adding this diagram of various word embedding models:
What is being suggested via https://github.com/tomerm/MLClassification/blob/master/wordEmbedding/Create%20Word2Vec%20model.ipynb is a gensim implementation of Word2Vec (based on the CBOW algorithm). Training is done on Arabic Wikipedia + a sample Arabic news data set.
Hi Efrat (@efrathason). A fresh draft version is attached as a Word file. Please share your comments / questions / concerns / reservations. I will do my best to address them in a timely manner.
OverviewDraft_2Oct2018.zip