tomerm opened this issue 5 years ago
I am a bit lost with all the feedback I am getting via various channels, so I decided to summarize it here:
Maturity of model - this is usually estimated based on evaluation metrics (which will of course be calculated). Not sure if you had anything else in mind by "maturity".
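As a concrete illustration of what "evaluation metrics" means here, a minimal sketch using scikit-learn's `classification_report` (the labels and predictions below are invented for the example):

```python
from sklearn.metrics import classification_report

# Hypothetical gold labels and model predictions for a 3-category task
y_true = ["sports", "finance", "sports", "politics", "finance"]
y_pred = ["sports", "finance", "politics", "politics", "finance"]

# Per-category precision / recall / F1, plus overall accuracy
print(classification_report(y_true, y_pred))
```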
Runtime - as we have discussed more than once, we will provide:
The way the training algorithm supports the addition of a new category - since we are talking about supervised learning, each time you train a model (from scratch) you have a well-defined list of categories. If this changes and the new categories are relevant to materials that already participated in the training, there is no way around training a new model from scratch (moreover, you need to revise your training materials and tag the appropriate documents with the new category where applicable). If the new categories are orthogonal to all materials used so far for training, it should not be a problem to continue training the already available model with new materials.
The way the model can undergo additional training. You will see an example with a word embedding model: you will get a word embedding model trained on Wikipedia, which you will be able to train further on your own data sets.
How much time it takes to train a model for a data set of size X - again, this is empirical data which we will share from our side. We can't speculate in general or provide any general guidance here; it simply does not make much sense.
How is it possible to improve the model? What methods can be applied for each algorithm (especially in cases where there is not enough data to perform supervised training)?
If there is not sufficient data, I would suggest considering unsupervised methods as well (even though we know for sure that those are less accurate than supervised ones in this context) - https://towardsdatascience.com/comparing-the-performance-of-non-supervised-vs-supervised-learning-methods-for-nlp-text-805a9c019b82
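For reference, a minimal unsupervised sketch along those lines, clustering unlabeled documents with TF-IDF + k-means in scikit-learn (the corpus is invented, and mapping clusters back to business categories would still be a manual step):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus with no labels (hypothetical)
docs = [
    "the bank raised interest rates",
    "central bank monetary policy",
    "the team won the football match",
    "the striker scored two goals",
]

# Vectorize, then group documents into clusters without any supervision
tfidf = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
print(labels)
```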
There are various approaches to improving the model. For example, using word embeddings usually improves performance by 1-2%. Using a better tokenizer (instead of the open source ones) can also improve results, as can taking additional linguistic features (e.g. POS) into account.
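To illustrate the word-embedding point: one common way to feed embeddings into a classifier is to average a document's word vectors into a single feature vector. A minimal sketch with an invented toy embedding table:

```python
import numpy as np

# Toy embedding table standing in for a pretrained word-embedding model
# (the 2-dimensional vectors are hypothetical)
embeddings = {
    "good": np.array([0.9, 0.1]),
    "bad": np.array([0.1, 0.9]),
    "movie": np.array([0.5, 0.5]),
}

def doc_vector(tokens, emb, dim=2):
    """Average the embeddings of known tokens into one document feature vector."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Mean of [0.9, 0.1] and [0.5, 0.5] -> [0.7, 0.3]
features = doc_vector(["good", "movie"], embeddings)
print(features)
```

The resulting vectors can then be fed to any standard classifier in place of (or alongside) bag-of-words features.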
The articles are very promising. The code / models (for English) are accessible via:
This can definitely be one of the directions we explore for improvement (@semion1956 FYI). However, please note that multi-label classification is not explicitly called out as one of the NLP tasks that will benefit from this approach for sure. Thus there is no guarantee that it will bring improvement.
Regarding SOTA results for some specific NLP tasks, I would suggest looking at:
Uploading an updated version of the review: OverviewDraft_28Oct2018.zip
I deliberately don't elaborate on the algorithms, since it is not clear which ones we will implement / use. Instead, I provide more details first and foremost on the points which are definitely relevant.
Uploading a slightly updated version of the review: OverviewDraft_10Nov2018.zip
Specifically adding this diagram of various word embedding models:
What is being suggested via https://github.com/tomerm/MLClassification/blob/master/wordEmbedding/Create%20Word2Vec%20model.ipynb is a gensim implementation of Word2Vec (based on the CBOW algorithm). Training is done on Arabic Wikipedia + a sample Arabic news data set.
Hi Efrat (@efrathason). A fresh draft version is attached as a Word file. Please share your comments / questions / concerns / reservations. I will do my best to address them in a timely manner.
OverviewDraft_2Oct2018.zip