(1) Use of pre-trained GloVe embeddings. The glove6b.zip has different flavors of dimensions: 50d/100d/200d/300d. The Common Crawl vectors are another option that can be explored (see the loading sketch after this list). Reference: https://nlp.stanford.edu/projects/glove/
(2) BERT is another pre-trained model that can be explored - it has various flavors such as BERT-Base and BERT-Large.
(3) RoBERTa by Facebook is another option. RoBERTa removes the Next Sentence Prediction (NSP) task from BERT’s pre-training and introduces dynamic masking, so that the masked token changes during the training epochs. RoBERTa uses 160 GB of text for pre-training, including the 16 GB of Books Corpus and English Wikipedia used in BERT. The additional data included the CommonCrawl News dataset (63 million articles, 76 GB), a Web text corpus (38 GB) and Stories from Common Crawl (31 GB). This, coupled with a whopping 1,024 Tesla V100 GPUs running for a day, is what the pre-training of RoBERTa took.
(4) ELMo is yet another option that can be explored. Developed in 2018 by AllenNLP, it goes beyond traditional embedding techniques: it uses a deep, bi-directional LSTM model to create word representations. Rather than a dictionary of words and their corresponding vectors, ELMo analyzes words within the context in which they are used. It is also character-based, allowing the model to form representations of out-of-vocabulary words.
ELMo's emphasis on context should ideally make it a better performer than GloVe.
(5) XLNet is another option
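Below is a minimal sketch of how the pre-trained GloVe vectors mentioned in (1) could be loaded, assuming the `glove.6B.100d.txt` file from glove6b.zip has been downloaded; the tokenizer's `word_index` and the downstream model are not shown and are only assumed here.

```python
import numpy as np

EMBEDDING_DIM = 100  # matches the 100d flavor of glove.6B

# Load the pre-trained vectors into a {word: vector} lookup
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

def build_embedding_matrix(word_index, embedding_dim=EMBEDDING_DIM):
    """word_index: {token: integer id} produced by whatever tokenizer is used."""
    matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:      # words missing from GloVe stay all-zero
            matrix[i] = vector
    return matrix
```

The resulting matrix can initialise an embedding layer of a neural classifier, either frozen or fine-tuned during training.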
fyi @asjad248 @mdkamal07
Another experiment could be the type of model to be used (a comparison sketch follows after the list):-
(1) SVM (2) Random Forest (3) Naive Bayes (4) RNN (5) GRU (6) LSTM
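A hedged sketch of how the traditional candidates could be compared on the same TF-IDF features with scikit-learn; the toy tickets and groups below are placeholders for the real dataset, and weighted F1 is just one reasonable scoring choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Toy stand-ins for the real ticket descriptions and assignment groups
texts = ["vpn keeps dropping", "cannot connect to vpn", "router port down", "wifi not reachable",
         "reset my password", "account locked out", "password expired", "need access to shared drive"]
labels = ["network", "network", "network", "network",
          "access", "access", "access", "access"]

candidates = {
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=200),
    "Naive Bayes": MultinomialNB(),
}

# Same vectorizer, different classifier, same cross-validation protocol
for name, clf in candidates.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))), ("clf", clf)])
    scores = cross_val_score(pipe, texts, labels, cv=2, scoring="f1_weighted")
    print(f"{name}: mean weighted F1 = {scores.mean():.3f}")
```

The RNN/GRU/LSTM candidates would need a separate embedding-plus-sequence pipeline rather than TF-IDF features.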
fyi @asjad248 @mdkamal07
See how model performance varies if the input is the short description, the long description, or a combination of both.
Combine the different groups into sub-groups based on clustering and compare the model performance.
Explore how traditional and complex models perform with and without stop words (a sketch of these experiments follows).
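A minimal sketch of how the input-field and stop-word experiments could be organised around one pipeline; the dataframe `df` and its columns `short_description`, `long_description` and `group` are assumptions about the ticket data, with toy rows used here so the snippet runs.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Toy stand-in for the ticket dataframe; the column names are assumptions
df = pd.DataFrame({
    "short_description": ["vpn down", "password reset", "wifi slow", "account locked"] * 2,
    "long_description":  ["vpn drops every few minutes", "need my password reset today",
                          "office wifi is very slow", "my account got locked again"] * 2,
    "group": ["network", "access", "network", "access"] * 2,
})

input_variants = {
    "short only":   df["short_description"],
    "long only":    df["long_description"],
    "short + long": df["short_description"] + " " + df["long_description"],
}

# Cross each input variant with stop-word handling, keeping the model fixed
for input_name, texts in input_variants.items():
    for stop_words in (None, "english"):
        pipe = Pipeline([("tfidf", TfidfVectorizer(stop_words=stop_words)),
                         ("clf", MultinomialNB())])
        score = cross_val_score(pipe, texts, df["group"], cv=2,
                                scoring="f1_weighted").mean()
        print(f"{input_name:12s} | stop words removed: {stop_words is not None} | F1 = {score:.3f}")
```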
Approaches towards automatic ticket classification:-
Though this may look simpler for a small dataset, there are disadvantages to this approach:-
There are various ways to convert text into vectors - from basic techniques such as One-Hot Encoding and TF-IDF to complex embeddings using GloVe/BERT.
The vectors are then fed into a classification model. Again, this could be a traditional ML classifier such as SVC or Naive Bayes, or a neural-network-based model such as a GRU or LSTM (a minimal neural sketch follows).
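As one possible shape of the neural-network branch, here is a minimal Keras sketch with an embedding layer followed by an LSTM; the vocabulary size, sequence length and number of groups are placeholders, and the embedding weights could be initialised from the GloVe matrix sketched earlier.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20_000   # placeholder: size of the fitted tokenizer vocabulary
MAX_LEN = 100         # placeholder: padded ticket length in tokens
NUM_GROUPS = 25       # placeholder: number of assignment groups

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    # The embedding weights could be initialised from the GloVe matrix built earlier
    layers.Embedding(VOCAB_SIZE, 100),
    layers.LSTM(64),
    layers.Dense(NUM_GROUPS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```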
There are certain limitations though:
fyi.. @asjad248 @mdkamal07 @LakshmiDadarkar
Performance Metrics
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1 score = 2 ∗ (Precision ∗ Recall) / (Precision + Recall)

where:
• TP (True Positive): examples predicted to be positive that are positive;
• TN (True Negative): examples predicted to be negative that are negative;
• FP (False Positive): examples predicted to be positive but are negative;
• FN (False Negative): examples predicted to be negative but are positive.
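The same metrics expressed in code with scikit-learn; for the multi-class ticket-group problem a weighted average is one reasonable choice, and the labels below are toy placeholders.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true, y_pred: actual vs. predicted assignment groups (toy placeholders)
y_true = ["network", "network", "database", "access", "access"]
y_pred = ["network", "database", "database", "access", "network"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```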
Naive Bayes Classifier:- A multinomial Naive Bayes classifier can be used to classify tickets into different groups. The model uses the frequency of words to calculate the probability of the group to which a ticket will belong. It does not consider the context of the statements.
Here, the ‘naive’ assumption is that every word in a sentence is independent of the other ones, and hence Bayes' Theorem can be applied. This assumption will not hold in real-world scenarios. In addition, the multinomial model makes an assumption of positional independence: the position of a term in a document by itself does not carry information about the class. Although there is a difference between "China sues France" and "France sues China", the occurrence of China in position 1 versus position 3 of the document is not useful in NB classification, because the model looks at each term separately. The conditional independence assumption commits to this way of processing the evidence. However, NB models perform well despite the conditional independence assumption.
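A minimal illustration of the multinomial model on plain word counts; the ticket texts and groups are toy placeholders, and `alpha=1.0` (Laplace smoothing) guards against zero probabilities for unseen words.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["vpn not connecting", "reset my password", "password expired",
         "cannot reach vpn gateway"]                    # toy ticket texts
groups = ["network", "access", "access", "network"]     # toy assignment groups

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)                # plain word frequencies, no context
nb = MultinomialNB(alpha=1.0)                           # alpha=1.0 is Laplace smoothing
nb.fit(counts, groups)

print(nb.predict(vectorizer.transform(["vpn password issue"])))
```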
Even if it is not the method with the highest accuracy for text, NB has many virtues that make it a strong contender for text classification. It excels if there are many equally important features that jointly contribute to the classification decision.
It can have decent performance even when using fewer than a dozen terms. The most important indicators for a class are less likely to change, so a model that relies only on these features is more likely to maintain a certain level of accuracy.
NB's main strength is its efficiency: Training and classification can be accomplished with one pass over the data. Because it combines efficiency with good accuracy it is often used as a baseline in text classification research. It is often the method of choice if (i) squeezing out a few extra percentage points of accuracy is not worth the trouble in a text classification application, (ii) a very large amount of training data is available and there is more to be gained from training on a lot of data than using a better classifier on a smaller training set, or (iii) if its robustness to concept drift can be exploited.
https://nlp.stanford.edu/IR-book/html/htmledition/properties-of-naive-bayes-1.html
The performance of Naive Bayes depends on the accuracy of the estimated conditional probability terms. It is hard to accurately estimate these terms when the training data is scarce.
M. Thangaraj & M. Sivakami, "Text Classification Techniques: A Literature Review", Interdisciplinary Journal of Information, Knowledge & Management, Vol. 13, 2018.
| Method | Advantages | Disadvantages |
| -- | -- | -- |
| Logistic Regression | Simple parameter estimation, works well for categorical predictions. | Requires large sample size, not suitable for non-linear problems, vulnerable to overconfidence. |
| Naïve Bayes | Fast classifier, converges earlier than discriminative models like logistic regression, requires less training, applies for both binary and multi-class problems. | Interactions between the features cannot be achieved. The probabilities calculated are not mathematically accurate, but relative probabilities. |
| SVM | Regularization parameter avoids over-fitting. Kernel engineering helps to incorporate expert knowledge. | Selecting the best kernel and the time consumed for training and testing. |
| Decision Trees | Simple to understand after providing explanation. Insights based on expert knowledge and dynamic. | Not suitable for multilevel categorical variables, biased information gain, complex for uncertain and multiple-valued attributes. |
| K-NN | Simpler implementation, flexible feature selection, good for multiclass problems. | Searching nearest neighbors and estimating the optimal k value. |
| Artificial Neural Networks | Easier to use, approximates any kind of function, and almost matches the human brain. | Requires large training and test data; much of the operation is hidden and it is difficult to increase accuracy. |
As an ML engineer, I would like to propose the high-level ways of implementing the ATA.