(1) Use of pre-trained GloVe embeddings. The glove6b.zip has different flavors of dimensions: 50d/100d/200d/300d. The Common Crawl vectors are another option that can be explored (see the loading sketch after this list). Reference: https://nlp.stanford.edu/projects/glove/
(2) BERT is another pre-trained model that can be explored - it has various flavors such as BERT-Base and BERT-Large.
(3) RoBERTa by Facebook is another option. RoBERTa removes the Next Sentence Prediction (NSP) task from BERT’s pre-training and introduces dynamic masking, so that the masked token changes during the training epochs. RoBERTa uses 160 GB of text for pre-training, including the 16 GB of Books Corpus and English Wikipedia used in BERT. The additional data included the CommonCrawl News dataset (63 million articles, 76 GB), a Web text corpus (38 GB) and Stories from Common Crawl (31 GB). This, coupled with a whopping 1,024 Tesla V100 GPUs running for a day, is what the pre-training of RoBERTa took.
(4) ELMo is yet another option that can be explored. Developed in 2018 by AllenNLP, it goes beyond traditional embedding techniques: it uses a deep, bi-directional LSTM model to create word representations. Rather than a dictionary of words and their corresponding vectors, ELMo analyzes words within the context in which they are used. It is also character-based, allowing the model to form representations of out-of-vocabulary words.
ELMo's emphasis on context should ideally make it a better performer than GloVe.
(5) XLNet is another option
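Below is a minimal sketch of how the pre-trained GloVe vectors mentioned in (1) could be loaded, assuming the `glove.6B.100d.txt` file from glove6b.zip has been downloaded; the tokenizer's `word_index` and the downstream model are not shown and are only assumed here.

```python
import numpy as np

EMBEDDING_DIM = 100  # matches the 100d flavor of glove.6B

# Load the pre-trained vectors into a {word: vector} lookup
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

def build_embedding_matrix(word_index, embedding_dim=EMBEDDING_DIM):
    """word_index: {token: integer id} produced by whatever tokenizer is used."""
    matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:      # words missing from GloVe stay all-zero
            matrix[i] = vector
    return matrix
```

The resulting matrix can initialise an embedding layer of a neural classifier, either frozen or fine-tuned during training.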
fyi @asjad248 @mdkamal07
Another experiment could be the type of model to be used (a comparison sketch follows after the list):-
(1) SVM (2) Random Forest (3) Naive Bayes (4) RNN (5) GRU (6) LSTM
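A hedged sketch of how the traditional candidates could be compared on the same TF-IDF features with scikit-learn; the toy tickets and groups below are placeholders for the real dataset, and weighted F1 is just one reasonable scoring choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Toy stand-ins for the real ticket descriptions and assignment groups
texts = ["vpn keeps dropping", "cannot connect to vpn", "router port down", "wifi not reachable",
         "reset my password", "account locked out", "password expired", "need access to shared drive"]
labels = ["network", "network", "network", "network",
          "access", "access", "access", "access"]

candidates = {
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=200),
    "Naive Bayes": MultinomialNB(),
}

# Same vectorizer, different classifier, same cross-validation protocol
for name, clf in candidates.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))), ("clf", clf)])
    scores = cross_val_score(pipe, texts, labels, cv=2, scoring="f1_weighted")
    print(f"{name}: mean weighted F1 = {scores.mean():.3f}")
```

The RNN/GRU/LSTM candidates would need a separate embedding-plus-sequence pipeline rather than TF-IDF features.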
fyi @asjad248 @mdkamal07
See how model performance varies if the input is the short description, the long description, or a combination of both.
Combine the different groups into sub-groups based on clustering and compare the model performance.
Explore how traditional and complex models perform with and without stop words (a sketch of these experiments follows).
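A minimal sketch of how the input-field and stop-word experiments could be organised around one pipeline; the dataframe `df` and its columns `short_description`, `long_description` and `group` are assumptions about the ticket data, with toy rows used here so the snippet runs.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Toy stand-in for the ticket dataframe; the column names are assumptions
df = pd.DataFrame({
    "short_description": ["vpn down", "password reset", "wifi slow", "account locked"] * 2,
    "long_description":  ["vpn drops every few minutes", "need my password reset today",
                          "office wifi is very slow", "my account got locked again"] * 2,
    "group": ["network", "access", "network", "access"] * 2,
})

input_variants = {
    "short only":   df["short_description"],
    "long only":    df["long_description"],
    "short + long": df["short_description"] + " " + df["long_description"],
}

# Cross each input variant with stop-word handling, keeping the model fixed
for input_name, texts in input_variants.items():
    for stop_words in (None, "english"):
        pipe = Pipeline([("tfidf", TfidfVectorizer(stop_words=stop_words)),
                         ("clf", MultinomialNB())])
        score = cross_val_score(pipe, texts, df["group"], cv=2,
                                scoring="f1_weighted").mean()
        print(f"{input_name:12s} | stop words removed: {stop_words is not None} | F1 = {score:.3f}")
```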
Approaches towards automatic ticket classification:-
Though this may look simpler for a small dataset, there are disadvantages to this approach:-
There are various ways to convert text into vectors - from basic techniques such as One-Hot Encoding and TF-IDF to complex embeddings using GloVe/BERT.
The vectors are then fed into a classification model. Again, this could be a traditional ML classifier such as SVC or Naive Bayes, or a neural-network-based model such as a GRU or LSTM (a minimal neural sketch follows).
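As one possible shape of the neural-network branch, here is a minimal Keras sketch with an embedding layer followed by an LSTM; the vocabulary size, sequence length and number of groups are placeholders, and the embedding weights could be initialised from the GloVe matrix sketched earlier.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20_000   # placeholder: size of the fitted tokenizer vocabulary
MAX_LEN = 100         # placeholder: padded ticket length in tokens
NUM_GROUPS = 25       # placeholder: number of assignment groups

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    # The embedding weights could be initialised from the GloVe matrix built earlier
    layers.Embedding(VOCAB_SIZE, 100),
    layers.LSTM(64),
    layers.Dense(NUM_GROUPS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```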
There are certain limitations though:
fyi.. @asjad248 @mdkamal07 @LakshmiDadarkar
Performance Metrics
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1 score = 2 ∗ (Precision ∗ Recall) / (Precision + Recall)

where:
• TP (True Positive): examples predicted to be positive that are positive;
• TN (True Negative): examples predicted to be negative that are negative;
• FP (False Positive): examples predicted to be positive but are negative;
• FN (False Negative): examples predicted to be negative but are positive.
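The same metrics expressed in code with scikit-learn; for the multi-class ticket-group problem a weighted average is one reasonable choice, and the labels below are toy placeholders.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true, y_pred: actual vs. predicted assignment groups (toy placeholders)
y_true = ["network", "network", "database", "access", "access"]
y_pred = ["network", "database", "database", "access", "network"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```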
Naive Bayes Classifier:- A multinomial Naive Bayes classifier can be used to classify tickets into different groups. The model uses the frequency of words to calculate the probability of the group to which a ticket will belong. It does not consider the context of the statements.
Here, the ‘naive’ assumption is that every word in a sentence is independent of the other ones, and hence Bayes' Theorem can be applied. This assumption will not hold in real-world scenarios. In addition, the multinomial model makes an assumption of positional independence: the position of a term in a document by itself does not carry information about the class. Although there is a difference between "China sues France" and "France sues China", the occurrence of China in position 1 versus position 3 of the document is not useful in NB classification, because the model looks at each term separately. The conditional independence assumption commits to this way of processing the evidence. However, NB models perform well despite the conditional independence assumption.
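A minimal illustration of the multinomial model on plain word counts; the ticket texts and groups are toy placeholders, and `alpha=1.0` (Laplace smoothing) guards against zero probabilities for unseen words.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["vpn not connecting", "reset my password", "password expired",
         "cannot reach vpn gateway"]                    # toy ticket texts
groups = ["network", "access", "access", "network"]     # toy assignment groups

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)                # plain word frequencies, no context
nb = MultinomialNB(alpha=1.0)                           # alpha=1.0 is Laplace smoothing
nb.fit(counts, groups)

print(nb.predict(vectorizer.transform(["vpn password issue"])))
```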
Even if it is not the method with the highest accuracy for text, NB has many virtues that make it a strong contender for text classification. It excels if there are many equally important features that jointly contribute to the classification decision.
It can have decent performance even when using fewer than a dozen terms. The most important indicators for a class are less likely to change, so a model that relies only on these features is more likely to maintain a certain level of accuracy.
NB's main strength is its efficiency: Training and classification can be accomplished with one pass over the data. Because it combines efficiency with good accuracy it is often used as a baseline in text classification research. It is often the method of choice if (i) squeezing out a few extra percentage points of accuracy is not worth the trouble in a text classification application, (ii) a very large amount of training data is available and there is more to be gained from training on a lot of data than using a better classifier on a smaller training set, or (iii) if its robustness to concept drift can be exploited.
https://nlp.stanford.edu/IR-book/html/htmledition/properties-of-naive-bayes-1.html
The performance of Naive Bayes depends on the accuracy of the estimated conditional probability terms. It is hard to accurately estimate these terms when the training data is scarce.
M. Thangaraj & M. Sivakami, "Text Classification Techniques: A Literature Review", Interdisciplinary Journal of Information, Knowledge & Management, Vol. 13, 2018.
| Method | Advantages | Disadvantages |
| -- | -- | -- |
| Logistic Regression | Simple parameter estimation, works well for categorical predictions. | Requires large sample size, not suitable for non-linear problems, vulnerable to overconfidence. |
| Naïve Bayes | Fast classifier, converges earlier than discriminative models like logistic regression, requires less training, applies for both binary and multi-class problems. | Interactions between the features cannot be achieved. The probabilities calculated are not mathematically accurate, but relative probabilities. |
| SVM | Regularization parameter avoids over-fitting. Kernel engineering helps to incorporate expert knowledge. | Selecting the best kernel and the time consumed for training and testing. |
| Decision Trees | Simple to understand after providing explanation. Insights based on expert knowledge and dynamic. | Not suitable for multilevel categorical variables, biased information gain, complex for uncertain and multiple-valued attributes. |
| K-NN | Simpler implementation, flexible feature selection, good for multiclass problems. | Searching nearest neighbors and estimating the optimal k value. |
| Artificial Neural Networks | Easier to use, approximates any kind of function, and almost matches the human brain. | Requires large training and test data; much of the operation is hidden and it is difficult to increase accuracy. |
As an ML engineer, I would like to propose the high-level ways of implementing the ATA.