section-engineering-education / engineering-education

"Section's Engineering Education (EngEd) Program is dedicated to offering a unique quality community experience for computer science university students."
Apache License 2.0

[Machine Learning] Building a Multi-class Text Classification model using H2O and Scikit-learn #6725

Closed: charles721 closed this issue 2 years ago

charles721 commented 2 years ago

Proposal Submission

Proposed title of article

[Machine Learning] Building a Text Classification model using H2O and Scikit-learn

Proposed article introduction

Text classification is a machine learning technique that assigns predefined categories to open-ended text. A model is trained on labeled text to learn the target label and then predicts that label for new, unseen text. Text classification is one of the fundamental tasks in natural language processing, with applications such as sentiment analysis, topic labeling, spam detection, and intent detection.

Multi-class classification is a classification task with more than two classes. For example, we may classify a set of fruit images as oranges, apples, or pears. Multi-class classification assumes that each sample is assigned exactly one label: a fruit can be either an apple or a pear, but not both at the same time.
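As a small illustration of the "one and only one label" assumption, the sketch below picks the single highest-scoring class for a sample. The scores and the helper name `predict_single_label` are made up for illustration and are not part of the proposed article.

```python
def predict_single_label(scores):
    """Return the one class with the highest score (multi-class, single-label)."""
    return max(scores, key=scores.get)


# Hypothetical per-class scores for one fruit image.
fruit_scores = {"orange": 0.15, "apple": 0.70, "pear": 0.15}

predicted = predict_single_label(fruit_scores)
print(predicted)  # apple
```

A multi-label setup would instead keep every class whose score passes a threshold, which is exactly what the multi-class assumption rules out.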

In this tutorial, we will build a model that classifies consumer finance complaints into 5 predefined classes. The Consumer Complaint Database is a collection of complaints about consumer financial products and services. We will use Scikit-learn for text preprocessing and vectorization, and H2O to automate the NLP process using a pipeline. The H2O library will select the best algorithm and perform the model evaluation.
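The vectorization step could look roughly like the sketch below, which uses Scikit-learn's `TfidfVectorizer`. The three complaint strings are invented placeholders, not rows from the Consumer Complaint Database, and the parameter choices are assumptions rather than the article's final settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example complaints (placeholders, not real dataset rows).
complaints = [
    "I was charged a late fee on my credit card twice",
    "The debt collector keeps calling me about a loan I already paid",
    "My mortgage payment was applied to the wrong account",
]

# Lowercase the text and drop common English stop words while building TF-IDF features.
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
tfidf_matrix = vectorizer.fit_transform(complaints)

# One row per complaint, one column per vocabulary term.
print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```

The result is a sparse matrix, so it would likely need converting to a dense table before being handed to another library.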

Key takeaways

  1. What is multi-class text classification?
  2. Applications of text classification such as sentiment analysis, topic labeling, spam detection, and intent detection.
  3. Text pre-processing (stemming and removing stop words)
  4. Text vectorization using TfidfVectorizer.
  5. Using H2O AutoML to select the best algorithm.
  6. Using the selected best algorithm to train a customer complaints model.
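To make the pre-processing takeaway concrete, here is a minimal standard-library sketch of stop-word removal and stemming. The proposal does not name a stemming library, so the tiny stop-word set and the suffix-stripping `crude_stem` helper are hypothetical stand-ins for a real stemmer such as a Porter stemmer.

```python
import re

# Tiny illustrative stop-word set; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "was", "my", "on", "to", "and", "of", "i"}

# Suffixes are tried in this order; this is far cruder than a real stemmer.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The bank charged fees on my closed accounts"))
# ['bank', 'charg', 'fee', 'clos', 'account']
```

Stems such as `charg` are not dictionary words, which is expected: the goal is only to map related word forms to the same token before vectorization.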

Article quality

This tutorial is unique because it will integrate Scikit-learn and H2O in building the text classification model. We will use Scikit-learn for text preprocessing, text cleaning, and vectorization. H2O will create a pipeline that automates model building and selects the best algorithm to train a customer complaints model. We will create an end-to-end NLP pipeline covering text cleaning, model selection, model evaluation, and handling imbalanced datasets.
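A rough sketch of the H2O side of such a pipeline is shown below. It assumes a pandas DataFrame whose numeric columns are text features (for example TF-IDF values) plus one label column. The function name `run_automl`, the column name `label`, and the parameter values are illustrative assumptions, and running it requires the `h2o` package and a Java runtime.

```python
def run_automl(table, target="label", max_models=5, seed=1):
    """Train H2O AutoML on a feature table and return (leader, test performance).

    `table` is assumed to be a pandas DataFrame with numeric feature columns
    and a `target` column holding the class names.
    """
    # Imported here so that defining the function does not require H2O or Java.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start or connect to a local H2O cluster

    frame = h2o.H2OFrame(table)
    frame[target] = frame[target].asfactor()  # mark the label as categorical

    train, test = frame.split_frame(ratios=[0.8], seed=seed)
    predictors = [col for col in frame.columns if col != target]

    # balance_classes is one possible way to address class imbalance;
    # whether it helps on this dataset is an open question.
    aml = H2OAutoML(max_models=max_models, seed=seed, balance_classes=True)
    aml.train(x=predictors, y=target, training_frame=train)

    print(aml.leaderboard)  # candidate models ranked by the default metric
    return aml.leader, aml.leader.model_performance(test)
```

Calling `run_automl(features_df)` would return the top-ranked model together with its metrics on the held-out 20% split.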

References

Please list links to any published content/research that you intend to use to support/guide this article.

Conclusion

Finally, remove the Pre-Submission advice section and all our blockquoted notes as you fill in the form before you submit. We look forward to reviewing your topic suggestion.

Templates to use as guides

github-actions[bot] commented 2 years ago

👋 @charles721 Good afternoon and thank you for submitting your topic suggestion. Your topic form has been entered into our queue and should be reviewed (for approval) as soon as a content moderator is finished reviewing the ones in the queue before it.