section-engineering-education / engineering-education

"Section's Engineering Education (EngEd) Program is dedicated to offering a unique quality community experience for computer science university students."
Apache License 2.0

Multiclass Text Classification with PySpark #3222

Closed jamesomina99 closed 3 years ago

jamesomina99 commented 3 years ago

Proposed title of article

How to Build a Multiclass Text Classification Model with PySpark

Introduction paragraph (2-3 paragraphs):

PySpark is the Python interface for Apache Spark. It lets us write Spark applications using Python APIs and provides the PySpark shell for interactively analyzing data in a distributed environment. PySpark supports most of Spark's features, such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning), and Spark Core.

We shall use the pyspark.ml API, which is built on the DataFrame API, to build our text classification app. In this tutorial, we will use PySpark to create a pipeline that processes our dataset and trains a classifier. The text classifier we build can predict the subject category of a Udemy course from the text the user inputs. The feature-extraction part of our Spark ML pipeline includes three steps:

  1. regexTokenizer: Tokenization (with Regular Expression)
  2. stopwordsRemover: Remove Stop Words
  3. countVectors: Count vectors (“document-term vectors”)
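
To make the three steps concrete, here is a minimal plain-Python sketch of what each stage computes. It does not use Spark: in PySpark these stages would correspond to `RegexTokenizer`, `StopWordsRemover`, and `CountVectorizer` from `pyspark.ml.feature`. The sample titles, the short stop-word list, the `\W+` split pattern, and the function names are all assumptions made for illustration, not part of the proposed article.

```python
import re
from collections import Counter

# Hypothetical sample inputs, not taken from the Udemy dataset.
TITLES = [
    "Learn Python for Data Science",
    "The Complete Guide to Web Design",
]

# A tiny illustrative stop-word list (assumption); real stop-word lists are much longer.
STOP_WORDS = {"the", "for", "to", "a", "an", "and", "of"}


def regex_tokenize(text, pattern=r"\W+"):
    """Step 1: lowercase the text and split it on runs of non-word characters."""
    return [tok for tok in re.split(pattern, text.lower()) if tok]


def remove_stop_words(tokens):
    """Step 2: drop tokens that appear in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def count_vectors(docs):
    """Step 3: build a shared vocabulary and one term-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            vec[index[tok]] = n
        vectors.append(vec)
    return vocab, vectors


tokens = [remove_stop_words(regex_tokenize(title)) for title in TITLES]
vocab, vectors = count_vectors(tokens)

print(tokens)
print(vocab)
print(vectors)
```

One difference worth noting: this sketch orders the vocabulary alphabetically for readability, whereas Spark's `CountVectorizer` orders its vocabulary by term frequency across the corpus and returns sparse vectors.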

Key takeaways:

  1. What PySpark is and how to install it
  2. Working with Datasets using PySpark
  3. Using the pyspark.ml Pipeline API to build our text classification model
  4. Training, testing, and making predictions.

References:

Please list links to any published content/research that you intend to use to support/guide this article.

Templates to use as guides

hectorkambow commented 3 years ago

@jamesomina99 Good afternoon and thank you for submitting your topic suggestion. Your topic form has been entered into our queue and should be reviewed (for approval) as soon as a content moderator is finished reviewing the ones in the queue before it.

lalith1403 commented 3 years ago

Hello, @jamesomina99. This is a super helpful and important topic. Please start working on the proposed topic. Let's ignite all the boosters on this one and deliver superior value to the reader.

Let's be sure to provide value that we as developers feel is scarcely available out there on the web. Custom projects would be the best way to explain such concepts. We avoid projects and explanations easily available on documentation sites and blogs.

Cheers, Lalith