tanussingh / Big-Data-Management-Analytics-Project

Final Project for CS 6350.001 - Large Scale Data Collection and preprocessing in Spark

Integrate Spacy with Spark #6

Closed (ishansharma closed this 5 years ago)

ishansharma commented 5 years ago

Should spaCy be loaded inside the Spark job, or should this processing happen before we feed the data to Kafka? (This depends on, and affects, how we plan to run the job: either as a single Spark script or as a bunch of Python scripts.)

POST-RESEARCH: If we load it inside Spark, this link is helpful: https://blog.dominodatalab.com/making-pyspark-work-spacy-overcoming-serialization-errors/

That said, it seems easier to handle the spaCy processing before the data is streamed into Kafka.
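If we do end up running spaCy inside the Spark job, here is a rough sketch of one common way to avoid shipping a loaded pipeline from the driver: load it lazily, once per worker process, and process a whole partition at a time. This is only an illustration under assumptions, not what the project does. The names `get_nlp`, `tokenize_partition`, and `_default_loader` are hypothetical, and using `spacy.blank("en")` (tokenizer only) is an assumption; a full model could be swapped in through the loader.

```python
from typing import Callable, Iterable, Iterator, List

# Per-process cache. Each Python worker process fills it at most once,
# so the pipeline itself never has to be pickled and sent from the driver.
_NLP = None


def _default_loader():
    # Imported lazily so spaCy is only needed on the processes that
    # actually run the function.
    import spacy
    # Assumption: a blank English pipeline (tokenizer only) is enough.
    return spacy.blank("en")


def get_nlp(loader: Callable = _default_loader):
    """Return the cached pipeline, loading it on first use in this process."""
    global _NLP
    if _NLP is None:
        _NLP = loader()
    return _NLP


def tokenize_partition(
    texts: Iterable[str], loader: Callable = _default_loader
) -> Iterator[List[str]]:
    """Tokenize every text in one partition with a single shared pipeline."""
    nlp = get_nlp(loader)
    # nlp.pipe processes the texts as a stream instead of one call per text.
    for doc in nlp.pipe(texts):
        yield [token.text for token in doc]
```

In a PySpark job, assuming an RDD of plain strings, this could be wired in with something like `rdd.mapPartitions(tokenize_partition)`, which hands each partition's rows to the function as an iterator. The same function also works outside Spark, so it could be reused if we preprocess before Kafka instead.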