tanussingh / Big-Data-Management-Analytics-Project

Final Project for CS 6350.001 - Large Scale Data Collection and Preprocessing in Spark

Setup Mongo → Kafka → Spark → Kafka → Mongo Streaming #1

Open · ishansharma opened this issue 5 years ago

ishansharma commented 5 years ago

This can either be one Spark job or separate Python scripts that we run from bash or a coordinating script.
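If we go the separate-scripts route, a coordinating script could be as simple as the sketch below. Everything here is a placeholder on my part: the script file names, the `--date` flag, and the assumption that stages 2 and 3 are launched with `spark-submit`.

```python
#!/usr/bin/env python3
"""Coordinating script (sketch): starts the two long-running Spark
streaming jobs, then feeds one day of data in with Script 1.
All file names and the --date flag are placeholders."""
import subprocess
import sys


def main(date: str) -> None:
    # Scripts 2 and 3 are streaming jobs: start them first and leave
    # them running so they can consume what Script 1 produces.
    spark_jobs = [
        subprocess.Popen(["spark-submit", "script2_kafka_spark_kafka.py"]),
        subprocess.Popen(["spark-submit", "script3_kafka_to_mongo.py"]),
    ]
    # Script 1 is a plain Python script: run it to completion.
    subprocess.run(
        [sys.executable, "script1_mongo_to_kafka.py", "--date", date],
        check=True,
    )
    # The streaming jobs keep running; stop them manually (or add
    # termination logic here) once the output collection looks right.
    for job in spark_jobs:
        job.wait()


if __name__ == "__main__":
    main(sys.argv[1])  # usage: python coordinator.py 2020-04-01
```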

ishansharma commented 5 years ago

@tanussingh @mavisfrancia Either one or both of you can take this.

As we discussed, separate Python scripts would be easier, with a structure like this (rough sketches of each script are at the end of this comment):

  1. Script 1 reads data from Mongo (we pass the date as an argument)
  2. Script 1 performs NER/Doc2Vec on each document, then sends the result to a Kafka stream
  3. Script 2 (a Spark job) reads from the Kafka stream, runs the UDPipe operations, deduplicates, and writes the output back to another Kafka stream
  4. Script 3 (another Spark job?) reads the output Kafka stream and writes the results to another Mongo collection

We can do it either way.
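To make steps 1 and 2 concrete, here's a rough sketch of Script 1 using pymongo and kafka-python. The connection strings, the `bigdata.articles` collection, the `date` field, and the `raw-docs` topic are all names I made up; the NER/Doc2Vec step is left as a pass-through comment.

```python
#!/usr/bin/env python3
"""Script 1 (sketch): pull one day of documents from Mongo and push
them to a Kafka topic. The URIs, `bigdata.articles`, the `date`
field, and the `raw-docs` topic are placeholders."""
import argparse
import json

from kafka import KafkaProducer
from pymongo import MongoClient


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--date", required=True, help="e.g. 2020-04-01")
    args = parser.parse_args()

    collection = MongoClient("mongodb://localhost:27017")["bigdata"]["articles"]
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for doc in collection.find({"date": args.date}):
        doc["_id"] = str(doc["_id"])  # ObjectId isn't JSON-serializable
        # NER/Doc2Vec would run on `doc` here before producing;
        # pass-through in this sketch.
        producer.send("raw-docs", doc)

    producer.flush()


if __name__ == "__main__":
    main()
```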
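Script 2 as a Structured Streaming job might look like the sketch below. Again, the topic names, JSON schema, and checkpoint path are assumptions, and the UDPipe step is only a placeholder comment. The job needs the Kafka connector on the classpath (e.g. `spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_...` matching our Spark version).

```python
"""Script 2 (sketch): Structured Streaming job, Kafka in -> dedup -> Kafka out.
Topic names, schema, and checkpoint path are placeholders."""
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, struct, to_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-udpipe-dedup").getOrCreate()

# Assumed shape of the JSON that Script 1 produces.
schema = StructType([
    StructField("_id", StringType()),
    StructField("text", StringType()),
])

docs = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "raw-docs")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("doc"))
        .select("doc.*"))

# UDPipe tagging/parsing would go here (e.g. as a UDF over `text`);
# omitted in this sketch. Dedup on the document id; in production we'd
# add a watermark so the dedup state stays bounded.
deduped = docs.dropDuplicates(["_id"])

query = (deduped
         .select(to_json(struct("*")).alias("value"))
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "processed-docs")
         .option("checkpointLocation", "/tmp/checkpoints/dedup")
         .start())

query.awaitTermination()
```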
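And Script 3: read the output topic and write each micro-batch into Mongo. This sketch uses `foreachBatch` plus pymongo to avoid an extra dependency; the MongoDB Spark connector is the alternative. The `processed_articles` collection and URIs are placeholders.

```python
"""Script 3 (sketch): read the processed topic and insert each
micro-batch into a Mongo collection via foreachBatch + pymongo.
Topic, collection, and connection names are placeholders."""
import json

from pymongo import MongoClient
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-mongo").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "processed-docs")
          .load()
          .select(col("value").cast("string").alias("value")))


def write_batch(batch_df, batch_id):
    # collect() pulls the micro-batch to the driver; fine for our data
    # sizes, but the Mongo Spark connector would scale better.
    records = [json.loads(row.value) for row in batch_df.collect()]
    if records:
        client = MongoClient("mongodb://localhost:27017")
        client["bigdata"]["processed_articles"].insert_many(records)
        client.close()


query = (stream.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/checkpoints/mongo-sink")
         .start())

query.awaitTermination()
```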