revaturelabs / biforce

Biforce is a project conducted by Revature to improve its business decisions by re-examining existing metrics and investigating new metrics that will increase the value of company assets. The goal is to leverage all relevant technologies to automate data analysis within the business intelligence life cycle across different departments of the company. The objective is to implement efficient data-processing algorithms, using tools from the Hadoop ecosystem, that run on both a physical cluster and a cloud cluster.

Test Scalability of Spark BatteryAnalysis #118

Open Mwegert opened 5 years ago

Mwegert commented 5 years ago

Expected Behavior

The Spark program should work even if portions of the dataset are too large to fit in memory.

Actual Behavior

We suspect that ModelFunction may have scalability issues. We have provided ModelFunctionDataset, which resolves this but ran slower when we tested it. If issues do occur, replace ModelFunction with ModelFunctionDataset in the Driver program, then look into optimizing the Dataset logic in ModelFunctionDataset.

Steps to Reproduce the Problem

  1. Generate gigabytes of data.
  2. Test the code as-is on an EMR cluster.
  3. Determine if an error occurs.
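
For step 1, a minimal sketch of a synthetic data generator, using only the Python standard library. The column names (`associate_id`, `week`, `test_type`, `score`) and the value ranges are hypothetical placeholders, not the actual BatteryAnalysis input schema, so adjust them to match the real input before testing.

```python
import random


def generate_csv(path, target_bytes, seed=0):
    """Write synthetic CSV rows to `path` until at least `target_bytes` are written.

    Returns the number of data rows written (header excluded).
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    test_types = ["quiz", "exam", "project", "verbal"]  # hypothetical values
    rows = 0
    with open(path, "wb") as f:
        # Hypothetical header; replace with the real input columns.
        written = f.write(b"associate_id,week,test_type,score\n")
        while written < target_bytes:
            line = "{},{},{},{:.2f}\n".format(
                rng.randint(1, 1_000_000),  # associate_id
                rng.randint(1, 10),         # week
                rng.choice(test_types),     # test_type
                rng.uniform(0, 100),        # score
            )
            # Track the byte count ourselves instead of querying the file size.
            written += f.write(line.encode("ascii"))
            rows += 1
    return rows
```

Calling `generate_csv("synthetic_battery.csv", 5 * 1024**3)` would produce roughly 5 GiB of rows. The resulting file can then be uploaded to storage that the EMR cluster can read for step 2.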