Logistic Regression Classifier based on DataFrames

ewmson commented 7 years ago

Can we investigate the Spark ML library (note: not MLlib), which is based on DataFrames, and whether it is better than the existing RDD-based Spark MLlib implementation?

Also look at the existing design and what would need to change to incorporate the ML classifier if it proves useful. I assume we will need to move away from RDDs at the very least.

DataFrames and Datasets documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html

Classification using Spark ML: http://spark.apache.org/docs/latest/ml-classification-regression.html

joeywatts commented 7 years ago

Just a status update to fill you in - I am currently working on a library to retrieve stock prices, but I should be able to finish that up soon. I will try to put some work into this issue before our meeting on Wednesday. But here's what I know right now:

It appears that the RDD-based Spark MLlib API has been moved to "maintenance" mode, with the DataFrame-based API now the primary one. The RDD-based API is expected to be deprecated around Spark 2.2 and removed in Spark 3.0. DataFrames are supposed to provide a more user-friendly API than RDDs, especially for machine learning applications. So far the new DataFrame-based library does seem better to work with. I'll let you know when I make more progress on this task.

ewmson commented 7 years ago

Note that the cluster we are running on is unfortunately on Spark 1.5.0, and it does not look like that will change anytime soon (it is not under our control). So if more support was added after 1.5, we most likely do not have access to it :(

joeywatts commented 7 years ago

After a closer look at the 1.5.0 docs, I think this can be done. I can get started on it. Is there any reason why we aren't also moving the feature generation (Word2Vec) to the DataFrame API? It looks like that is supported as well.

ewmson commented 7 years ago

No specific reason, just did not want to have too much time spent on implementing it if we ended up not using it.

joeywatts commented 7 years ago

Gotcha, makes sense

joeywatts commented 7 years ago

Do we need non-binary logistic regression? The DataFrame-based ML library in Spark 1.5 only supports binary logistic regression.

ewmson commented 7 years ago

We would prefer the multiclass one.

joeywatts commented 7 years ago

In that case, I think that we have to keep the existing RDD-based implementation.

From the docs: "Currently, only binary classification is supported and the summary must be explicitly cast to BinaryLogisticRegressionTrainingSummary."

I'm good to close this issue if you agree.
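
For context on the binary vs. multiclass constraint above, here is a minimal sketch of the one-vs-rest reduction, a common way to build a multiclass classifier out of binary logistic regression models. It is plain Python with no Spark dependency, so it only illustrates the idea and is not the project's implementation. The function names, hyperparameters, and toy data are all made up for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_binary(rows, labels, epochs=500, lr=0.5):
    """Fit a binary logistic regression (labels 0/1) by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this row under the current model.
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def train_one_vs_rest(rows, labels):
    """Train one binary model per class: that class (1) against all others (0)."""
    models = {}
    for cls in sorted(set(labels)):
        binary_labels = [1 if y == cls else 0 for y in labels]
        models[cls] = train_binary(rows, binary_labels)
    return models


def predict(models, x):
    """Return the class whose binary model assigns the highest probability."""
    def score(model):
        weights, bias = model
        return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return max(models, key=lambda cls: score(models[cls]))


# Toy data (hypothetical): three well-separated 2-D clusters, one per class.
rows = [[0.0, 0.0], [0.2, 0.1], [4.0, 0.0], [4.1, 0.3], [0.0, 4.0], [0.2, 4.2]]
labels = [0, 0, 1, 1, 2, 2]
models = train_one_vs_rest(rows, labels)
predictions = [predict(models, x) for x in rows]
```

Spark's DataFrame-based API has a `OneVsRest` estimator built on the same idea, but whether it is available and adequate on the 1.5.0 cluster would need to be checked against the 1.5.0 docs before relying on it.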

ewmson commented 7 years ago

I will let @saurabhc123 make the call on whether we want it even if it is binary-only. Do not invest more time in it until we have a clear need or want for it. We will sort it out tomorrow at the meeting once we see what state the project is in.

saurabhc123 / cs4624

Logistic Regression Classifier based on DataFrames #1