Open ewmson opened 7 years ago
Just a status update to fill you in - I am currently working on a library to retrieve stock prices, but I should be able to finish that up soon. I will try to put some work into this issue before our meeting on Wednesday. But here's what I know right now:
It appears that the RDD-based Spark MLlib API has been moved to "maintenance" mode, with the DataFrame-based API now the primary one. The RDD-based API is planned to be deprecated in Spark 2.2 and removed in Spark 3.0. Supposedly, DataFrames make for a more user-friendly API than RDDs, especially for machine learning applications. I'll let you know when I get further into this task. So far, though, the new DataFrame-based library does seem better to work with.
Note that the cluster we are running on is unfortunately on Spark 1.5.0, and it does not look like that will change anytime soon (it is not under our control). So if more support was added after 1.5, we most likely do not have access to it :(
After a closer look at the 1.5.0 docs, I think this can be done. I can get started on it. Is there any reason why we aren't also moving the feature generation (Word2Vec) to the DataFrame API too? It looks like there is support for that, too.
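For reference, here is a rough sketch of what Word2Vec feature generation plus a classifier could look like on the DataFrame API in Spark 1.5. The column names, the toy rows, and the parameter values are made up for illustration and are not taken from our code; the local master is only there so the sketch runs standalone.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{Tokenizer, Word2Vec}
import org.apache.spark.sql.SQLContext

object DataFramePipelineSketch {
  def main(args: Array[String]): Unit = {
    // Spark 1.5 has no SparkSession; SQLContext is the DataFrame entry point.
    val conf = new SparkConf().setAppName("df-pipeline-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical toy data: a binary label (as Double) and raw text.
    val training = sqlContext.createDataFrame(Seq(
      (1.0, "stock prices rallied today"),
      (1.0, "shares climbed after earnings"),
      (0.0, "the weather was nice this weekend"),
      (0.0, "we watched a movie last night")
    )).toDF("label", "text")

    // Split text into words, since Word2Vec expects an array-of-strings column.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")

    // Word2Vec writes one averaged vector per row into the "features" column.
    val word2Vec = new Word2Vec()
      .setInputCol("words")
      .setOutputCol("features")
      .setVectorSize(50)
      .setMinCount(0)

    // Binary logistic regression; reads "features" and "label" by default.
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

    val pipeline = new Pipeline().setStages(Array(tokenizer, word2Vec, lr))
    val model = pipeline.fit(training)

    model.transform(training).select("text", "prediction").show()
    sc.stop()
  }
}
```

The appeal of this form is that feature generation and classification become stages of one `Pipeline`, so the fitted model carries the Word2Vec vocabulary along with the classifier.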
No specific reason, just did not want to have too much time spent on implementing it if we ended up not using it.
(Replying by email to Joey Watts's comment of Feb 7, 2017, 3:06 PM.)
Gotcha, makes sense
Do we need non-binary logistic regression? The DataFrame-based ML library in Spark 1.5 only supports binary logistic regression.
We would prefer the multiclass one.
(Replying by email to Joey Watts's comment of Tue, Feb 7, 2017, 5:15 PM.)
In that case, I think that we have to keep the existing RDD-based implementation.
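To make the trade-off concrete, here is a minimal sketch of multiclass logistic regression on the RDD-based API, which Spark 1.5 does support through `LogisticRegressionWithLBFGS.setNumClasses`. The three classes and the feature vectors are invented for illustration; this is not our existing implementation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object RddMulticlassSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-multiclass-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical toy data: labels must be 0.0, 1.0, ..., numClasses - 1.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
      LabeledPoint(0.0, Vectors.dense(0.9, 0.1, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.1, 0.9, 0.0)),
      LabeledPoint(2.0, Vectors.dense(0.0, 0.0, 1.0)),
      LabeledPoint(2.0, Vectors.dense(0.0, 0.1, 0.9))
    ))

    // setNumClasses switches the model from binary to multinomial.
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(3)
      .run(training)

    // predict returns the class index as a Double.
    val predicted = model.predict(Vectors.dense(0.0, 0.95, 0.05))
    println(s"Predicted class: $predicted")

    sc.stop()
  }
}
```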
I'm good to close this issue if you agree.
I will let @saurabhc123 make the call on whether we want it even if it is binary. Do not invest more time into it until we have a clear need/want for it. We will sort it out tomorrow at the meeting once we see what state the project is in.
Can we investigate the Spark ML library (note: not MLlib), which is based on DataFrames, and whether it is better than the existing RDD-based Spark MLlib implementation?
Also look at the existing design and what would need to change to incorporate the ML classifier if it proves useful. I assume we will need to change from RDDs at the very least.
DataFrames and Datasets documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html
Classification using Spark ML: http://spark.apache.org/docs/latest/ml-classification-regression.html
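As a starting point for the design question above, one possible way to move from RDDs to DataFrames without rewriting everything at once is to convert an existing `RDD[LabeledPoint]` with `toDF()`. This is only a sketch against the Spark 1.5 API with made-up data; how our current RDDs are actually shaped is an assumption here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SQLContext

object RddToDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-to-df-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Brings toDF() into scope for RDDs of case classes.
    import sqlContext.implicits._

    // Hypothetical stand-in for the RDD the existing code already produces.
    val points = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.1, 0.9)),
      LabeledPoint(0.0, Vectors.dense(0.8, 0.2))
    ))

    // LabeledPoint is a case class, so the columns become "label" and "features",
    // which are the default column names the spark.ml estimators read.
    val df = points.toDF()
    df.printSchema()

    sc.stop()
  }
}
```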