seanlane opened this issue 8 years ago
Hi,
A very nice post! Thank you very much!
Do you know how to get the topic distribution for each training document? I read a bit of https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda. Based on that, it should be possible to get the topic distribution for each training document through topTopicsPerDocument, but I got an error saying that there is no such attribute.
Cheers, Doudou
Hi Doudou,
Unfortunately, as of the current version of PySpark (2.1), there isn't a way to do that. However, it is possible with the regular Spark project: PySpark is largely a Python wrapper over the original Apache Spark project, which is written in Scala. You can use the topTopicsPerDocument method found here: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel
And while it currently isn't possible in PySpark, you can follow the same process I've written about in this post in Spark to achieve the same results. Good luck!
Sean Lane
Hi Sean Lane,
Thank you very much for your reply!
I tried 'transform' after fitting the LDA model as described at https://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda. It sometimes works, but sometimes it doesn't. Do you have a clue why that is?
All the best, Doudou
It's hard to tell, but I would suggest reading through the errors that occur when the transform method doesn't work; they should help you understand what problems are occurring. Good luck!
Hi Sean Lane,
This is a great post and I am thinking about posting the Java version of it in my upcoming book: Data Algorithms, 2nd Edition.
One question: in the last line of your code you refer to "topic_val", which is not defined anywhere in the code. Should that be "topic_indices"?
Thank you, best regards, Mahmoud Parsian
Hi Mahmoud,
Thank you for pointing that out; it was actually a holdover from an iteration on this code where I was playing with a different number of topics for a different dataset. I have corrected the error, along with some other inconsistencies that I noticed, and it should be correct now. Good luck with your book!
Thanks, Sean
Thank you very much Sean!
Comments for my blog post on Latent Dirichlet Allocation with PySpark: https://sean.lane.sh/blog/2016/PySpark_and_LDA