seanlane / seanlane.github.io

Repository for my personal website, made public in February 2018. Source code can be found on the source branch; the master branch is used for hosting the built website.
https://sean.lane.sh
MIT License

Comments: PySpark with Latent Dirichlet Allocation #2

Open seanlane opened 8 years ago

seanlane commented 8 years ago

Comments for my blog post on Latent Dirichlet Allocation with PySpark: https://sean.lane.sh/blog/2016/PySpark_and_LDA

DoudouT commented 7 years ago

Hi,

A very nice post! Thank you very much!

Do you know how to get the topic distribution for each training document? I read a bit on https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda. Based on that, it should be possible to get the topic distribution for each training document through topTopicsPerDocument, but I got an error saying that there is no such attribute.

Cheers, Doudou

seanlane commented 7 years ago

Hi Doudou,

Unfortunately, as of the current version of PySpark (2.1), there isn't a way to do that. However, it is possible with the regular Spark project. PySpark is largely a Python wrapper over the original Apache Spark project, which is written in Scala. You can use the topTopicsPerDocument method found here: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel

And while it currently isn't possible in PySpark, you can follow the same process I've written about in this post in Spark to achieve the same results. Good luck!

Sean Lane

DoudouT commented 7 years ago

Hi Sean Lane,

Thank you very much for your reply!

I tried 'transform' after fitting the LDA as described on https://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda. It sometimes works, but sometimes it doesn't. Do you have a clue why that is?

All the best, Doudou


seanlane commented 7 years ago

It's hard to tell, but I would suggest reading through the errors that occur when the transform method doesn't work; they should help you understand what problems are occurring. Good luck!

mahmoudparsian commented 7 years ago

Hi Sean Lane,

This is a great post and I am thinking about posting the Java version of it in my upcoming book: Data Algorithms, 2nd Edition.

One question: in the last line of your code you refer to "topic_val", which is not defined anywhere in the code. Should that be "topic_indices"?

Thank you, best regards, Mahmoud Parsian

seanlane commented 7 years ago

Hi Mahmoud,

Thank you for pointing that out. It was actually a holdover from an iteration of this code where I was experimenting with different numbers of topics for a different dataset. I have corrected the error, along with some other inconsistencies that I noticed, and it should be correct now. Good luck with your book!

Thanks, Sean

mahmoudparsian commented 7 years ago

Thank you very much Sean!
