piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Fix irrelevant wiki pages #1604

Open menshikh-iv opened 7 years ago

menshikh-iv commented 7 years ago

We have several pages in the wiki, but much of the information is outdated. We need to fix it.

YouYueHuang commented 7 years ago

Hello, can I do it?

menshikh-iv commented 7 years ago

@YouYueHuang Hello, yes, but please, before editing, discuss all text first here.

YouYueHuang commented 7 years ago

No problem

YouYueHuang commented 7 years ago

For now, I have some questions about parts of your to-do list: (1) Developer page: May I ask what the differences are between the current state and the previous state?

(2) GSOC 2017 project: How would you like to mark it as archive?

(3) Student Projects / Ideas & Feature proposals: @menshikh-iv mentioned you would like to

Here is an example of a table of contents (https://github.com/d3/d3/blob/master/API.md#scales-d3-scale).

(4) For the roadmap and Recipes & FAQ, how do I know what to write? Do you have a preferred format or writing style?

menshikh-iv commented 7 years ago

@YouYueHuang (1) I think I'll fill it in myself (because releases are now my responsibility). (2) Add the prefix [Archive]. (3-4) We need to continue the discussion with @piskvorky here.

YouYueHuang commented 7 years ago

@menshikh-iv OK, for now I will focus on Student Projects / Ideas & Feature proposals. I have sent an email to @piskvorky and told him we will discuss the changes here.

YouYueHuang commented 7 years ago

Hi @menshikh-iv, I can see @piskvorky leaving messages in other issues, but he has not responded to this one. Could you please tell me how to contact him? Many thanks.

piskvorky commented 7 years ago

Hello @YouYueHuang, please follow @menshikh-iv's instructions here; I have no additional information. Thanks!

menshikh-iv commented 7 years ago

@YouYueHuang I'll notify you when I'm ready with a detailed plan.

menshikh-iv commented 7 years ago

sorry, misclick

YouYueHuang commented 7 years ago

@menshikh-iv @piskvorky Thanks for updating the information. I will wait for your detailed plan.

menshikh-iv commented 7 years ago

@piskvorky @YouYueHuang Some ideas about the features & proposals page and the student projects page:

  1. Need to merge the features & proposals page with the student projects page

  2. Need to add a more detailed description for every project (with background, todo, and resources sections)

  3. Details

"Mark as WIP" means adding a short line after the heading: "currently in progress, see PR #... by @..."

| project | status | priority | action |
|---|---|---|---|
| Visualization | already implemented by @parulsethi + last project #1616 | high | mark as WIP |
| Sanity checks | always relevant | medium | - |
| Model selection | implemented as a "side effect" of sklearn-api from @chinmayapancholi13 | - | remove |
| Distributed computing | still relevant | high | - |
| Distributed sim queries | earlier it was simserver (no longer maintained), for now it's the scaletext project | - | remove |
| Online NNMF | very relevant for us (and very hard to implement) | high | merge with description from student project page |
| sLDA | WIP by @souravsingh | medium | mark as WIP, merge with description from student project page |
| ESA | WIP by @shubhamjain74 | medium | merge with description from student project page |
| DTM improvements | implemented as a wrapper, more isn't relevant | - | remove |
| Nested Hierarchical Dirichlet Processes | already implemented by @olavurmortensen | - | remove |
| nHDP | - | low | - |
| Pachinko Allocation Model | - | low | add todo section |
| Sparse-tool package | need to ask @souravsingh about it | medium | (2) |
| GLoVE | - | low | (2) |
| WordRank | implemented as a wrapper by @parulsethi | - | remove |
| Wrapper for BigARTM | irrelevant | - | remove |
| Add Montemurro and Zanette algorithm | - | low | (2) |
| VarEmbed | irrelevant for now (fasttext is very similar) | - | remove |

To be continued soon.

souravsingh commented 7 years ago

@menshikh-iv The Sparsetool package is somewhat complex, since we are dealing with C code. There was some progress in https://github.com/scipy/scipy/pull/7127 but it has stalled due to test failures. It would be good to revisit this.

YouYueHuang commented 7 years ago

@piskvorky, @menshikh-iv This is a simplified version of the student projects page. If you have any feedback, feel free to leave it in a comment. What I did:

  1. I found that almost all the goals and deliverables are the same, so I put them at the front.
  2. Some student projects duplicate a topic on the features & proposals page, so I added a link in the project column of the list.

    If you'd like to work on any of the topics below, you will contribute a scalable implementation of the algorithms to the data science world in Python. A quality implementation will be widely used in the industry. RaRe-Technologies offers financial rewards as well as technical and academic assistance for the projects below.

Goal:

Deliverables

Project Background Status
Online NNMF
(related to Online NNMF)
  • Non-negative matrix factorization, NNMF, is a popular machine learning algorithm, widely used in collaborative filtering and natural language processing. It can be phrased as an online learning algorithm.
  • While implementations of NNMF in Python exist, they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications.
Explicit Semantic Analysis
(related to ESA)
  • Explicit Semantic Analysis is a method of unsupervised document analysis using Wikipedia as a resource. It has many applications, for example event classification on Twitter.
  • While implementations of ESA exist in Python and other languages, they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications.
Supervised Latent Dirichlet Allocation
(related to Supervised LDA)
  • Supervised Latent Dirichlet Allocation (sLDA) is a Natural Language Processing method based on Latent Dirichlet Allocation (LDA). It can be used to predict, for example, the number of "Likes" for a post or the number of stars in a movie review.
  • In the vanilla LDA we treat the topic proportions for a text document as a draw from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic assignment from those proportions, then drawing a word from the corresponding topic. In Supervised Latent Dirichlet Allocation (sLDA), we add our target variable to the LDA model. For example, the number of stars assigned in a movie review or number of "Likes" of a post.
Consider integration with existing Python sLDA
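
The generative story described for sLDA above can be sketched in a few lines of standard-library Python. This is only an illustration, not gensim code: the names `TOPICS`, `sample_dirichlet` and `generate_document` are hypothetical, the toy topics are made up, and the Gaussian response on mean topic frequencies is an assumption based on the usual sLDA formulation rather than something specified in this draft.

```python
import random

# Hypothetical toy topics: each maps a word to its probability within the topic.
TOPICS = [
    {"plot": 0.5, "actor": 0.3, "boring": 0.2},
    {"great": 0.6, "fun": 0.3, "actor": 0.1},
]


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, eta, n_words, sigma=1.0, seed=0):
    """Generate one document and its response (e.g. a star rating)."""
    rng = random.Random(seed)
    theta = sample_dirichlet(alpha, rng)  # topic proportions for this document
    counts = [0] * len(topics)
    words = []
    for _ in range(n_words):
        # Choose a topic assignment from the proportions ...
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # ... then draw a word from the corresponding topic.
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
        counts[z] += 1
    # Assumed supervised step: the response is Gaussian around a linear
    # function of the empirical topic frequencies of the document.
    z_bar = [c / n_words for c in counts]
    mean = sum(e * zb for e, zb in zip(eta, z_bar))
    response = rng.gauss(mean, sigma)
    return theta, words, response
```

Calling `generate_document(TOPICS, alpha=[0.5, 0.5], eta=[1.0, 5.0], n_words=8)` returns the topic proportions, the sampled words and one simulated response. Fitting sLDA means inverting this process, which is the hard part and is not shown here.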
Word Movers Distance for word2vec
  • Word2Vec is a continuous word representation technique for creating word vectors to capture the syntax and semantics of words. The vectors used to represent the words have many interesting features, for example king−man+woman=queen.
  • Many methods are proposed on how to measure distance between sentences in this new vector space. "Word Mover's Distance" (WMD) is a novel distance-between-text-documents measure. It outperforms simple combinations like sum or mean. Visually, the distance between the two documents is the minimum cumulative distance that all words in document A need to travel to exactly match document B.
  • For example, these two sentences are close with respect to WMD even though they only have one word in common: "The restaurant is loud, we couldn't speak across the table" and "The restaurant has a lot to offer but easy conversation is not there".
Already being worked on by @RishabGoel
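
To make the "minimum cumulative distance" idea behind WMD concrete, here is a minimal sketch under strong simplifying assumptions: both documents have the same number of words, every word carries equal weight, and the 2-d vectors in `TOY_VECTORS` are invented for illustration (they are not real word2vec embeddings). Under those assumptions some optimal transport plan is a one-to-one matching, so a brute-force search over permutations gives the exact distance. The names `TOY_VECTORS` and `toy_wmd` are hypothetical, and a real implementation would use a proper transportation solver instead.

```python
from itertools import permutations
from math import dist

# Hypothetical 2-d "embeddings", made up for illustration only.
TOY_VECTORS = {
    "restaurant": (0.0, 0.0),
    "loud": (1.0, 0.0),
    "cafe": (0.0, 1.0),
    "noisy": (1.0, 1.0),
}


def toy_wmd(doc_a, doc_b, vectors):
    """WMD for two equal-length documents whose words each weigh 1/n.

    Tries every one-to-one matching between the words of the two documents
    and keeps the cheapest one. Only viable for very short documents.
    """
    if len(doc_a) != len(doc_b):
        raise ValueError("this sketch assumes documents of equal length")
    n = len(doc_a)
    best_total = min(
        sum(dist(vectors[a], vectors[b]) for a, b in zip(doc_a, perm))
        for perm in permutations(doc_b)
    )
    # Each word moves a mass of 1/n, so the cost is the average travel distance.
    return best_total / n
```

For instance, `toy_wmd(["restaurant", "loud"], ["cafe", "noisy"], TOY_VECTORS)` compares two documents with no words in common, yet the distance stays small because each word has a close counterpart in the other document.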
Author-Topic Models
  • The author-topic model is a Natural Language Processing method that tells us about a person's writing. It can say how diverse the range of topics covered by one author is. It can also compare two authors and say how similar they are.
  • Best implementation is CVB below.
  • The author-topic model adds information about an author to the very popular Latent Dirichlet Allocation (LDA) model.
  • While there are academic implementations in Python and other languages, they are very slow for large datasets.
Already being worked on by @olavurmortensen
Distributed computing for Latent Dirichlet Allocation
(related to distributed-computing)
  • Latent Dirichlet Allocation (LDA) is a very popular algorithm for modelling topics of text documents.
  • Modern data mining relies on high-level distributed frameworks like Hadoop, Spark, Celery, Disco, Samza and Ibis. While there are implementations of distributed LDA in Scala over Spark and in other languages, there is no established distributed computing framework that contains an LDA implementation in Python.
  • You will contribute a scalable implementation of distributed LDA to the data science world in Python, building on top of one of the existing distributed frameworks.
Distributed computing for Latent Semantic Indexing
(related to distributed-computing)
  • Latent Semantic Indexing (LSI) is a very popular algorithm for modelling topics of text documents.
  • Modern data mining relies on high-level distributed frameworks like Hadoop, Spark, Celery, Disco, Samza and Ibis. While there are implementations of distributed LSI in Scala over Spark and in other languages, there is no established distributed computing framework that contains an LSI implementation in Python.
  • You will contribute a scalable implementation of distributed LSI to the data science world in Python, building on top of one of the existing distributed frameworks.
Distributed computing for word2vec
(related to distributed-computing)
  • Word2Vec is a continuous word representation technique for creating word vectors to capture the syntax and semantics of words. The vectors used to represent the words have many interesting features, for example king−man+woman=queen.
  • Modern data mining relies on high-level distributed frameworks like Hadoop, Spark, Celery, Disco, Samza and Ibis. While there are implementations of distributed word2vec in Scala over Spark and in other languages, there is no established distributed computing framework that contains a word2vec implementation in Python.
  • You will contribute a scalable implementation of distributed word2vec to the data science world in Python, building on top of one of the existing distributed frameworks.

YouYueHuang commented 7 years ago

@souravsingh, how would you like it to be added to the feature proposal list? Could you give me explicit directions so I can help?