Fix irrelevant wiki pages

menshikh-iv commented 7 years ago

We have several pages in the wiki, but much of the information is outdated and needs to be fixed.

- Word2Vec & Doc2Vec Wishlist - outdated, need to remove
- Home - useless, need to remove
- Developer page - need to update for current state
- GSOC 2017 project ideas - mark as archive (GSOC 2017 has finished)
- Student Projects / Ideas & Feature proposals - merge into one page, remove irrelevant items, and add info about what's already implemented
- Recipes & FAQ - need to rewrite from scratch
- Roadmap - outdated, need to write a new one

YouYueHuang commented 7 years ago

Hello, can I do it?

menshikh-iv commented 7 years ago

@YouYueHuang Hello, yes, but please, before editing, discuss all text first here.

YouYueHuang commented 7 years ago

No problem

YouYueHuang commented 7 years ago

For now, I have some questions about parts of your to-do list: (1) Developer page: May I ask what the differences are between the current state and the previous state?

(2) GSOC 2017 project: How would you like to mark it as archive?

(3) Student Projects / Ideas & Feature proposals: @menshikh-iv mentioned you would like to

- merge these into one
- remove irrelevant items
- add info about what's already implemented

I found the contents of these two pages quite lengthy, so I would suggest adding a table of contents. What do you think?

Here is an example of a table of contents: https://github.com/d3/d3/blob/master/API.md#scales-d3-scale

(4) For roadmap and Recipes & FAQ, how could I know what to write? Is there any preferred format or writing style for you?

menshikh-iv commented 7 years ago

@YouYueHuang (1) I think I'll fill it in myself (because releases are now my responsibility) (2) Add the prefix [Archive] (3-4) We need to continue the discussion with @piskvorky here.

YouYueHuang commented 7 years ago

@menshikh-iv ok, for now I will focus on Student Projects / Ideas & Feature proposals. I have sent an email to @piskvorky and told him we will discuss the changes here.

YouYueHuang commented 7 years ago

Hi @menshikh-iv, I can see @piskvorky leaving messages in other issues, but he has not responded to this one. Could you please tell me how to contact him? Many thanks.

piskvorky commented 7 years ago

Hello @YouYueHuang , please follow @menshikh-iv 's instructions here, I have no additional information. Thanks!

menshikh-iv commented 7 years ago

@YouYueHuang I'll notify you when I have a detailed plan ready

menshikh-iv commented 7 years ago

sorry, misclick

YouYueHuang commented 7 years ago

@menshikh-iv @piskvorky Thanks for updating the information. I will wait for your detailed plan.

menshikh-iv commented 7 years ago

@piskvorky @YouYueHuang Some ideas about the features & proposals page + student projects:

- Need to merge the features & proposals page + student projects
- Need to add a more detailed description for all projects (with background, todo, and resources sections)
Details

"Mark as WIP" means adding a short line after the heading: "currently in progress, see PR #... by @..."

| Project | Status | Priority | Action |
|---|---|---|---|
| Visualization | already implemented by @parulsethi + last project #1616 | high | mark as WIP |
| Sanity checks | always relevant | medium | - |
| Model selection | implemented as a "side effect" of sklearn-api from @chinmayapancholi13 | - | remove |
| Distributed computing | still relevant | high | - |
| Distributed sim queries | earlier it was simserver (no longer maintained), for now it's the scaletext project | - | remove |
| Online NNMF | very relevant for us (and very hard to implement) | high | merge with description from student project page |
| sLDA | WIP by @souravsingh | medium | mark as WIP, merge with description from student project page |
| ESA | WIP by @shubhamjain74 | medium | merge with description from student project page |
| DTM improvements | implemented as wrapper, more isn't relevant | - | remove |
| Nested Hierarchical Dirichlet Processes | already implemented by @olavurmortensen | - | remove |
| nHDP | - | low | - |
| Pachinko Allocation Model | - | low | add todo section |
| Sparse-tool package | need to ask @souravsingh about it | medium | (2) |
| GLoVE | - | low | (2) |
| WordRank | implemented as wrapper by @parulsethi | - | remove |
| Wrapper for BigARTM | irrelevant | - | remove |
| Add Montemurro and Zanette algorithm | - | low | (2) |
| VarEmbed | irrelevant for now (fasttext is very similar) | - | remove |

Will be continued soon

souravsingh commented 7 years ago

@menshikh-iv The Sparsetool package has some complexity associated with it, since we are dealing with C code. There was some progress here: https://github.com/scipy/scipy/pull/7127 but it has halted due to test failures. It would be good to revisit this.

YouYueHuang commented 7 years ago

@piskvorky, @menshikh-iv This is the simplified version of the student projects page. If you have any feedback, feel free to write it in a comment. What I did:

- I found that almost all the goals and deliverables are the same, so I put them at the front.
- Some student projects duplicate topics on the features & proposals page, so I added links in the Project column of the list.

If you'd like to work on any of the topics below, you will contribute a scalable Python implementation of the algorithms to the data science world. A quality implementation will be widely used in industry. RaRe-Technologies offers financial reward and technical and academic assistance for the projects below.
- Read this general summary before applying.
- Contact: student-projects@rare-technologies.com

Goal:

- Demonstrate understanding of the theory and practice of the following algorithms by describing, implementing and evaluating them.
- Implement a streamed model that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. By integrating with one of the existing distributed frameworks, it must simultaneously use multiple machines and multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
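
To make the "streamed, mini-batch, constant memory" requirement above more concrete, here is a minimal sketch. It is only an illustration, not gensim code: `iter_minibatches`, `RunningMean` and `train_streamed` are hypothetical names, the "model" is just a running mean, and it uses only the standard library so it stays self-contained (a real implementation would work on NumPy/SciPy arrays as the goal requires).

```python
from itertools import islice


def iter_minibatches(stream, batch_size):
    """Yield lists of at most `batch_size` items from any iterable.

    Only one batch is held in memory at a time, so the stream itself
    can be arbitrarily large (e.g. a generator reading from disk).
    """
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class RunningMean:
    """Toy stand-in for a model: tracks the mean of all values seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, batch):
        """Fold one mini-batch into the current estimate (online update)."""
        batch_count = len(batch)
        batch_mean = sum(batch) / batch_count
        total = self.count + batch_count
        # Shift the old mean towards the batch mean, weighted by batch size.
        self.mean += (batch_mean - self.mean) * batch_count / total
        self.count = total


def train_streamed(stream, batch_size=1000):
    """Train the toy model in a single pass over the stream."""
    model = RunningMean()
    for batch in iter_minibatches(stream, batch_size):
        model.update(batch)
    return model
```

Memory use here depends on `batch_size` only, not on the length of the stream, and `update` can be called again later when new data arrives.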

Deliverables

- Code: a pull request against gensim on github. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.
- Report: timings, memory use and accuracy of your implementation on the English Wikipedia corpus, the Cornell Movie review corpus, the Lee corpus of human similarity judgements, the "20 newsgroups" corpus, or other freely available datasets. A summary of insights into parameter selection and tuning of the model. For distributed-computing-based projects, an analysis of how performance changes as cores and machines are added to the cluster is particularly valued.

| Project | Background | Status |
|---|---|---|
| Online NNMF (related to Online NNMF) | Non-negative matrix factorization (NNMF) is a popular machine learning algorithm, widely used in collaborative filtering and natural language processing. It can be phrased as an online learning algorithm. While implementations of NNMF in Python exist, they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications. | |
| Explicit Semantic Analysis (related to ESA) | Explicit Semantic Analysis is a method of unsupervised document analysis using Wikipedia as a resource. It has many applications, for example event classification on Twitter. While implementations of ESA exist in Python and other languages, they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications. | |
| Supervised Latent Dirichlet Allocation (related to Supervised LDA) | Supervised Latent Dirichlet Allocation (sLDA) is a Natural Language Processing method based on Latent Dirichlet Allocation (LDA). It is used for predicting, for example, the number of "Likes" for a post or the number of stars in a movie review. In vanilla LDA we treat the topic proportions for a text document as a draw from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic assignment from those proportions, then drawing a word from the corresponding topic. In sLDA, we add our target variable to the LDA model, for example the number of stars assigned in a movie review or the number of "Likes" of a post. | Consider integration with existing Python sLDA |
| Word Movers Distance for word2vec | Word2Vec is a continuous word representation technique for creating word vectors that capture the syntax and semantics of words. The vectors used to represent the words have many interesting features, for example king−man+woman=queen. Many methods have been proposed for measuring the distance between sentences in this new vector space. "Word Mover's Distance" (WMD) is a novel distance measure between text documents. It outperforms simple combinations like sum or mean. Visually, the distance between two documents is the minimum cumulative distance that all words in document A need to travel to exactly match document B. For example, these two sentences are close with respect to WMD even though they only have one word in common: "The restaurant is loud, we couldn't speak across the table" and "The restaurant has a lot to offer but easy conversation is not there". | Already being worked on by @RishabGoel |
| Author-Topic Models | The author-topic model is a Natural Language Processing method that tells us about a person's writing. It can say how diverse the range of topics covered by one author is. It can also compare two authors and say how similar they are. Best implementation is CVB below. The author-topic model adds information about an author to the very popular Latent Dirichlet Allocation (LDA) model. While there are academic implementations in Python and other languages, they are very slow for large datasets. | Already being worked on by @olavurmortensen |
| Distributed computing for Latent Dirichlet Allocation (related to distributed-computing) | Latent Dirichlet Allocation (LDA) is a very popular algorithm for modelling topics of text documents. Modern data mining relies on high-level distributed frameworks like Hadoop, Spark, Celery, Disco, Samza and Ibis. While there are implementations of distributed LDA in Scala over Spark and in other languages, there is no established distributed computing framework that contains an LDA implementation in Python. You will contribute a scalable implementation of distributed LDA to the data science world in Python, building on top of one of the existing distributed frameworks. | |
| Distributed computing for Latent Semantic Indexing (related to distributed-computing) | Latent Semantic Indexing (LSI) is a very popular algorithm for modelling topics of text documents. Modern data mining relies on high-level distributed frameworks like Hadoop, Spark, Celery, Disco, Samza and Ibis. While there are implementations of distributed LSI in Scala over Spark and in other languages, there is no established distributed computing framework that contains an LSI implementation in Python. You will contribute a scalable implementation of distributed LSI to the data science world in Python, building on top of one of the existing distributed frameworks. | |
| Distributed computing for word2vec (related to distributed-computing) | Word2Vec is a continuous word representation technique for creating word vectors that capture the syntax and semantics of words. The vectors used to represent the words have many interesting features, for example king−man+woman=queen. Modern data mining relies on high-level distributed frameworks like Hadoop, Spark, Celery, Disco, Samza and Ibis. While there are implementations of distributed word2vec in Scala over Spark and in other languages, there is no established distributed computing framework that contains a word2vec implementation in Python. You will contribute a scalable implementation of distributed word2vec to the data science world in Python, building on top of one of the existing distributed frameworks. | |
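
The generative story in the sLDA background above (topic proportions drawn from a Dirichlet, then a topic and a word per position, plus a target variable) can be sketched roughly as follows. This is a toy illustration only, not an existing gensim or sLDA API: the function names and the toy topics are made up, and modelling the response as a Gaussian around a weighted average of the topic counts is an assumption taken from the usual sLDA formulation, not something stated in this thread.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw topic proportions from a Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document with the vanilla LDA generative process.

    `topics` is a list of dicts mapping word -> probability (hypothetical toy data).
    Returns the topic proportions, the per-word topic assignments and the words.
    """
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(topics)))
    assignments, words = [], []
    for _ in range(length):
        # Choose a topic assignment from the document's topic proportions...
        z = rng.choices(topic_ids, weights=theta)[0]
        # ...then draw a word from the corresponding topic.
        vocabulary = list(topics[z])
        weights = [topics[z][word] for word in vocabulary]
        words.append(rng.choices(vocabulary, weights=weights)[0])
        assignments.append(z)
    return theta, assignments, words


def sample_response(assignments, eta, sigma, rng):
    """Assumed sLDA-style target: Gaussian around eta . (empirical topic frequencies)."""
    z_bar = [assignments.count(k) / len(assignments) for k in range(len(eta))]
    mean = sum(e * z for e, z in zip(eta, z_bar))
    return rng.gauss(mean, sigma)


if __name__ == "__main__":
    rng = random.Random(42)
    toy_topics = [{"film": 0.5, "actor": 0.5}, {"boring": 0.6, "slow": 0.4}]
    theta, assignments, words = generate_document(toy_topics, [0.5, 0.5], 10, rng)
    stars = sample_response(assignments, eta=[5.0, 1.0], sigma=0.5, rng=rng)
    print(words, round(stars, 2))
```

Fitting the model means inverting this process: given the words and the observed target (stars, "Likes"), infer the topics, the proportions and the coefficients `eta`.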

YouYueHuang commented 7 years ago

@souravsingh, how would you like it to be added to the feature proposal list? Could you give me explicit directions so I can help you?

piskvorky / gensim

Fix irrelevant wiki pages #1604
