Open menshikh-iv opened 7 years ago
Hello, can I do it?
@YouYueHuang Hello, yes, but please discuss all text here first, before editing.
No problem
For now, I have some questions about parts of your to-do list: (1) Developer page: May I ask what the differences are between the current state and the previous state?
(2) GSOC 2017 project: How would you like to mark it as archive?
(3) Student Projects / Ideas & Feature proposals: @menshikh-iv mentioned you would like to
Here is an example of a table of contents (https://github.com/d3/d3/blob/master/API.md#scales-d3-scale).
(4) For roadmap and Recipes & FAQ, how could I know what to write? Is there any preferred format or writing style for you?
@YouYueHuang
(1) I think I'll fill it in myself (because releases are now my responsibility)
(2) Add prefix [Archive]
(3-4) Need to continue a discussion with @piskvorky here.
@menshikh-iv ok, for now I will focus on Student Projects / Ideas & Feature proposals. I have sent an email to @piskvorky and told him we will discuss the change here.
Hi @menshikh-iv, I can see @piskvorky leaving messages in other issues, but he did not respond to this one. Could you please tell me how to contact him? Many thanks.
Hello @YouYueHuang , please follow @menshikh-iv 's instructions here, I have no additional information. Thanks!
@YouYueHuang I'll notify you when I am ready with a detailed plan
sorry, misclick
@menshikh-iv @piskvorky Thanks for updating the information. I will wait for your detailed plan.
@piskvorky @YouYueHuang Some ideas about the features & proposal page + student projects:
Need to merge the features & proposal page + student projects
Need to add a more detailed description for all projects (with background, todo, and resources sections)
Details
"mark as WIP" means adding a short line after the heading: "currently in progress, see PR #... by @..."
project | status | priority | action |
---|---|---|---|
Visualization | already implemented by @parulsethi + last project #1616 | high | mark as WIP |
Sanity checks | always relevant | medium | - |
Model selection | implemented as a "side effect" of sklearn-api by @chinmayapancholi13 | - | remove |
Distributed computing | relevant | high | - |
Distributed sim queries | previously simserver (no longer maintained), now the scaletext project | - | remove |
Online NNMF | very relevant for us (and very hard to implement) | high | merge with description from student project page |
sLDA | WIP by @souravsingh | medium | mark as WIP, merge with description from student project page |
ESA | WIP by @shubhamjain74 | medium | merge with description from student project page |
DTM improvements | implemented as a wrapper, further work isn't relevant | - | remove |
Nested Hierarchical Dirichlet Processes | already implemented by @olavurmortensen | - | remove |
nHDP | - | low | - |
Pachinko Allocation Model | - | low | add todo section |
Sparse-tool package | need to ask @souravsingh about it | medium | (2) |
GLoVE | - | low | (2) |
WordRank | implemented as a wrapper by @parulsethi | - | remove |
Wrapper for BigARTM | irrelevant | - | remove |
Add Montemurro and Zanette algorithm | - | low | (2) |
VarEmbed | irrelevant for now (fasttext is very similar) | - | remove |
Will be continued soon
@menshikh-iv The Sparsetool package is somewhat complex, since we are dealing with C code. There was some progress here: https://github.com/scipy/scipy/pull/7127 but it has halted due to test failures. It would be good to revisit this.
@piskvorky, @menshikh-iv This is the simplified version of the student projects page. If you have any feedback, feel free to write it in a comment. What I did:
If you'd like to work on any of the topics below, you will contribute a scalable implementation of the algorithms to the data science world in Python. A quality implementation will be widely used in the industry. RaRe-Technologies offers a financial reward and technical and academic assistance for the projects below.
Goal:
Deliverables
Code: a pull request against gensim on GitHub. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize insights into documentation tips and examples.
Report: timings, memory use and accuracy of your implementation on the English Wikipedia corpus, the Cornell Movie review corpus, the Lee corpus of human similarity judgements, the "20 newsgroups" corpus, or other freely available datasets. A summary of insights into parameter selection and tuning of the model. For distributed-computing-based projects, measurements of how performance changes as cores and machines are added to the cluster are particularly valued.
Project | Background | Status |
---|---|---|
Online NNMF (related to Online NNMF) | | |
Explicit Semantic Analysis (related to ESA) | | |
Supervised Latent Dirichlet Allocation (related to Supervised LDA) | | Consider integration with existing Python sLDA |
Word Movers Distance for word2vec | | Already being worked on by @RishabGoel |
Author-Topic Models | | Already being worked on by @olavurmortensen |
Distributed computing for Latent Dirichlet Allocation (related to distributed-computing) | | |
Distributed computing for Latent Semantic Indexing (related to distributed-computing) | | |
Distributed computing for word2vec (related to distributed-computing) | | |
@souravsingh, how would you like to add it to the feature proposal list? Could you give me explicit directions so I can help you?
We have several pages in the wiki, but a large part of the information is outdated. We need to fix it.