This document gives a high-level view of the next main enhancement steps. It will be kept up to date and revised frequently. The work will mainly be carried out by @mk2510 and @henrifroese as part of their Summer of Code project.

Version 1.10
- [x] Every representation function to receive as input a TokenSeries #44
- [x] Decouple TF-IDF L2-normalization and TF-IDF #76
- [x] Rename term_frequency to count() + add function term_frequency #61
- [x] Introduce HeroSeries
- [x] Add ~ hero.norm(RepresentationSeries, "l1"/"l2")
- [x] Can we avoid the use of VectorSeries/TokenSeries?
- [x] All representation functions to deal with HeroSeries + (DocumentTermDF) #43
- [ ] Update README + getting-started.md
- [ ] Push a new version to PyPI
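
To illustrate what decoupling TF-IDF from its normalization (#76) and a separate norm step with "l1"/"l2" could look like, here is a minimal plain-Python sketch. It is not Texthero's implementation: the function names `tfidf` and `norm`, the dict-based vectors, and the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return unnormalized TF-IDF weights for tokenized documents.

    Assumption: raw term counts multiplied by a smoothed idf,
    idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def norm(vectors: List[Dict[str, float]], kind: str = "l2") -> List[Dict[str, float]]:
    """Scale each vector to unit L1 or L2 length, independently of tfidf()."""
    normalized = []
    for vec in vectors:
        if kind == "l1":
            length = sum(abs(v) for v in vec.values())
        elif kind == "l2":
            length = math.sqrt(sum(v * v for v in vec.values()))
        else:
            raise ValueError(f"unknown norm: {kind!r}")
        # Leave all-zero (or empty) vectors unchanged to avoid dividing by zero.
        normalized.append({t: v / length for t, v in vec.items()} if length else dict(vec))
    return normalized


docs = [["a", "document"], ["another", "document"]]
weights = tfidf(docs)             # unnormalized weights
unit_l2 = norm(weights, "l2")     # normalization applied as a separate step
```

Keeping the two steps separate lets a caller choose the norm, or skip it entirely, without recomputing the weights.
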
Performance: speed-up the library
- [ ] Most Texthero data structures are lists of lists ([["a", "document"], ["another", "document"]]); can we leverage parallelization? We can learn from spaCy. Mandatory read: 100-times-faster-nlp; look at this for parallelization
- [ ] Make spaCy functions faster + Dask vs spaCy #65
- [ ] Depending on the previous task, evaluate whether we want spaCy as the default tokenizer: #131
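
As a rough idea of how chunk-wise parallelization over a list of documents could work, here is a small sketch using only the standard library. It is not how Texthero, spaCy or Dask do it: `tokenize_chunk`, `split_into_chunks` and `parallel_tokenize` are hypothetical names, and whitespace splitting stands in for a real tokenizer.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import List, Sequence


def tokenize_chunk(chunk: Sequence[str]) -> List[List[str]]:
    """Tokenize one chunk of documents (whitespace split as a stand-in)."""
    return [doc.split() for doc in chunk]


def split_into_chunks(docs: Sequence[str], n_chunks: int) -> List[Sequence[str]]:
    """Split docs into at most n_chunks contiguous, nearly equal parts."""
    n_chunks = max(1, min(n_chunks, len(docs)))
    size, rest = divmod(len(docs), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rest else 0)
        chunks.append(docs[start:end])
        start = end
    return chunks


def parallel_tokenize(docs, n_workers=2, executor_cls=ProcessPoolExecutor):
    """Tokenize docs chunk by chunk on a pool of workers, preserving order."""
    chunks = split_into_chunks(list(docs), n_workers)
    with executor_cls(max_workers=n_workers) as executor:
        # executor.map yields results in the same order as the input chunks.
        results = executor.map(tokenize_chunk, chunks)
        return [tokens for chunk in results for tokens in chunk]
```

Sending whole chunks rather than single documents keeps the per-task overhead low. With `ProcessPoolExecutor`, call `parallel_tokenize` from under an `if __name__ == "__main__":` guard on platforms that spawn worker processes; `ThreadPoolExecutor` can be passed as `executor_cls` for quick experiments, although it will not speed up CPU-bound pure-Python work.
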
Software development:
- [x] Integrate checking for correct Series types (#60, #55, ...)
- [ ] Check hero functions work with np.nan #86
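
One possible way to make a text function tolerate missing values is to pass them through untouched, sketched below in plain Python. This is only an illustration, not Texthero's behavior: `is_missing` and `apply_skipping_missing` are hypothetical helpers, and a plain list stands in for a pandas Series. Since np.nan is an ordinary Python float NaN, `math.isnan` detects it.

```python
import math
from typing import Any, Callable, List


def is_missing(value: Any) -> bool:
    """True for None and for float NaN (np.nan is a float NaN)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def apply_skipping_missing(values: List[Any], func: Callable[[Any], Any]) -> List[Any]:
    """Apply func to every present value and leave missing values as they are."""
    return [value if is_missing(value) else func(value) for value in values]


cleaned = apply_skipping_missing(["Hello World", float("nan"), None], str.lower)
```

A check for #86 could then assert, for each hero function, that missing entries neither raise an error nor get silently turned into strings such as "nan".
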
Support Embeddings through Flair
- [ ] Add hero.embed(s, flairEmbedding)
Add Topic Modeling
- [ ] Add topic modeling support under representation #42. This also includes "topic modeling visualization" to get insights out of it
- [ ] Add a blog article on how topic modeling with Texthero works
Extra
1. test coverage
2. expand multilingual support: more languages; recognize the language and select the correct one
3. (low priority) Text summarization (#38) and characteristic terms (#2)

Merge-Plan

- [x] #156 (representation series to multicolumn). Branched from Texthero Master
- [x] #157 (hero types in representation & DocumentTermDF in _types). Branched from #156 ~#158 (add pandas setitem support for DocumentTermDF). Branched from #156~
- [x] #174 (Fix type checks). Branched from Texthero Master
- [ ] #117 (getting-started), #118 (README), #135 (getting-started hero-types). Branched from Texthero Master
- [ ] RELEASE NEW VERSION
- [ ] #146 (Flair Embeddings). Branched from Texthero Master
- [x] #160 (Travis Clean-Up). Branched from Texthero Master
- [x] #161 (Pre-commit hook). Branched from Texthero Master
- [ ] #162 (Speed-Up Preprocessing+NLP). Branched from Texthero Master
- [ ] #163 (Topic Modelling w/ Visualizations). Branched from Texthero Master
- [x] #165 (Fix term_frequency). Branched from Texthero Master
- [ ] #167 (Train-Test Split). Branched from Texthero Master
- [ ] #168 (Describe DF). Branched from Texthero Master
- [ ] #169 (filter extremes). Branched from Texthero Master
- [ ] #170 (ClusterSeries Type). Branched from Texthero Master
- [ ] #175 (Visualization Tutorial). Branched from Texthero Master
- [ ] #176 (NLP Tutorial). Branched from Texthero Master
- [ ] #177 (Show DataFrame). Branched from Texthero Master
- [ ] #178 (Visualize Describe DF). Branched from #168
