The following is a high-level view of the next main enhancement steps. This document will be kept up to date and improved frequently. The work will mainly be conducted by @mk2510 and @henrifroese as part of their SummerOfCode project.
Version 1.10
[x] Every representation function should receive a TokenSeries as input #44
[x] Decouple TF-IDF L2-normalization and TF-IDF #76
[x] Rename term_frequency to count() + add function term_frequency #61
[x] Can we avoid the use of VectorSeries/TokenSeries?
[x] All representation functions to deal with HeroSeries + (DocumentTermDF) #43
[ ] Update README + getting-started.md
[ ] Push a new version to PyPI
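As an illustration of the decoupling item above (#76), here is a minimal sketch that keeps counting, TF-IDF weighting, and L2 normalization as three separate steps. It uses only the standard library. The function names `count_terms`, `tfidf_raw`, and `l2_normalize` are hypothetical and are not Texthero's API, and the smoothed IDF formula is an assumption chosen for the example.

```python
import math
from collections import Counter
from typing import Dict, List

Tokens = List[str]
SparseVector = Dict[str, float]


def count_terms(docs: List[Tokens]) -> List[Dict[str, int]]:
    """Raw term counts per tokenized document."""
    return [dict(Counter(doc)) for doc in docs]


def tfidf_raw(docs: List[Tokens]) -> List[SparseVector]:
    """TF-IDF weights without any normalization.

    Assumption: smoothed IDF, idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1.
    """
    n_docs = len(docs)
    counts = count_terms(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc_counts in counts for term in doc_counts)
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    return [
        {term: tf * idf[term] for term, tf in doc_counts.items()}
        for doc_counts in counts
    ]


def l2_normalize(vectors: List[SparseVector]) -> List[SparseVector]:
    """Scale each vector to unit Euclidean length; empty vectors stay empty."""
    normalized = []
    for vector in vectors:
        norm = math.sqrt(sum(value * value for value in vector.values()))
        if norm == 0:
            normalized.append(dict(vector))
        else:
            normalized.append({t: value / norm for t, value in vector.items()})
    return normalized
```

With the steps separated, a caller can request raw weights with `tfidf_raw(docs)` or opt into normalization explicitly with `l2_normalize(tfidf_raw(docs))`.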
Performance: speed up the library
[ ] Most Texthero data structures are lists of lists ([["a", "document"], ["another", "document"]]); can we leverage parallelization? We can learn from spaCy. Mandatory read: 100-times-faster-nlp; look at this for parallelization
[ ] Make spaCy functions faster + Dask vs spaCy #65
[ ] Depending on the previous task, evaluate whether we want spaCy as the default tokenizer: #131
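To make the parallelization question above concrete, here is a minimal sketch of one possible approach: split the list of tokenized documents into contiguous chunks and process each chunk in a separate worker. It uses only `concurrent.futures` from the standard library. `lowercase_tokens`, `split_into_chunks`, and `parallel_apply` are hypothetical names for illustration, not Texthero functions, and whether this pays off in practice would still need benchmarking.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, List

Tokens = List[str]


def lowercase_tokens(docs: List[Tokens]) -> List[Tokens]:
    """Hypothetical per-chunk step: lowercase every token."""
    return [[token.lower() for token in doc] for doc in docs]


def split_into_chunks(docs: List[Tokens], n_chunks: int) -> List[List[Tokens]]:
    """Split docs into at most n_chunks contiguous chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size = max(1, -(-len(docs) // n_chunks))  # ceiling division
    return [docs[i:i + size] for i in range(0, len(docs), size)]


def parallel_apply(
    docs: List[Tokens],
    func: Callable[[List[Tokens]], List[Tokens]],
    n_chunks: int = 4,
    executor_cls=ProcessPoolExecutor,
) -> List[Tokens]:
    """Apply func to each chunk in parallel and concatenate the results.

    With ProcessPoolExecutor, func must be picklable (a module-level
    function). Pass ThreadPoolExecutor for work that releases the GIL.
    """
    chunks = split_into_chunks(docs, n_chunks)
    if not chunks:
        return []
    with executor_cls(max_workers=len(chunks)) as executor:
        # executor.map yields results in the same order as the input chunks.
        results = executor.map(func, chunks)
        return [doc for chunk in results for doc in chunk]
```

Processing whole chunks rather than single documents keeps the cost of pickling and inter-process communication low relative to the work done per task. When running with the default process-based executor from a script, the call to `parallel_apply` should sit under an `if __name__ == "__main__":` guard.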
Software development:
[x] Integrate checking for correct Series types (#60, #55, ...)
[x] #156 (representation series to multicolumn). Branched from Texthero Master
[x] #157 (hero types in representation & DocumentTermDF in _types). Branched from #156
~#158 (add pandas setitem support for DocumentTermDF). Branched from #156~
[x] #174 (Fix type checks). Branched from Texthero Master
Support Embeddings through Flair
Add Topic Modeling
Extra