jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
MIT License
2.89k stars 239 forks

πŸ‘©β€πŸ’» API next steps: checklist #85

Open jbesomi opened 4 years ago

jbesomi commented 4 years ago

The following is a high-level view of the next main enhancement steps. This document will be kept up to date and improved frequently. The work will be conducted mainly by @mk2510 and @henrifroese as part of their SummerOfCode project.


  1. Version 1.10

    • [x] Every representation function to receive as input a TokenSeries #44
    • [x] Decouple TF-IDF L2-normalization and TF-IDF #76
    • [x] Rename term_frequency to count() + add function term_frequency #61
    • [x] Introduce HeroSeries
    • [x] Add ~ hero.norm(RepresentationSeries, "l1"/"l2")
    • [x] Can we avoid the use of VectorSeries/TokenSeries?
    • [x] All representation functions to deal with HeroSeries + (DocumentTermDF) #43
    • [ ] Update README + getting-started.md
    • [ ] Push a new version to PyPi
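To make the distinction between count(), term_frequency and the decoupled L1/L2 normalization concrete, here is a minimal pure-Python sketch. It is not Texthero's implementation: the function names `count_terms`, `term_frequency` and `normalize` are hypothetical, and it assumes the common per-document definition of term frequency, which may differ from the one Texthero ends up using.

```python
import math
from collections import Counter


def count_terms(tokens):
    """Raw number of occurrences of each term in one tokenized document."""
    return dict(Counter(tokens))


def term_frequency(tokens):
    """Counts divided by the document length (assumed per-document definition)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


def normalize(vector, norm="l2"):
    """Scale a {term: weight} vector to unit L1 or L2 norm."""
    if norm == "l1":
        length = sum(abs(v) for v in vector.values())
    elif norm == "l2":
        length = math.sqrt(sum(v * v for v in vector.values()))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if length == 0:
        # An all-zero vector cannot be scaled; return it unchanged.
        return dict(vector)
    return {term: v / length for term, v in vector.items()}


doc = ["a", "document", "a", "test"]
counts = count_terms(doc)        # raw counts
freqs = term_frequency(doc)      # relative frequencies
unit = normalize(counts, "l2")   # normalization applied as a separate step
```

Keeping normalization as its own step means the same function can be applied to counts, term frequencies, or TF-IDF weights.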
  2. Performance: speed-up the library

    • [ ] Most Texthero data structures are lists of lists ([["a", "document"], ["another", "document"]]); can we leverage parallelization? We can learn from spaCy. Mandatory read: 100-times-faster-nlp; look at this for parallelization
    • [ ] Make spaCy functions faster + Dask vs spaCy #65
    • [ ] Depending on the previous task, evaluate whether we want spaCy as the default tokenizer: #131
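As a rough illustration of the parallelization idea, the sketch below splits the documents into chunks and maps a tokenizer over the chunks with an executor from the standard library. Everything here is an assumption made for illustration: `tokenize` is a whitespace placeholder rather than Texthero's or spaCy's tokenizer, and `parallel_tokenize` is a hypothetical helper. Threads are used only to keep the example simple; for CPU-bound pure-Python work a process-based executor would likely be needed to see a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    """Placeholder tokenizer: split on whitespace."""
    return text.split()


def tokenize_chunk(chunk):
    """Tokenize every document in one chunk."""
    return [tokenize(text) for text in chunk]


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parallel_tokenize(texts, chunk_size=2, max_workers=4,
                      executor_cls=ThreadPoolExecutor):
    """Tokenize documents chunk by chunk, preserving the input order."""
    chunks = list(chunked(texts, chunk_size))
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map returns results in the order the chunks were submitted.
        results = pool.map(tokenize_chunk, chunks)
        return [tokens for chunk in results for tokens in chunk]


docs = ["a document", "another document", "a third one"]
tokens = parallel_tokenize(docs)
```

Working on chunks rather than single documents keeps the per-task overhead small, which matters once the executor is swapped for a process pool.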
  3. Software development:

    • [x] Integrate checking for correct Series types (#60, #55, ...)
    • [ ] Check hero functions work with np.nan #86
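One possible way to make string functions tolerate missing values is a small wrapper that passes them through untouched. The sketch below is an assumption, not how Texthero handles this: `skip_missing` and `lowercase` are hypothetical names, and `float("nan")` stands in for np.nan (which is also a float NaN) so the example needs only the standard library.

```python
import math
from functools import wraps


def is_missing(value):
    """True for None and for float NaN (np.nan is a float NaN as well)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def skip_missing(func):
    """Apply `func` to each value, leaving missing values unchanged."""
    @wraps(func)
    def wrapper(values):
        return [v if is_missing(v) else func(v) for v in values]
    return wrapper


@skip_missing
def lowercase(text):
    """Per-document transformation that would fail on NaN without the wrapper."""
    return text.lower()


cleaned = lowercase(["Hello World", float("nan"), None, "TEXT"])
```

Without the wrapper, calling `.lower()` on a float NaN raises AttributeError, which is the kind of failure such a check is meant to catch.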
  4. Support Embeddings through Flair

    • [ ] Add hero.embed(s, flairEmbedding)
  5. Add Topic Modeling

    • [ ] Add topic modeling support under representation #42. This also includes "topic modeling visualization" to get insights out of it
    • [ ] Add a blog article on how topic modeling with Texthero works
  6. Extra

    1. test coverage
    2. expand multilingual support: more languages; detect the language and select the correct one
    3. (low priority) Text summarization (#38) and characteristic terms (#2)
henrifroese commented 4 years ago

Merge-Plan