thoth-station / ps-nlp

This is a repository for a Predictive Stack for Natural Language Processing (NLP)
GNU General Public License v3.0

Create natural language processing image #3

Closed pacospace closed 3 years ago

pacospace commented 3 years ago

Is your feature request related to a problem? Please describe.

As a Data Scientist working on NLP, I want to have an image with some specific libraries for my NLP project.

Describe the solution you'd like

- nltk
- Gensim
- spaCy
- pytorch
- tensorflow
- keras
- scikit-learn
- pandas
- numpy
- matplotlib

nice to have:

- huggingface
- transformers
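A quick way to confirm that an image actually ships the requested libraries is to check which of them are importable. The following is only a minimal sketch: the import names in `REQUESTED` are assumptions based on the usual module names for these packages (for example, scikit-learn imports as `sklearn` and pytorch as `torch`), and `missing_packages` is a hypothetical helper, not part of this repository.

```python
from importlib.util import find_spec

# Assumption: the usual top-level module name for each requested package.
REQUESTED = {
    "nltk": "nltk",
    "Gensim": "gensim",
    "spaCy": "spacy",
    "pytorch": "torch",
    "tensorflow": "tensorflow",
    "keras": "keras",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "transformers": "transformers",
}


def missing_packages(modules):
    """Return the package names whose top-level module cannot be found."""
    return sorted(
        name for name, module in modules.items() if find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages(REQUESTED)
    print("missing:", ", ".join(missing) if missing else "none")
```

Run inside a container, this lists whatever is absent without importing the heavy frameworks themselves.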

Describe alternatives you've considered

Additional context

Question: Should all three ML frameworks be in the same image, or should there be different images: NLP-pytorch, NLP-tensorflow, NLP-scikit-learn?

cc @harshad16

goern commented 3 years ago

/kind feature
/priority important-soon

ViitasaariVille commented 3 years ago

"All three ML frameworks in the same image? Or different images NLP-pytorch, NLP-tensorflow, NLP-scikit-learn" --> In my experience, people often have separate Python virtual environments for pytorch, tensorflow, etc., so it would probably make sense to have these as different images.

pacospace commented 3 years ago

That was my thought too, @ViitasaariVille. Thanks for answering! We will proceed to create three NLP overlays, one for each of the three images.

If more combinations are required, that is not a problem with the architecture we have for the builds.

goern commented 3 years ago

See also https://github.com/thoth-station/core/pull/290

pacospace commented 3 years ago

Hi @ViitasaariVille, for spaCy and NLTK, do you need some trained language models and data to be already available in the image?

For example, for spaCy we could include the English trained model, and for NLTK the different models for chunkers, grammars, misc, sentiment, taggers, corpora, help, models, stemmers, tokenizers, etc.
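Whether a given image already bundles such data can be checked at runtime. The sketch below is an illustration only: `check_nlp_resources` is a hypothetical helper, and the default resource names (`tokenizers/punkt` for NLTK and `en_core_web_sm` for spaCy) are assumptions, since the thread does not name a specific model. It relies on `nltk.data.find`, which raises `LookupError` when a resource is absent, and on `spacy.util.is_package`.

```python
def check_nlp_resources(nltk_resource="tokenizers/punkt",
                        spacy_model="en_core_web_sm"):
    """Report whether NLTK data and a spaCy model are available locally.

    Values are True/False, or None when the library itself is not installed.
    """
    status = {}

    try:
        import nltk.data
    except ImportError:
        status[f"nltk:{nltk_resource}"] = None
    else:
        try:
            nltk.data.find(nltk_resource)  # raises LookupError if absent
            status[f"nltk:{nltk_resource}"] = True
        except LookupError:
            status[f"nltk:{nltk_resource}"] = False

    try:
        from spacy.util import is_package
    except ImportError:
        status[f"spacy:{spacy_model}"] = None
    else:
        # is_package reports whether the model is installed as a package.
        status[f"spacy:{spacy_model}"] = is_package(spacy_model)

    return status


if __name__ == "__main__":
    for resource, available in check_nlp_resources().items():
        print(resource, "->", available)
```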

ViitasaariVille commented 3 years ago

Hi @pacospace, I'm not an expert on NLP, but I asked my colleague and he thinks this one would be generally useful: https://www.nltk.org/_modules/nltk/tokenize/punkt.html. NLTK's SnowballStemmer has also been useful, at least for me, in the past when doing tf-idf, text classification (I've been using SnowballStemmer(language='finnish')), topic analysis, etc. I'm guessing small language models for all possible languages would be generally useful, as you describe above: "for nltk include the different models for chunkers, grammars, misc, sentiment, taggers, corpora, help, models, stemmers, tokenizers, etc..". Then again, BERT models for several languages, for example, are just too large to include in an image. We will probably be using https://github.com/TurkuNLP/FinBERT, which is a BERT model for the Finnish language (not financial BERT :) ), but we'll be uploading these separately into an OCS S3 bucket.
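To illustrate how a stemmer fits into the tf-idf use case mentioned above, here is a minimal standard-library sketch. It is not the workflow used in this thread: `tf_idf` is a hypothetical helper, whitespace tokenization is a simplification, and the weighting (term frequency times `log(N / document frequency)`) is only one of several common tf-idf variants. The `stem` parameter accepts any callable, so NLTK's stemmer can be plugged in where it is installed.

```python
import math
from collections import Counter


def tf_idf(documents, stem=lambda token: token):
    """Return one {term: weight} dict per document.

    tf = count / document length, idf = log(N / document frequency).
    """
    tokenized = [[stem(tok) for tok in doc.lower().split()] for doc in documents]
    # Number of documents in which each (stemmed) term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# With NLTK installed, a Finnish stemmer could be passed in like this:
#   from nltk.stem.snowball import SnowballStemmer
#   weights = tf_idf(docs, stem=SnowballStemmer(language="finnish").stem)
```

Stemming before counting means inflected forms of the same word share one term, which matters a lot for a heavily inflected language such as Finnish.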

pacospace commented 3 years ago

NLP images created: ps-nlp (the basic NLP image), ps-nlp-pytorch and ps-nlp-tensorflow. README: https://github.com/thoth-station/ps-nlp.

In the README, you can find descriptions of the packages in each image, as well as links to the Quay images.

Feel free to test them and please let us know if they match all requirements, otherwise, feel free to open more issues/features in this repo and we will improve them 🙂