scicloj / scicloj.ml.top2vec

Use top2vec model from Clojure
Eclipse Public License 1.0

scicloj.ml.top2vec

Adapter to run the Python top2vec topic model in scicloj.ml

General setup

You need to set up a Clojure REPL with:

Docker based setup

I provide a Dockerfile here that performs the above installation correctly. With it, a working REPL running in Docker can be started with:

docker run -ti -v $HOME/.m2:/home/user/.m2 -v "$(pwd):/app" -p 12345:12345 -w /app scicloj.ml.top2vec python3 -c "import cljbridge;cljbridge.init_clojure_repl(port=12345,bind='0.0.0.0')"

The following code then trains the top2vec model on some texts.


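  ;; Load the adapter and the helper libraries used below.
  (require '[scicloj.ml.top2vec :refer :all]
           '[scicloj.metamorph.ml]
           '[clojure.pprint]
           '[camel-snake-kebab.core :as csk]
           '[tablecloth.api :as tc])

  ;; Read the gzipped reviews CSV; column names become kebab-case keywords.
  (def raw-data
    (tc/dataset "https://github.com/scicloj/scicloj.ml.smile/blob/main/test/data/reviews.csv.gz?raw=true"
                {:key-fn csk/->kebab-case-keyword
                 :file-type :csv
                 :gzipped? true}))

  ;; Shuffle with a fixed seed, keep the first 10000 rows and only the :text column.
  (def data
    (-> raw-data
        (tc/shuffle {:seed 123})
        (tc/head 10000)
        (tc/select-columns :text)
        tc/drop-missing))

  ;; Train top2vec on the :text column.
  (def train-result-learn
    (scicloj.metamorph.ml/train data {:speed :learn
                                      :model-type :top2vec
                                      :min_count 1
                                      :documents-column :text}))

  ;; Print the training result without the serialized model bytes.
  (clojure.pprint/pprint (update-in train-result-learn [:model-data] dissoc :model-as-bytes))

  ;; Recover the trained model as a Python object.
  (def top2vec-model-py (scicloj.metamorph.ml/thaw-model train-result-learn))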

The resulting top2vec-model-py is the Python object of the trained model. It can be used from Clojure through libpython-clj calls to its API: https://top2vec.readthedocs.io/en/latest/api.html
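
As a rough illustration of such direct calls, the sketch below wraps two methods from the top2vec API, get_num_topics and get_topic_sizes. It assumes that libpython-clj2 is on the classpath under the namespace libpython-clj2.python. The helper names num-topics and topic-sizes are made up for this example and are not part of scicloj.ml.top2vec.

```clojure
(require '[libpython-clj2.python :as py])

;; Hypothetical helper: calls model.get_num_topics() on the Python object.
(defn num-topics [model]
  (py/py. model get_num_topics))

;; Hypothetical helper: calls model.get_topic_sizes(), which the top2vec
;; API documents as returning the topic sizes together with the topic numbers.
(defn topic-sizes [model]
  (py/py. model get_topic_sizes))

(comment
  ;; Requires the trained model from the steps above.
  (num-topics top2vec-model-py)
  (topic-sizes top2vec-model-py))
```

The results come back as bridged Python values, so further conversion may be needed before using them with Clojure collection functions.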

For a few cases I provide wrappers for the Python API. A word cloud of a topic (the first one in this case) can be obtained as an SVG string with:

(wc->svg top2vec-model-py (first (get-all-word-scores top2vec-model-py)) 100 100)
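
Since wc->svg returns the SVG as a string, one simple way to inspect the result is to write it to a file and open that file in a browser. The helper save-svg! and the file name topic-0.svg are assumptions made for this sketch, not part of the library.

```clojure
;; Hypothetical helper: write an SVG string to `path` and return the path.
(defn save-svg! [svg-string path]
  (spit path svg-string)
  path)

(comment
  ;; Requires the trained model from the steps above.
  (save-svg! (wc->svg top2vec-model-py
                      (first (get-all-word-scores top2vec-model-py))
                      100 100)
             "topic-0.svg"))
```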

License

Copyright © 2021 Carsten Behring

Distributed under the Eclipse Public License version 1.0.