jekyll / classifier-reborn

A general classifier module to allow Bayesian and other types of classifications. A fork of cardmagic/classifier.
https://jekyll.github.io/classifier-reborn/
GNU Lesser General Public License v2.1

Executable for training with a persistent data store #99

Open Ch4s3 opened 7 years ago

Ch4s3 commented 7 years ago

Per @parkr's idea, it might be useful to have an executable that could be used to train and classify inputs for systems using persistent datastores.

ibnesayeed commented 7 years ago

I was thinking about it, but I thought it would be beyond the scope of this gem. Instead, a separate repo can be created that uses this gem to provide a full-blown CLI. Here is how I envision it (assuming that the executable is named classifier):

# Default store: redis://127.0.0.1:6380/0, but customizable using CLI flag such as:
#     --store=redis://user:secret@example.com:6380/2
#     --store=postgresql://user:secret@example.com:5433/classifierdb

$ classifier train {class} {file_path|url|string|STDIN}
# If a file path is given as the last argument then read the content of the file
# If the input is a URL then fetch the content from the URL

# Or automatic batch training based on the sub-folder names
$ classifier train /path/to/training/folder
# Classes can be inferred from the names of the sub-folders of /path/to/training/folder
# Files from each sub-folder can be used as individual training instances
# Some built-in cleaners can be applied (by default or with a flag) such as removing markup if the files are HTML

$ classifier untrain {class} {file_path|url|string|STDIN}

# Or automatic batch untraining based on the sub-folder names
$ classifier untrain /path/to/untraining/folder

$ classifier classify {file_path|url|string|STDIN}

# Or automatic batch classification of files from a directory
$ classifier classify /path/to/data/folder
# => Two columns of output on STDOUT; class name and file path for each file
# Alternatively, the files can be copied/moved into class-named sub-folders of the output directory
$ classifier classify /path/to/data/folder /path/output/base/folder
# Copy /path/to/data/folder/record.txt to /path/output/base/folder/{class}/record.txt

Further to this, a sub-command server can be added to expose these functionalities over HTTP. We can use something like Sinatra for routing.

$ classifier server --namespace=/foo --store=redis://user:secret@example.com:6380/2 --port=2017
# Listening on http://localhost:2017
# GET /foo/train/{class}/{string|url}
# POST /foo/train/{class} [upload_file]
# GET /foo/untrain/{class}/{string|url}
# POST /foo/untrain/{class} [upload_file]
# GET /foo/classify/{string|url}
# POST /foo/classify [upload_file]

Ideally, training should be done only with the POST method, untraining with PUT/PATCH, and classification with GET. However, supplying a big text file in a GET path could be tricky. The namespace could default to empty, but having one would allow serving multiple classifiers from the same server. Perhaps it could also be configured to use a specific store for each namespace.
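
To make the namespace idea concrete, here is a framework-free sketch of how a request could be mapped to an action. It uses the stricter verb mapping from the paragraph above (POST to train, PUT or PATCH to untrain, GET or POST to classify, with POST kept so large bodies can be uploaded), not the GET-based listing. The name `resolve_route` and the shape of the returned hash are hypothetical, and percent-decoding of the path is omitted. In Sinatra the same mapping would be written as route blocks.

```ruby
# Accepted HTTP verbs per action (an assumption based on the "ideal"
# mapping discussed in the thread).
ALLOWED_VERBS = {
  'train'    => %w[POST],
  'untrain'  => %w[PUT PATCH],
  'classify' => %w[GET POST]
}.freeze

# Map a verb and path to { action:, category:, payload: }, or nil when the
# request does not match. namespace is a path prefix such as "/foo".
def resolve_route(verb, path, namespace: '')
  prefix = namespace.chomp('/')
  return nil unless path.start_with?("#{prefix}/")

  action, *rest = path[(prefix.length + 1)..-1].split('/', 3)
  return nil unless ALLOWED_VERBS.fetch(action, []).include?(verb.upcase)

  # classify has no class segment; train and untrain require one.
  category = action == 'classify' ? nil : rest.shift
  return nil if action != 'classify' && (category.nil? || category.empty?)

  # Whatever remains is the inline string or URL (empty for uploads).
  { action: action, category: category, payload: rest.join('/') }
end
```

A per-namespace store could then be looked up from the matched prefix, for example through a hash from namespace to store URL.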

Additionally, the various command-line flags could be stored in a config file and read from there, with any value supplied on the terminal overriding the one in the file.
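
One way this precedence (built-in defaults, then config file, then terminal flags) could work is sketched below using only Ruby's standard library. The function name `load_settings`, the YAML file format, and the `DEFAULTS` hash are assumptions made for illustration. Only the flag names `--store`, `--namespace`, and `--port` come from the proposal.

```ruby
require 'optparse'
require 'yaml'

# Built-in defaults (placeholder values for this sketch).
DEFAULTS = { 'store' => 'redis://127.0.0.1:6380/0', 'namespace' => '' }.freeze

# Merge settings with precedence: defaults < config file < CLI flags.
def load_settings(argv, config_path: nil)
  from_file = {}
  if config_path && File.exist?(config_path)
    # An empty YAML file parses to nil (or false), hence the fallback.
    from_file = YAML.safe_load(File.read(config_path)) || {}
  end

  from_cli = {}
  OptionParser.new do |opts|
    opts.on('--store=URL') { |v| from_cli['store'] = v }
    opts.on('--namespace=PATH') { |v| from_cli['namespace'] = v }
    opts.on('--port=N', Integer) { |v| from_cli['port'] = v }
  end.parse(argv)

  # Later merges win, so flags given on the terminal override the file.
  DEFAULTS.merge(from_file).merge(from_cli)
end
```

Only flags that were actually passed end up in `from_cli`, so a value in the config file is kept unless the terminal explicitly overrides it.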

parkr commented 7 years ago

Start small: a simple CLI that can accept arguments and train/untrain/classify. If you find there is a compelling reason to add a web server, then that can be added later. For now, I'd start small and I'd keep the executable in this repo as it provides no added functionality beyond the library's core functions. Branch out once that PoC is done and it has users.

ibnesayeed commented 7 years ago

Note: I missed some important aspects initially, so now I have updated the proposed CLI/server API.

@parkr, I agree that we can start small and branch off later. However, I was worried that unless we make a really toy utility, we will have to use a sophisticated CLI library such as Thor, which would add unnecessary clutter to this gem. As far as the server is concerned, I was only trying to lay out a possible API that could be packaged into a binary. This provides food for thought and helps us architect the application so that it can accommodate these use cases as it evolves.
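
For a sense of how small the first step could be without Thor, here is a sketch of a train/untrain/classify dispatcher that uses nothing beyond core Ruby. The names `read_input` and `run_cli` are hypothetical, and treating `-` as STDIN is a convention assumed here, not something specified in the proposal. URL fetching is left out. The classifier is assumed to respond to `train(category, text)`, `untrain(category, text)`, and `classify(text)`, as `ClassifierReborn::Bayes` does.

```ruby
# Resolve the input argument into text: a missing argument or "-" means
# STDIN, an existing file path is read from disk, anything else is used
# verbatim as the string to process.
def read_input(arg, stdin: $stdin)
  return stdin.read if arg.nil? || arg == '-'
  return File.read(arg) if File.file?(arg)

  arg
end

# Dispatch `train CLASS [INPUT]`, `untrain CLASS [INPUT]`, and
# `classify [INPUT]` to the given classifier.
def run_cli(argv, classifier, stdin: $stdin, out: $stdout)
  command, *args = argv
  case command
  when 'train', 'untrain'
    category = args.shift
    raise ArgumentError, "#{command} needs a class name" if category.nil?

    classifier.public_send(command, category, read_input(args.first, stdin: stdin))
  when 'classify'
    out.puts classifier.classify(read_input(args.first, stdin: stdin))
  else
    raise ArgumentError, "unknown command: #{command.inspect}"
  end
end
```

An executable in the gem would then mostly need to build the classifier with the chosen persistent backend and call `run_cli(ARGV, classifier)`.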