`classify` and `train` commands

jqnatividad commented 2 years ago

The classify command will scan a given column of a CSV file (typically, a free text field) and add tags to a new column based on the chosen pre-trained classifiers created by the train command.

Some good resources:
- https://www.freecodecamp.org/news/implement-naive-bayes-with-rust/
- https://athemathmo.github.io/2016/04/08/naive-bayes-rusty-machine.html
- https://github.com/AtheMathmo/rusty-machine

mhuang74 commented 2 years ago

Hi @jqnatividad, this sounds interesting and probably challenging. Could you please provide some concrete, high-priority use cases for the train command? Which problems are you trying to solve first?

jqnatividad commented 2 years ago

Hi @mhuang74, thanks for taking an interest in this. And yes, what makes it challenging makes it interesting! :)

Right now, I'm working on #189 in a CKAN installation and I'd like to use classify to auto-tag datasets by scanning columns selected by the data publisher (and I'm leveraging the work you did on schema to create Smart Data Dictionaries too!)

Of course, we first need to train tag classifiers to do this. For qsv, I think that means the train command will take some training data for a given tag and store it with some additional classifier metadata (e.g. group, topic, etc.). For the MVP, the store would probably be a sqlite db accessed through Diesel. I'd use Diesel because we may want multiple qsv instances working with one classifier db, which would require a multi-user db like postgres/mysql.

The classify command can then use this training data to scan a given column of another CSV and write candidate tags to another column. Only tags at or above a given confidence threshold (default: 70%) are returned, as tag,confidence pairs sorted by confidence in a semicolon-delimited list (e.g. "renewable energy,0.95;solar power,0.91;photovoltaic,0.9;").
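To make the proposed output format concrete, here is a minimal Rust sketch of how such a tag list could be built. The function name `format_tags` and its signature are hypothetical; nothing like it exists in qsv, and the threshold is passed as a fraction (0.7 for 70%) purely as an assumption.

```rust
/// Hypothetical helper: keep candidates scoring at or above `threshold`,
/// sort them by descending confidence, and join them as "tag,score;" pairs.
fn format_tags(candidates: &[(&str, f64)], threshold: f64) -> String {
    let mut kept: Vec<(&str, f64)> = candidates
        .iter()
        .copied()
        .filter(|&(_, score)| score >= threshold)
        .collect();
    // Highest confidence first; total_cmp gives a total order on f64.
    kept.sort_by(|a, b| b.1.total_cmp(&a.1));
    // Each pair ends with ';', matching the trailing delimiter in the example.
    kept.iter()
        .map(|(tag, score)| format!("{tag},{score};"))
        .collect()
}
```

One caveat of this format: a tag that itself contains a comma or semicolon would need escaping or quoting, which the sketch does not attempt.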

The classify command should have an option to select which classifiers to use - all, a list, grouped by classifier metadata.
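The three selection modes could be modeled roughly as below. This is only a sketch: the `Classifier` struct, its `name` and `group` fields, and the `Selection` enum are all hypothetical names chosen for illustration, not an existing qsv API.

```rust
/// Hypothetical record describing a trained classifier and its metadata.
struct Classifier {
    name: String,
    group: String,
}

/// Hypothetical selection modes: all classifiers, an explicit list of
/// names, or every classifier in a given metadata group.
enum Selection {
    All,
    Names(Vec<String>),
    Group(String),
}

/// Return references to the classifiers matched by `selection`.
fn select<'a>(classifiers: &'a [Classifier], selection: &Selection) -> Vec<&'a Classifier> {
    classifiers
        .iter()
        .filter(|c| match selection {
            Selection::All => true,
            Selection::Names(names) => names.contains(&c.name),
            Selection::Group(group) => &c.group == group,
        })
        .collect()
}
```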

WDYT?

mhuang74 commented 2 years ago

Okay, sounds like we want to do fast multi-label text classification.

Agree that managing training data and the training process is important, but can it be simplified down to:

- provide a single CSV with both input (col A) and manually tagged labels (col B), and just run `qsv classify --train mydata_train.csv --select A --tag B --model mymodel`
- each time it runs, some rows would be randomly withheld from training and used for validation, and it would run some iterations and print out validation scores along the way so the user can decide how many iterations to run
- then run classification on new data in column C and output to a new column via `qsv classify --model mymodel newdata.csv --select C --new-column Category`
- everything is done with files at first, without the need for a database; later, store models on a model hub (just store them publicly on github instead of in a local db?)
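
The random holdout step described above could look roughly like the following Rust sketch. It is an illustration under assumptions: `holdout_split`, its parameters, and the tiny xorshift generator are hypothetical and only there to keep the example dependency-free; a real implementation would more likely use an established crate for randomness.

```rust
/// Minimal xorshift64 generator, used here only to avoid external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Shuffle row indices and withhold a fraction of them for validation.
/// Returns (training indices, validation indices).
fn holdout_split(n_rows: usize, validation_fraction: f64, seed: u64) -> (Vec<usize>, Vec<usize>) {
    // xorshift must not start from an all-zero state.
    let mut rng = XorShift64(seed.max(1));
    let mut indices: Vec<usize> = (0..n_rows).collect();
    // Fisher-Yates shuffle (the modulo introduces a slight bias,
    // which is acceptable for a sketch).
    for i in (1..n_rows).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        indices.swap(i, j);
    }
    let n_validation = ((n_rows as f64) * validation_fraction).round() as usize;
    let n_validation = n_validation.min(n_rows);
    // The tail of the shuffled indices becomes the validation set.
    let validation = indices.split_off(n_rows - n_validation);
    (indices, validation)
}
```

Using a fixed seed makes a split reproducible, while varying the seed per run gives the "different rows withheld each time" behavior from the proposal.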

And the omikuji crate that implements the Parabel Partitioned Label Trees also looks quite attractive.

Thoughts?

jqnatividad commented 2 years ago

That is a better approach.

And yeah, no DB is good too - storing on github will certainly work. In future iterations, we can even look into using CKAN as the model hub, as we can then add arbitrary metadata/grouping to it.

omikuji is a great find! It's a little bit dense for me to totally grok right now though, but it seems tailor-made for the job.

Shall I assign it to you @mhuang74 !? 😉

mhuang74 commented 2 years ago

@jqnatividad Yes, please. Having good training data is important. Could you please suggest a few good labeled datasets?

jqnatividad commented 2 years ago

Great!

As for training data, can't we use the Extreme Classification Repository (http://manikvarma.org/downloads/XC/XMLRepository.html#provenance) for that?

You may also want to check https://www.climatetagger.net/ and use their API for testing, and perhaps, to create training data as well. FYI, their CKAN climate tagger is not that useful as it only scans dataset metadata, not the dataset.

github-actions[bot] commented 1 year ago

Stale issue message

mhuang74 commented 1 year ago

Okay, sounds like we want to do fast multi-label text classification.

Agree that managing training data and the training process is important, but can it be simplified down to:

- provide a single CSV with both input (col A) and manually tagged labels (col B), and just run `qsv train mydata_train.csv --select A --tag B --model mymodel`
- each time it runs, some rows would be randomly withheld from training and used for validation, and it would run some iterations and print out validation scores along the way so the user can decide how many iterations to run
- then run classification on new data in column C and output to a new column via `qsv classify --model mymodel newdata.csv --select C --new-column Category`

Thoughts?

(Sent by email in reply to Joel Natividad's comment of Thu, Mar 17, 2022, 8:06 PM.)


jqnatividad commented 1 year ago

Ooops! Sorry @mhuang74 I missed this reply!

And yeah, your approach sounds sensible.

Hopefully, you still have time to implement this. Your past contributions have been valuable additions to qsv and we're actively using them in production pipelines!

P.S. I stumbled on your blogpost too! 😄

https://michaelhuang.xyz/posts/qsv-lessons/

The quality of the code belies your "newbieness" and I learned a lot from it too!

github-actions[bot] commented 1 year ago

Stale issue message

jqnatividad / qsv

`classify` and `train` commands #188