Closed jqnatividad closed 1 year ago
Hi @jqnatividad, this sounds interesting and probably challenging. Could you please provide some concrete priority use cases for the train command? Which problems are you trying to solve first?
Hi @mhuang74, thanks for taking an interest in this. And yes, what makes it challenging makes it interesting! :)
Right now, I'm working on #189 in a CKAN installation and I'd like to use classify to auto-tag datasets by scanning columns selected by the data publisher (and I'm leveraging the work you did on schema to create Smart Data Dictionaries too!)
Of course, we first need to train tag classifiers to do this. For qsv, I think that means the train command will take some training data for a given tag and store it with some additional classifier metadata (e.g. group, topic, etc.). For the MVP, the store would probably be a sqlite db accessed through Diesel. I'd use Diesel since we may want multiple qsv instances working with one classifier db, which will require a multi-user db like postgres/mysql.
The classify command can then use this training data to scan a given column of another CSV and write candidate tags to another column. The tags are returned as a semicolon-delimited list of tag,confidence pairs, sorted by confidence and limited to those at or above a confidence threshold (default 70%), e.g. "renewable energy,0.95;solar power,0.91;photovoltaic,0.9;"
The classify command should have an option to select which classifiers to use: all, a list, or a group defined by classifier metadata.
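To make the proposed output format concrete, here is a minimal Rust sketch of how the tag column value could be built. The function name `format_candidate_tags` and its signature are hypothetical, not part of qsv. It only assumes the format described above: keep tags at or above the threshold, sort by descending confidence, and join as tag,confidence pairs separated by semicolons.

```rust
/// Hypothetical helper (not a qsv API): build the value of the tag column.
/// Keeps candidates whose confidence is at or above `threshold`, sorts them
/// by descending confidence, and joins them as "tag,confidence;" pairs.
fn format_candidate_tags(candidates: &[(&str, f64)], threshold: f64) -> String {
    let mut kept: Vec<(&str, f64)> = candidates
        .iter()
        .copied()
        .filter(|&(_, confidence)| confidence >= threshold)
        .collect();

    // total_cmp gives a total order on f64; comparing b to a sorts descending.
    kept.sort_by(|a, b| b.1.total_cmp(&a.1));

    kept.iter()
        .map(|(tag, confidence)| format!("{tag},{confidence};"))
        .collect()
}
```

Calling it with the candidates ("photovoltaic", 0.9), ("wind", 0.4), ("renewable energy", 0.95) and ("solar power", 0.91) at a threshold of 0.7 drops "wind" and yields the example string quoted above.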
WDYT?
Okay, sounds like we want to do fast multi-label text classification.
Agree that managing training data and the training process is important, but can it be simplified down to:
qsv classify --train mydata_train.csv --select A --tag B --model mymodel
qsv classify --model mymodel newdata.csv --select C --new-column Category
And the omikuji crate that implements the Parabel Partitioned Label Trees also looks quite attractive.
Thoughts?
That is a better approach.
And yeah, no DB is good too - storing on github will certainly work. In future iterations, we can even look into using CKAN as the model hub, as we can then add arbitrary metadata/grouping to it.
omikuji is a great find! It's a little bit dense for me to totally grok right now though, but it seems tailor-made for the job.
Shall I assign it to you @mhuang74 !? 😉
@jqnatividad Yes, please. Having good training data is important. Could you please suggest a few good labeled datasets?
Great!
As for training data, can't we use the Extreme Classification Repository (http://manikvarma.org/downloads/XC/XMLRepository.html#provenance) for that?
You may also want to check https://www.climatetagger.net/ and use their API for testing, and perhaps, to create training data as well. FYI, their CKAN climate tagger is not that useful as it only scans dataset metadata, not the dataset.
Stale issue message
Okay, sounds like we want to do fast multi-label text classification.
Agree that managing training data and the training process is important, but can it be simplified down to:
qsv train mydata_train.csv --select A --tag B --model mymodel
qsv classify --model mymodel newdata.csv --select C --new-column Category
Thoughts?
-- Regards, Michael
As for me, I will always have hope; I will praise you more and more.
Ooops! Sorry @mhuang74 I missed this reply!
And yeah, your approach sounds sensible.
Hopefully, you still have time to implement this. Your past contributions have been valuable additions to qsv and we're actively using them in production pipelines!
P.S. I stumbled on your blogpost too! 😄
https://michaelhuang.xyz/posts/qsv-lessons/
The quality of the code belies your "newbieness" and I learned a lot from it too!
The classify command will scan a given column of a CSV file (typically, a free text field) and add tags to a new column based on the chosen pre-trained classifiers created by the train command.
Some good resources:
https://www.freecodecamp.org/news/implement-naive-bayes-with-rust/
https://athemathmo.github.io/2016/04/08/naive-bayes-rusty-machine.html
https://github.com/AtheMathmo/rusty-machine
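Since the resources above centre on naive Bayes, here is a rough, std-only Rust sketch of the train-then-classify idea. It is an illustration under stated assumptions, not qsv's implementation and not the rusty-machine API: the `NaiveBayes` type, its methods, and the `tokenize` helper are all hypothetical. It uses multinomial naive Bayes with add-one (Laplace) smoothing and returns only the single most probable tag, whereas the proposal in this issue calls for multi-label output.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical minimal multinomial naive Bayes tagger (std only).
#[derive(Default)]
struct NaiveBayes {
    doc_counts: HashMap<String, usize>,                   // training rows per tag
    word_counts: HashMap<String, HashMap<String, usize>>, // tag -> word -> count
    total_words: HashMap<String, usize>,                  // word total per tag
    vocab: HashSet<String>,                               // all words seen
    total_docs: usize,
}

/// Lowercase the text and split it on non-alphanumeric characters.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

impl NaiveBayes {
    /// Record one labelled example (the "train" step).
    fn train(&mut self, text: &str, tag: &str) {
        *self.doc_counts.entry(tag.to_string()).or_insert(0) += 1;
        self.total_docs += 1;
        let counts = self.word_counts.entry(tag.to_string()).or_default();
        for word in tokenize(text) {
            *counts.entry(word.clone()).or_insert(0) += 1;
            *self.total_words.entry(tag.to_string()).or_insert(0) += 1;
            self.vocab.insert(word);
        }
    }

    /// Return the most probable tag for `text`, or None if nothing was trained.
    fn classify(&self, text: &str) -> Option<String> {
        let words = tokenize(text);
        let vocab_size = self.vocab.len() as f64;
        self.doc_counts
            .iter()
            .map(|(tag, &n_docs)| {
                // log P(tag), estimated from the share of training rows
                let prior = (n_docs as f64 / self.total_docs as f64).ln();
                let total = *self.total_words.get(tag).unwrap_or(&0) as f64;
                let counts = self.word_counts.get(tag);
                // sum of log P(word | tag) with add-one smoothing
                let likelihood: f64 = words
                    .iter()
                    .map(|w| {
                        let c = counts.and_then(|m| m.get(w)).copied().unwrap_or(0) as f64;
                        ((c + 1.0) / (total + vocab_size)).ln()
                    })
                    .sum();
                (tag, prior + likelihood)
            })
            .max_by(|a, b| a.1.total_cmp(&b.1))
            .map(|(tag, _)| tag.clone())
    }
}
```

A model trained on one row per tag, say "solar panels photovoltaic output" for "solar power" and "wind turbine capacity" for "wind power", would classify "photovoltaic panels" as "solar power". Turning the per-tag scores into the thresholded, ranked list discussed in this issue would need normalised probabilities or a dedicated multi-label method such as the one in omikuji.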