Closed merveenoyan closed 2 years ago
Copy paste from internal discussions
In transformers for classification you get a float, as for regression btw. My only concern is for models on the Hub for regression that are not marked as regression but classification (because it’s the same pipeline)
transformers pipeline == tag in the Hub for type == widget that is shown to user == Inference API (both internal and community) endpoint that is called in the server == tag used by AutoNLP when exporting == tasks.ts file in hub-docs == task page == .... So renaming things has an impact in many places
Adding @LysandreJik @sgugger @Narsil @abhishekkrthakur as doing this would impact all systems
TL;DR. Do we want to split tabular-classification into tabular-classification and tabular-regression?
Browsing internal discussions from some months ago https://huggingface.slack.com/archives/C032RD1Q68L/p1647448299982259?thread_ts=1647353473.766889&cid=C032RD1Q68L, I think that's what we wanted to do, but I think we might have lost it in the alignment of the systems https://github.com/huggingface/datasets/pull/4066.
If we do decide to do this, we will need to update:
- api-inference-community, especially the mappings such as the validation one https://github.com/huggingface/api-inference-community/blob/main/api_inference_community/validation.py#L180. (Currently broken since structured-... is not recognized as a type anymore.)
- api-inference https://github.com/huggingface/api-inference/search?q=structured
Question on the 2 above: should we just reuse the same pipeline under the hood, at least for now?
If we split tabular-classification into tabular-classification and tabular-regression, we should also split text-classification, speech-classification and image-classification, otherwise users will get lost. This is something that can't be done easily in Transformers for backward compatibility reasons, so you will then get something that is inconsistent across the ecosystem, at least until the next major release of Transformers.
Also if you start splitting, why stop there? Single-label classification and multi-label classification are as different from one another as classification is from regression.
I would advocate to split tabular-classification into tabular-classification and tabular-regression
I might be wrong but i feel like text/audio/image regression is fairly niche, whereas tabular regression, like @merveenoyan said, is basically what you first learn about when starting ML
- Classification: in output you get a categorical variable (type “object” in python)
- Regression: you get a numerical variable
- Above two can be handled with the same widget, where you could output them as strings
BTW the classification widget is not simply the most probable output class, but the distribution of class probabilities:
whereas the regression one would probably be just a number:
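To make the contrast concrete, here is a minimal sketch of the two output shapes and of a single renderer that turns both into strings, as suggested in the list above. The function name, the `label`/`score` keys and the sample values are assumptions for illustration, not the actual widget or Inference API schema.

```python
from typing import Dict, List, Union

# Hypothetical output shapes, assumed for illustration only.
ClassificationOutput = List[Dict[str, Union[str, float]]]  # one entry per class
RegressionOutput = float


def render_output(output: Union[ClassificationOutput, RegressionOutput]) -> List[str]:
    """Render either kind of output as display strings for one shared widget."""
    if isinstance(output, (int, float)):
        # Regression: a single number.
        return [f"{output:.2f}"]
    # Classification: the full distribution, most probable class first.
    ranked = sorted(output, key=lambda item: item["score"], reverse=True)
    return [f"{item['label']}: {item['score']:.2f}" for item in ranked]


# Made-up example values.
classification = [
    {"label": "good", "score": 0.2},
    {"label": "bad", "score": 0.7},
    {"label": "average", "score": 0.1},
]
regression = 5.63

print(render_output(classification))
print(render_output(regression))
```

The classification branch keeps every class rather than only the top one, which is the difference from a plain "most probable class" display.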
(though i admit i'm not sure what the structured-data-classification widget was like, @osanseviero – on https://huggingface.co/osanseviero/wine-quality for instance – i think it was a "table" input and we performed inference on multiple rows and it filled the last column, no?)
Yes, exactly. The widget was pre-filled thanks to https://huggingface.co/osanseviero/wine-quality/blob/main/README.md#L8, which means that model uploaders have an extra responsibility in adding the metadata to give a correct example. You can see an example (although not running) at https://huggingface-widgets.netlify.app/
i think it was a "table" input and we performed inference on multiple rows and it filled the last column, no?)
Yes.
I do agree that regression vs classification is not as big of a difference as courses make it out to be. classification semantically means we're interested in something like N (the set) and regression something like R, but to reach a decision, classification always uses an R output with a threshold anyway (because gradients). Even the widget shows that: our results for classification are displayed in R (continuous values for each class).
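The point that a classifier produces real-valued scores and only becomes discrete at the decision step can be sketched in a few lines. This is a generic illustration with made-up scores and hypothetical function names, not code from any pipeline; it uses an argmax, which plays the role of the threshold when there are more than two classes.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Map real-valued scores to probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def decide(probabilities: Sequence[float], labels: Sequence[str]) -> str:
    """The only discrete step: pick the label with the highest probability."""
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best]


labels = ["low", "medium", "high"]
raw_scores = [0.5, 2.0, -1.0]  # made-up model outputs, all real numbers

probabilities = softmax(raw_scores)  # still continuous: one value per class
prediction = decide(probabilities, labels)  # discrete only after the argmax
print(probabilities, prediction)
```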
In terms of what we should do, I don't think either option is bad (status quo vs split), we just need to choose and be consistent.
we just need to choose and be consistent
do you mean consistent across our stack of tools (i'm in favor) or consistent across modalities (i personally don't think we need to)?
our stack of tools (i'm in favor)
This one
At the least, we should have a general name that is not about classification (in regression you don't classify things; you extrapolate to the next value or interpolate between values, and you don't have class probabilities whatsoever because you don't have classes). It doesn't necessarily have to be split into two or anything else, but my concern is that I seriously didn't know the "structured data classification" pipeline/widget could do regression as well (classification is a very distinct term in the data science world; nobody will think of regression when you say classification). It's confusing for users and not good for the visibility of our work in this area. We should at least rename it to something more general, as @adrinjalali suggested. For clustering, it depends on the clustering type: you will usually get a plot in reduced dimensions, or a dendrogram if it's a hierarchical method, but that is a different discussion.
So if we engineer over input/output, you always input a dataset, there's no objection to that. Classification: you get class labels + their probabilities. You can also return only the best result with no probability, if you want to simplify and put it together with regression, and show the class label in the widget output. Regression: you get a number. Clustering: you should get a dendrogram plot, or a plot of the data points with dimensions reduced with PCA.
Aside from engineering, my main concern here is DX. The bare minimum we can do is to come up with a better name that signals that this task covers both classification and regression. The potential data science persona definitely thinks there's a big distinction between regression and classification, even if courses make it look bigger than it is: they learn it there; I learnt it at university, and the first thing taught to me was the difference. Even when you go to scikit-learn's main website there are three categories: classification, regression and clustering. It's straightforward: the problems differ mainly according to what you're trying to predict, and the algorithms divide the same way. If we want to enable better search for people, it's bad that they use one filter for two tasks and have to search a lot (because for these tasks we don't have abbreviations in data science; e.g. in text classification you have multiple abbreviations like mnli and cola for different families of tasks, but it's not the same here, the data is too specific).
This to me is more a matter of DX than a technical one. If we want to serve more user personas, we should be meeting them where they are.
In NLP or image, people might not be doing much regression, but statistical modellers do.
As a developer of a model, I should be able to speak with the hub in the language that I'm comfortable with, and if that means I need to tell the hub that my model is a clustering one, a classifier, or a regressor, then I think the hub should understand that.
How we implement that in the backend is a different question. We can choose to reflect what we expose to users, or we can choose to call everything a classifier (which I personally really wouldn't, but I'm not the one developing those parts).
When I first saw our pipelines, I noticed we don't have regression, and I figured that might only be because we're gonna do classification first and we do regression later, otherwise I think we very much need the terminology to be exposed to users.
Side note, given the number of questions on the forum asking how to do multi-label classification or regression with text models, I don't think anyone can say that "no one in NLP is doing regression". Same for computer vision, this is a whole lesson of its own in the fastai course for instance.
we should be meeting them where they are.
100% agree, but fortunately we can definitely have multiple names for the same thing. So we could 100% have a single pipeline (defined by I/O) and multiple tasks (specific naming of the pipeline containing information about what is the purpose of the pipeline). I put a section on vocabulary: https://www.notion.so/huggingface2/Pipelines-Guidelines-1de828d4e56e4adb886253440657e13a where I attempted to define both
In transformers, ner and token-classification are actually the same thing. But users expecting a ner model can still use it without even knowing we're using the same code for both.
Maybe we should extend that idea to the hub, where we could also have aliases? Right now, if I remember correctly, we simply advise to use tags for discoverability and keep a single "pipeline_tag" name.
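A minimal sketch of what such aliasing could look like: several user-facing task names resolve to one canonical pipeline name. The ner entry mirrors the transformers behaviour described above; the tabular-regression entry and all function and variable names are hypothetical, since this thread does not describe any existing alias mechanism on the Hub.

```python
# Hypothetical alias table: user-facing task name -> canonical pipeline name.
TASK_ALIASES = {
    "ner": "token-classification",  # same code path in transformers, per the comment above
    "tabular-regression": "tabular-classification",  # assumption: reuse one pipeline for now
}


def resolve_task(task: str) -> str:
    """Return the canonical pipeline name for a task, following at most one alias."""
    return TASK_ALIASES.get(task, task)


for name in ("ner", "token-classification", "tabular-regression"):
    print(f"{name} -> {resolve_task(name)}")
```

With this shape, the user-facing vocabulary can grow without touching the pipeline code, which is the single pipeline (defined by I/O) with multiple tasks idea from the comment above.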
100% agree, but fortunately we can definitely have multiple names for the same thing.
That's pretty much also what I was thinking.
100% agree, but fortunately we can definitely have multiple names for the same thing.
I second this, so many people get confused over the names :)
What's the final agreement on this discussion? Given it has been a couple of weeks since it started, it would be nice to reach some consensus since this implies changing a bunch of tags (but we now have PRs on the Hub! :fire:)
consensus is let's split tabular-classification into tabular-classification and tabular-regression, no?
SG! @merveenoyan would you like to take a stab at the action items from https://github.com/huggingface/hub-docs/issues/137#issuecomment-1122829619?
Sure! @osanseviero
Closing this as everything's done.
I don’t know if it’s the right place to discuss this, but I kinda have an objection about tabular tasks. I recently opened a PR to rename structured data classification to tabular classification, see here. If we are going to invest in this, I'd rather either not change the name of this pipeline for now or find something that covers regression as well (see below).
My main concern is that I looked at structured-data-classification and thought regression couldn't be done with this. First thing you learn in ML101 is the difference between the two, it's too fundamental imo yet can be fixed with a small change.
The taxonomy according to outputs should be like this:
These are three main task types, I wanted to open this discussion to everyone before moving on. So I see three ways:
- We can have two different ones: one will output str and the other will output float or int. This is too much work.
- We can have two different things with different names referring to the same object/widget, for better visibility.
- We can come up with a name that will cover both (e.g. tabular-classification as suggested by @adrinjalali).
Pinging @lhoestq @osanseviero @julien-c