huggingface / hub-docs

Docs of the Hugging Face Hub
http://hf.co/docs/hub
Apache License 2.0

Discussion around tabular data related taxonomy #137

Closed merveenoyan closed 2 years ago

merveenoyan commented 2 years ago

I don’t know if this is the right place to discuss this, but I have an objection regarding tabular tasks. I recently opened a PR to rename structured data classification to tabular classification, see here. If we are going to invest in this, I would rather either not change the name of this pipeline for now, or find a name that covers regression as well (see below).

My main concern is that I looked at structured-data-classification and thought regression couldn't be done with it. The first thing you learn in ML 101 is the difference between the two; it's too fundamental imo, yet it can be fixed with a small change.

The taxonomy according to outputs should be like this:

These are the three main task types. I wanted to open this discussion to everyone before moving on. I see three ways:
  • We can have two different ones: one outputs str and the other outputs float or int. This is too much work.
  • We can have two things with different names that refer to the same object/widget, for better visibility.
  • We can come up with a name that covers both (e.g. tabular-classification, as suggested by @adrinjalali).

Pinging @lhoestq @osanseviero @julien-c

osanseviero commented 2 years ago

Copy paste from internal discussions

In transformers, for classification you get a float, same as for regression btw. My only concern is for models on the Hub that do regression but are marked as classification rather than regression (because it’s the same pipeline).

transformers pipeline == tag in the Hub for type == widget that is shown to user == Inference API (both internal and community) endpoint that is called in the server == tag used by AutoNLP when exporting == tasks.ts file in hub-docs == task page == .... So renaming things has an impact in many places

osanseviero commented 2 years ago

Adding @LysandreJik @sgugger @Narsil @abhishekkrthakur as doing this would impact all systems

TL;DR. Do we want to split tabular-classification into tabular-classification and tabular-regression?

Browsing internal discussions from some months ago https://huggingface.slack.com/archives/C032RD1Q68L/p1647448299982259?thread_ts=1647353473.766889&cid=C032RD1Q68L, I think that's what we wanted to do, but I think we might have lost it in the alignment of the systems https://github.com/huggingface/datasets/pull/4066.

If we do decide to do this, we will need to

sgugger commented 2 years ago

If we split tabular-classification into tabular-classification and tabular-regression, we should also split text-classification, speech-classification and image-classification, otherwise users will get lost. This can't be done easily in Transformers for backward compatibility reasons, so you will end up with something that is inconsistent across the ecosystem, at least until the next major release of Transformers.

Also, if you start splitting, why stop there? Single-label classification and multi-label classification are as different from one another as classification and regression are.

julien-c commented 2 years ago

I would advocate to split tabular-classification into tabular-classification and tabular-regression

I might be wrong but i feel like text/audio/image regression is fairly niche, whereas tabular regression, like @merveenoyan said, is basically what you first learn about when starting ML

julien-c commented 2 years ago
  • Classification: in output you get a categorical variable (type “object” in python)
  • Regression: you get a numerical variable
  • Above two can be handled with the same widget, where you could output them as strings

BTW the classification widget is not simply the most probable output class, but the distribution of class probabilities:

image

whereas the regression one would probably be just a number:

image
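
To make the two output shapes concrete, here is a minimal Python sketch. The function names and the dictionary layout are illustrative assumptions, not the actual widget or Inference API schema.

```python
def classification_widget_output(class_probabilities):
    """Return every class with its probability, most probable first.

    `class_probabilities` maps a label to a probability; the whole
    distribution is shown, not just the top class.
    """
    items = [{"label": label, "score": score}
             for label, score in class_probabilities.items()]
    return sorted(items, key=lambda item: item["score"], reverse=True)


def regression_widget_output(prediction):
    """A regression result is just a single number."""
    return float(prediction)


# Made-up values, purely for illustration.
distribution = classification_widget_output(
    {"average": 0.25, "good": 0.5, "bad": 0.25}
)
number = regression_widget_output(5.6)
```

Both results could still be rendered by one widget if everything is converted to strings, as suggested above.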
julien-c commented 2 years ago

(though i admit i'm not sure what the structured-data-classification widget was like, @osanseviero – on https://huggingface.co/osanseviero/wine-quality for instance – i think it was a "table" input and we performed inference on multiple rows and it filled the last column, no?)

osanseviero commented 2 years ago

(though i admit i'm not sure what the structured-data-classification widget was like, @osanseviero – on https://huggingface.co/osanseviero/wine-quality for instance – i think it was a "table" input and we performed inference on multiple rows and it filled the last column, no?)

Yes, exactly. The widget was pre-filled thanks to https://huggingface.co/osanseviero/wine-quality/blob/main/README.md#L8, which means that model uploaders have an extra responsibility in adding the metadata to give a correct example. You can see an example (although not running) at https://huggingface-widgets.netlify.app/
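
As a rough illustration of the "table in, last column filled" behaviour, here is a minimal Python sketch. It assumes the example data is column-oriented; the column names, the values and the stand-in `toy_predict` function are hypothetical and are not the real widget code or the real wine-quality model.

```python
def fill_last_column(table, predict, target="prediction"):
    """Run `predict` on every row and append the results as a new last column.

    `table` maps a column name to a list of values (one entry per row).
    """
    columns = list(table)
    n_rows = len(table[columns[0]])
    rows = [{name: table[name][i] for name in columns} for i in range(n_rows)]
    filled = dict(table)  # shallow copy so the input table is not modified
    filled[target] = [predict(row) for row in rows]
    return filled


# Hypothetical example data, as a model uploader might provide it.
example = {"fixed_acidity": [7.4, 7.8], "alcohol": [9.4, 9.8]}


def toy_predict(row):
    # Stand-in for a real model: a single hand-written rule.
    return "good" if row["alcohol"] > 9.5 else "average"


result = fill_last_column(example, toy_predict)
```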

Narsil commented 2 years ago

i think it was a "table" input and we performed inference on multiple rows and it filled the last column, no?)

Yes.

I do agree that regression vs classification is not as big of a difference as courses make it out to be. Classification semantically means we're interested in something like N (the set), and regression in something like R, but to reach a decision, classification always uses an R output with a threshold anyway (because gradients). Even the widget shows that: our results for classification are displayed as R (continuous values for each class).

In terms of what we should do, I don't think either option is bad (status quo vs split), we just need to choose and be consistent.
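
The point that classification is a real-valued output plus a threshold can be sketched in a few lines of Python. This is a minimal illustration for the binary case; the function names, the labels and the 0.5 threshold are assumptions chosen for the example.

```python
import math


def sigmoid(x):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def as_regression(score):
    """Regression: the real-valued model output is the answer."""
    return score


def as_classification(score, threshold=0.5):
    """Classification: the same real-valued output, squashed and thresholded."""
    probability = sigmoid(score)
    label = "positive" if probability >= threshold else "negative"
    return label, probability


raw_score = 1.2  # made-up model output
regressed = as_regression(raw_score)
label, probability = as_classification(raw_score)
```

The model output is identical in both cases; only the interpretation applied on top of it differs.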

julien-c commented 2 years ago

we just need to choose and be consistent

do you mean consistent across our stack of tools (i'm in favor) or consistent across modalities (i personally don't think we need to)

Narsil commented 2 years ago

our stack of tools (i'm in favor)

This one

merveenoyan commented 2 years ago

At least we should have a general name that is not about classification. In regression you don't classify things; you extrapolate to the next thing or interpolate to things in between, and you don't have class probabilities whatsoever because you don't have classes. It doesn't necessarily have to be split into two or anything else, but my concern is that I seriously didn't know the "structured data classification" pipeline/widget could do regression as well. Classification is a very distinct term in the data science world, and nobody will think of regression when you say classification. It's confusing for users and not good for the visibility of our work in this area. We should at least rename it to something more general, as @adrinjalali suggested. For clustering, it depends on the clustering type: you will usually get a plot of reduced dimensions, or a dendrogram if it's a hierarchical method, but that is a different discussion.

So if we engineer over input/output: you always input a dataset, there's no objection to that. The outputs differ:
  • Classification: you get class labels and their probabilities. If you want to simplify and put it together with regression, you can also return just the best result with no probability and put the class label on the widget output.
  • Regression: you get a number.
  • Clustering: you should get a dendrogram plot, or a plot of the data points with dimensions reduced with PCA.

Aside from engineering, my main concern here is DX. The bare minimum we can do is come up with a better name that signals that this task covers both classification and regression. The potential data science persona definitely thinks there's a big distinction between regression and classification. Even if courses make the distinction look bigger than it is, that is where people learn it; I learnt it at university, and the difference was the first thing taught to me. Even when you go to scikit-learn's main website, you see classification, regression and clustering as separate categories. It's straightforward: the problems differ mainly according to what you're trying to predict, and the algorithms divide the same way. If we want to enable better search for people, it's bad that they have to use one filter for two tasks and search a lot. (For tasks we don't have abbreviations in data science. In text classification you have multiple abbreviations, like mnli and cola, for different families of tasks, but that is not the case here because the data is too specific.)

adrinjalali commented 2 years ago

This to me is more a matter of DX than a technical one. If we want to serve more user personas, we should be meeting them where they are.

In NLP or image, people might not be doing much regression, but statistical modellers do.

As a developer of a model, I should be able to speak with the hub in the language that I'm comfortable with, and if that means I need to tell the hub that my model is a clustering one, a classifier, or a regressor, then I think the hub should understand that.

How we implement that in the backend is a different question. We can choose to reflect what we expose to users, or we can choose to call everything a classifier (which I personally really wouldn't, but I'm not the one developing those parts).

When I first saw our pipelines, I noticed we don't have regression, and I figured that might only be because we're gonna do classification first and regression later. Otherwise, I think we very much need the terminology to be exposed to users.

sgugger commented 2 years ago

Side note: given the number of questions on the forum asking how to do multi-label classification or regression with text models, I don't think anyone can say that "no one in NLP is doing regression". Same for computer vision, this is a whole lesson of its own in the fastai course for instance.

Narsil commented 2 years ago

we should be meeting them where they are.

100% agree, but fortunately we can definitely have multiple names for the same thing. So we could 100% have a single pipeline (defined by I/O) and multiple tasks (specific naming of the pipeline containing information about what is the purpose of the pipeline). I put a section on vocabulary: https://www.notion.so/huggingface2/Pipelines-Guidelines-1de828d4e56e4adb886253440657e13a where I attempted to define both

In transformers, ner and token-classification are actually the same thing. But users expecting a ner model can still use it without even knowing we're using the same code for both.

Maybe we should extend that idea to the hub, where we could also have aliases? Right now, if I remember correctly, we simply advise using tags for discoverability and keeping a single "pipeline_tag" name.
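
A minimal Python sketch of what such aliasing could look like: several task names resolving to one pipeline defined by its I/O. The ner entry mirrors the transformers behaviour mentioned above; the `tabular` pipeline name and the `resolve_pipeline` helper are hypothetical and only serve as an illustration.

```python
# Task name (what users search for) -> pipeline name (defined by I/O).
TASK_ALIASES = {
    "ner": "token-classification",        # same code path in transformers
    "tabular-classification": "tabular",  # hypothetical shared pipeline
    "tabular-regression": "tabular",      # hypothetical shared pipeline
}


def resolve_pipeline(task):
    """Return the pipeline that serves `task`; unknown names map to themselves."""
    return TASK_ALIASES.get(task, task)


same_backend = (
    resolve_pipeline("tabular-classification")
    == resolve_pipeline("tabular-regression")
)
```

With a table like this, users keep the vocabulary they know while the backend maintains a single implementation per I/O shape.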

adrinjalali commented 2 years ago

100% agree, but fortunately we can definitely have multiple names for the same thing.

That's pretty much also what I was thinking.

merveenoyan commented 2 years ago

100% agree, but fortunately we can definitely have multiple names for the same thing.

I second this, so many people get confused over the names :)

osanseviero commented 2 years ago

What's the final agreement on this discussion? Given it has been a couple of weeks since it started, it would be nice to reach some consensus, since this implies changing a bunch of tags (but we now have PRs on the Hub! :fire:)

julien-c commented 2 years ago

consensus is let's split tabular-classification into tabular-classification and tabular-regression no?

osanseviero commented 2 years ago

SG! @merveenoyan would you like to take a stab at the action items from https://github.com/huggingface/hub-docs/issues/137#issuecomment-1122829619?

merveenoyan commented 2 years ago

Sure! @osanseviero

merveenoyan commented 2 years ago

Closing this as everything's done.