scikit-adaptation / skada

Domain adaptation toolbox compatible with scikit-learn and pytorch
https://scikit-adaptation.github.io/
BSD 3-Clause "New" or "Revised" License

Consider adding tabular dataset #19

Open kachayev opened 7 months ago

kachayev commented 7 months ago

For example, the one with personal/business flights I've been experimenting with.

It would be nice to have more than computer vision (CV) datasets provided out of the box.

YanisLalou commented 5 months ago

Just to be clear: the goal here is to create a file like _office.py to download and process a tabular dataset, right?

tgnassou commented 5 months ago

Exactly, but I think we want an easy one, ideally a standard dataset that is widely used in the literature. Maybe we should check the papers to find the most popular one. Later on, we will add more complex datasets to a bench_skada repo.

kachayev commented 5 months ago

I would say that, first off, we need to pick a suitable tabular dataset for Domain Adaptation (DA). I've had some preliminary results with this dataset: Airline Passenger Satisfaction on Kaggle. It's perfect for our needs because you can easily differentiate between personal and business flights, giving us a clear source vs. target scenario.
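A minimal sketch of the source vs. target split described above, assuming the table is loaded into a pandas DataFrame. The column names (`travel_type`, `satisfied`, and the feature columns) and the toy rows are hypothetical stand-ins, not the actual Kaggle schema or data.

```python
import pandas as pd

# Toy stand-in for the real table. All column names and values are
# hypothetical; the real dataset would be read with pd.read_csv instead.
df = pd.DataFrame({
    "travel_type": ["business", "personal", "business",
                    "personal", "business", "personal"],
    "flight_distance": [2100, 350, 1800, 420, 2500, 610],
    "seat_comfort": [4, 2, 5, 3, 4, 1],
    "satisfied": [1, 0, 1, 0, 1, 0],
})


def split_by_domain(frame, domain_col, source_value, label_col):
    """Split one table into source and target domains on a single column.

    Rows where `domain_col == source_value` form the source domain,
    all remaining rows form the target domain.
    """
    is_source = frame[domain_col] == source_value
    # The domain column defines the split, so it is dropped from the features.
    features = frame.drop(columns=[domain_col, label_col])
    labels = frame[label_col]
    return (features[is_source], labels[is_source],
            features[~is_source], labels[~is_source])


X_source, y_source, X_target, y_target = split_by_domain(
    df, domain_col="travel_type", source_value="business", label_col="satisfied"
)
print(X_source.shape, X_target.shape)
```

Which travel type plays the role of the source is a free choice here; swapping `source_value` gives the reverse adaptation task.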

Next up, let's create a concise tutorial. The goal here is to demonstrate how the performance of a classifier trained on one domain tends to decline when applied to another, and how to improve it using DA techniques. At this stage there's no need to worry about dataset processing; we're talking about maybe 10-20 lines of code to download and clean up the dataset.
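To make the tutorial idea above concrete, here is a rough sketch on synthetic data, using plain scikit-learn rather than any skada API. It assumes a covariate-shift setting (same labeling rule, shifted inputs) and uses importance reweighting via a domain classifier as one example of a DA technique; the data, the labeling rule, and all variable names are illustrative assumptions, and the exact scores depend on the random seed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)


def label(X):
    # Same labeling rule in both domains: only the inputs shift.
    return (X[:, 1] > X[:, 0] ** 2 - 1).astype(int)


# Synthetic stand-ins for the two domains (assumption: not real data).
n = 2000
X_src = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2))
X_src_test = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2))
X_tgt = rng.normal(loc=[1.5, 1.25], scale=0.75, size=(n, 2))
X_tgt_test = rng.normal(loc=[1.5, 1.25], scale=0.75, size=(n, 2))
y_src, y_src_test, y_tgt_test = label(X_src), label(X_src_test), label(X_tgt_test)

# 1) Baseline: train on source only, evaluate on both domains.
clf = LogisticRegression().fit(X_src, y_src)
acc_source = clf.score(X_src_test, y_src_test)
acc_target_plain = clf.score(X_tgt_test, y_tgt_test)

# 2) Estimate importance weights with a domain classifier that separates
#    source (0) from unlabeled target (1) samples. With equal sample sizes,
#    p(target | x) / p(source | x) approximates the density ratio.
X_dom = np.vstack([X_src, X_tgt])
d = np.concatenate([np.zeros(n), np.ones(n)])
dom_clf = LogisticRegression().fit(X_dom, d)
p_tgt = np.clip(dom_clf.predict_proba(X_src)[:, 1], 1e-6, 1 - 1e-6)
weights = p_tgt / (1.0 - p_tgt)

# 3) Refit on the source data, weighting samples that look like the target.
clf_w = LogisticRegression().fit(X_src, y_src, sample_weight=weights)
acc_target_weighted = clf_w.score(X_tgt_test, y_tgt_test)

print(f"source test accuracy:          {acc_source:.3f}")
print(f"target accuracy (no DA):       {acc_target_plain:.3f}")
print(f"target accuracy (reweighting): {acc_target_weighted:.3f}")
```

Target labels are used only for the final evaluation, never for fitting, which matches the unsupervised DA setting. Whether reweighting helps on a real tabular dataset is exactly what such a tutorial would need to check.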

Once we have this in place, our next step is to package the dataset in a user-friendly way, similar to what we've seen with Office31. This way, it's ready to roll 'out-of-the-box' for anyone installing our library.

Why this order? Well, it's crucial to ensure that our chosen dataset fits DA library needs.

Let me know what you think.

YanisLalou commented 5 months ago

As for the dataset choice: at first glance, the Airline Passenger Satisfaction dataset on Kaggle has no license defined, no author names, and no DOI, so we don't even know whether it's open source. Initially we wanted to select one of the datasets used in this paper: https://arxiv.org/pdf/2312.07577.pdf These datasets are all open source, and there are also benchmarks for them. However, we haven't decided yet which one to add to skada first. Maybe the one with the most citations? Or the one that seems to get the best accuracy in benchmarks with DA methods?

kachayev commented 5 months ago

Oh, interesting. I haven't seen this paper yet. Were you able to re-run their experiments to verify results?

YanisLalou commented 5 months ago

I don't think we've tried to reproduce the results, and I don't know if we plan to.

tgnassou commented 5 months ago

It is a tabular distribution-shift benchmark, but they don't use any domain adaptation method in their benchmark :( So I didn't try to rerun their code. But it would be interesting for the benchmark.

kachayev commented 5 months ago

> they don't use any domain adaptation method in their benchmark

Yeah... whichever dataset we choose, it's essential to ensure that we can showcase the use of DA methods.