Open kachayev opened 11 months ago
Just to be clear: the goal here is to create a file like _office.py to download/process a tabular dataset, right?
Exactly, but I think we want an easy one to start with, a standard (SOTA) dataset I would say. Maybe we need to check the paper to see which one is the most popular. Later on, we will add more complex datasets to a bench_skada repo.
I would say that, first off, we need to pick a suitable tabular dataset for Domain Adaptation (DA). I've had some preliminary results with this dataset: Airline Passenger Satisfaction on Kaggle. It's perfect for our needs because you can easily differentiate between personal and business flights, giving us a clear source vs. target scenario.
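To make the source vs. target idea concrete, here is a minimal sketch of splitting a table into two domains on a single column. The tiny DataFrame is a made-up stand-in for the Kaggle data, and the column names and values (`Type of Travel`, `Flight Distance`, `satisfaction`) as well as the helper `split_by_domain` are assumptions for illustration, not a confirmed schema or an existing skada function.

```python
import pandas as pd

# Toy stand-in for the real table; column names and values are assumptions.
df = pd.DataFrame({
    "Type of Travel": ["Business travel", "Personal Travel",
                       "Business travel", "Personal Travel"],
    "Flight Distance": [2100, 350, 1800, 420],
    "satisfaction": ["satisfied", "neutral or dissatisfied",
                     "satisfied", "satisfied"],
})


def split_by_domain(frame, domain_col, source_value, label_col):
    """Hypothetical helper: split a table into source and target domains.

    Rows where `domain_col` equals `source_value` form the source domain,
    all remaining rows form the target domain.
    """
    is_source = frame[domain_col] == source_value
    features = frame.drop(columns=[domain_col, label_col])
    labels = frame[label_col]
    return (features[is_source], labels[is_source],
            features[~is_source], labels[~is_source])


X_s, y_s, X_t, y_t = split_by_domain(
    df, "Type of Travel", "Business travel", "satisfaction"
)
print(len(X_s), "source rows,", len(X_t), "target rows")
```

The domain column is dropped from the features so that a classifier cannot simply read the domain off the input.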
Next up, let's create a concise tutorial. The goal here is to demonstrate how the performance of a classifier trained on one domain tends to decline when applied to another, and how to improve it using DA techniques. At this stage there's no need to worry about dataset processing; we're talking about maybe 10-20 lines of code to download and clean up the dataset.
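As a rough illustration of what such a tutorial could show, the sketch below uses synthetic data rather than the airline dataset, and plain scikit-learn rather than skada. It trains a classifier on a source domain, measures the drop on a shifted target domain, then retrains with importance weights estimated by a domain classifier. Importance weighting is just one simple DA technique picked for this example; the data generator `make_domain` and all variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)


def make_domain(n, center):
    """Sample one synthetic domain; only the distribution of x0 differs."""
    x0 = rng.normal(center, 0.5, n)
    x1 = rng.normal(0.0, 1.0, n)
    X = np.column_stack([x0, x1])
    y = (x1 > x0**2 - 1).astype(int)  # same labelling rule in both domains
    return X, y


X_source, y_source = make_domain(2000, center=-1.0)
X_target, y_target = make_domain(2000, center=1.0)

# Baseline: fit on the source domain only, then evaluate on both domains.
baseline = LogisticRegression().fit(X_source, y_source)
acc_source = baseline.score(X_source, y_source)
acc_target = baseline.score(X_target, y_target)

# Importance weighting: a domain classifier estimates p(target | x), and
# p / (1 - p) approximates the density ratio used to reweight source samples.
X_all = np.vstack([X_source, X_target])
domain = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
domain_clf = LogisticRegression().fit(X_all, domain)
p_target = domain_clf.predict_proba(X_source)[:, 1]
weights = p_target / (1.0 - p_target)

# Target labels are used only for evaluation, never for fitting.
reweighted = LogisticRegression().fit(X_source, y_source, sample_weight=weights)
acc_target_da = reweighted.score(X_target, y_target)

print(f"source accuracy:              {acc_source:.2f}")
print(f"target accuracy (no DA):      {acc_target:.2f}")
print(f"target accuracy (reweighted): {acc_target_da:.2f}")
```

How much the reweighted model recovers depends heavily on how much the two domains overlap, which is exactly the kind of property worth checking before committing to a dataset.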
Once we have this in place, our next step is to package the dataset in a user-friendly way, similar to what we've seen with Office31. This way, it's ready to roll 'out-of-the-box' for anyone installing our library.
Why this order? Well, it's crucial to ensure that our chosen dataset fits the needs of a DA library.
Let me know what you think.
About the dataset choice: at first glance, the Airline Passenger Satisfaction dataset on Kaggle has no license defined, no author names, and no DOI. Thus we don't even know whether it's open source or not. Originally we wanted to select one of the datasets used in this paper: https://arxiv.org/pdf/2312.07577.pdf These datasets are all open source, and there are also benchmarks with them. However, we haven't decided yet which one we're going to add to skada first. Maybe the one with the most citations? The one that seems to have the best accuracy results in the benchmarks with DA methods?
Oh, interesting. I haven't seen this paper yet. Were you able to re-run their experiments to verify results?
I don't think we've tried to reproduce the results, and I don't know if we plan to.
It is a distribution-shift tabular dataset, but they don't use any domain adaptation method in their benchmark :( So I didn't try to rerun their code. But it will be interesting for the benchmark.
> they don't use any domain adaptation method in their benchmark
Yeah... whichever dataset we choose, it's essential to ensure that we can showcase the use of DA methods.
For example, the one with personal/business flights I've been experimenting with.
It would be nice to have more than computer vision (CV) datasets provided out of the box.