microsoft / Semi-supervised-learning

A Unified Semi-Supervised Learning Codebase (NeurIPS'22)
https://usb.readthedocs.io
MIT License

Custom Dataset Integration in USB #228

Closed pengyrDL closed 1 month ago

pengyrDL commented 1 month ago

🚀 Feature

Custom Dataset Integration in Unified Semi-supervised Learning Benchmark

Motivation

The current iteration of USB (Unified Semi-supervised learning Benchmark) is a valuable resource for researchers and practitioners, providing benchmark datasets for comparing and evaluating semi-supervised learning models. However, the ability to incorporate custom datasets would significantly enhance its utility. Many users work with proprietary or niche datasets tailored to specific problems or industries. The strict focus on pre-defined benchmarks can be limiting, as it does not fully represent the diverse challenges encountered in real-world scenarios. By enabling the use of custom datasets, USB could become not just a benchmarking tool but also a versatile platform for experimenting with and developing semi-supervised learning models across various domains.

Pitch

I propose extending the functionality of the USB framework to allow users to integrate their own datasets alongside the existing benchmarks. This feature should provide a standardized way to input data, define splits for training, validation, and testing, and ensure compatibility with the semi-supervised learning algorithms already implemented within the USB.
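To make the request concrete, here is a minimal sketch of the kind of split definition I have in mind. Everything in it is hypothetical: `SSLSplit` and `make_ssl_split` are names I made up for illustration and are not part of USB's existing API. It assumes integer class labels and keeps the labeled and unlabeled pools disjoint, which is only one possible convention.

```python
import random
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class SSLSplit:
    """Hypothetical container of sample indices per split (not a USB class)."""
    labeled: List[int] = field(default_factory=list)
    unlabeled: List[int] = field(default_factory=list)
    val: List[int] = field(default_factory=list)
    test: List[int] = field(default_factory=list)


def make_ssl_split(
    targets: Sequence[int],
    labels_per_class: int,
    val_frac: float = 0.1,
    test_frac: float = 0.1,
    seed: int = 0,
) -> SSLSplit:
    """Build a class-stratified labeled/unlabeled/val/test index split."""
    rng = random.Random(seed)

    # Group sample indices by class so every split is stratified.
    by_class = defaultdict(list)
    for idx, y in enumerate(targets):
        by_class[y].append(idx)

    split = SSLSplit()
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)

        n_test = int(len(idxs) * test_frac)
        n_val = int(len(idxs) * val_frac)
        train = idxs[n_test + n_val:]
        if len(train) < labels_per_class:
            raise ValueError(f"class {y} has too few training samples")

        split.test.extend(idxs[:n_test])
        split.val.extend(idxs[n_test:n_test + n_val])
        # The first few training samples keep their labels; the rest are
        # treated as unlabeled data.
        split.labeled.extend(train[:labels_per_class])
        split.unlabeled.extend(train[labels_per_class:])
    return split
```

A user would call something like `make_ssl_split(my_targets, labels_per_class=4)` on their own label list and pass the resulting index lists to whatever dataset wrapper the framework expects. Fixing the seed keeps the split reproducible across algorithms, which matters for fair comparison.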

To achieve this, we might need:

Alternatives

An alternative solution might involve creating separate branches or forks of the USB specifically for custom dataset experimentation. While this could provide a workaround, it would not be as seamless or user-friendly as having native support for custom datasets within the main USB platform.

Additional context

Incorporating this feature could increase the adoption of the USB by making it more relevant to a wider range of users. It could also foster a community where sharing and collaboration on various semi-supervised learning problems are encouraged, enhancing the collective knowledge base and potentially leading to advancements in the field.