huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
18.91k stars 2.62k forks

contribute data loading for object detection datasets with yolo data format #4618

Open faizankshaikh opened 2 years ago

faizankshaikh commented 2 years ago

Is your feature request related to a problem? Please describe.
At the moment, HF datasets loads image classification datasets out of the box. It would be useful to also have a data loader for standard object detection datasets (original discussion here).

Describe the solution you'd like
I wrote a custom script to load a dataset stored in the YOLO data format.
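For readers unfamiliar with the format, here is a minimal sketch of what parsing it involves. It is not the script mentioned above. It assumes the common YOLO convention of one object per line, written as `class_id x_center y_center width height`, with coordinates normalized to the range 0 to 1. The names `YoloBox`, `parse_yolo_labels`, and `to_pixel_xyxy` are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Tuple


class YoloBox(NamedTuple):
    """One annotation line (hypothetical container, not part of any library)."""
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float


def parse_yolo_labels(text: str) -> List[YoloBox]:
    """Parse the contents of one YOLO label file into a list of boxes."""
    boxes = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 5:
            raise ValueError(f"line {line_number}: expected 5 fields, got {len(parts)}")
        boxes.append(YoloBox(int(parts[0]), *(float(v) for v in parts[1:])))
    return boxes


def to_pixel_xyxy(box: YoloBox, image_width: int, image_height: int) -> Tuple[float, float, float, float]:
    """Convert a normalized center/size box to pixel (x_min, y_min, x_max, y_max)."""
    x_min = (box.x_center - box.width / 2) * image_width
    y_min = (box.y_center - box.height / 2) * image_height
    x_max = (box.x_center + box.width / 2) * image_width
    y_max = (box.y_center + box.height / 2) * image_height
    return x_min, y_min, x_max, y_max
```

Keeping parsing separate from coordinate conversion lets a loader store the normalized values as they are and leave conversion to the user.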

Describe alternatives you've considered
The script can be either a standalone dataset builder or a modified version of ImageFolder.

Additional context
I would be happy to contribute to this, but I would do it at a very slow pace (maybe a month or two), as I have exams approaching 😄

mariosasko commented 2 years ago

Hi! The imagefolder script is already quite complex, so a standalone script sounds better. Also, I suggest we create an org on the Hub (e.g. hf-loaders) and store such scripts there for easier maintenance rather than having them as packaged modules (IMO only very generic loaders should be packaged). WDYT @lhoestq @albertvillanova @polinaeterna?

polinaeterna commented 2 years ago

@mariosasko sounds good to me!

faizankshaikh commented 2 years ago

Thank you for the suggestion, @mariosasko. I agree with the point, but I have a few questions:

  1. How would the user access the script if it's not a part of the core codebase?
  2. Could you tell me which tasks I would need to complete to contribute the code? As I understand it, the steps would be:
    1. Create a new org "hf-loaders" and add you (and more HF people) to the org
    2. Add data loader script as a (model?)
    3. Test it with a dataset on HF hub
  3. We should maybe brainstorm which public datasets use this format (YOLO type) and are the most important ones to test the script with. We could even add those datasets to the HF Hub alongside the script.

mariosasko commented 2 years ago

  1. Like this: load_dataset("hf-loaders/yolo", data_files=...)
  2. The steps would be:
    1. Create a new org hf-community-loaders (IMO a better name than "hf-loaders") and add me (as an admin)
    2. Create a new dataset repo yolo and add the loading script to it (yolo.py)
    3. Open a discussion to request our review
  3. I like this idea. Another option is to add snippets that describe how to load such datasets using the yolo loader.
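
As a rough illustration of what the core of such a `yolo.py` loading script might do, here is a standard-library-only sketch of the example-generation step. In a real loading script this logic would normally live in the `_generate_examples` method of a `datasets.GeneratorBasedBuilder` subclass. The sketch assumes each image has a label file with the same stem and a `.txt` extension next to it, which is only one of several layouts used for YOLO datasets. The function name `generate_examples`, the `IMAGE_EXTENSIONS` set, and the output field names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterator, Tuple

# Assumed set of image extensions; a real loader would likely support more.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def generate_examples(root: str) -> Iterator[Tuple[int, Dict]]:
    """Yield (key, example) pairs for every image found under `root`.

    Assumes a label file named like the image but ending in ".txt" sits
    next to it. Images without a label file yield an empty object list.
    """
    image_paths = sorted(
        p for p in Path(root).rglob("*") if p.suffix.lower() in IMAGE_EXTENSIONS
    )
    for key, image_path in enumerate(image_paths):
        label_path = image_path.with_suffix(".txt")
        objects = []
        if label_path.exists():
            for line in label_path.read_text().splitlines():
                parts = line.split()
                if len(parts) != 5:
                    continue  # skip blank or malformed lines in this sketch
                objects.append(
                    {
                        "category": int(parts[0]),
                        # normalized [x_center, y_center, width, height]
                        "bbox": [float(v) for v in parts[1:]],
                    }
                )
        yield key, {"image": str(image_path), "objects": objects}
```

Once a script built along these lines is hosted in a dataset repo, users would call it in the way shown in point 1 above, passing their own files through `data_files`.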