huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Add a Depth Estimation dataset - DIODE / NYUDepth / KITTI #5255

Closed sayakpaul closed 1 year ago

sayakpaul commented 1 year ago

Name

NYUDepth

Paper

http://cs.nyu.edu/~silberman/papers/indoor_seg_support.pdf

Data

https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html

Motivation

Depth estimation is an important problem in computer vision. We already have a couple of depth estimation models on the Hub as well.

Would be nice to have a dataset for depth estimation. These datasets usually have three things: an input image, a depth map image, and a depth mask (a validity mask indicating whether the reading for a given pixel is valid or not). Since we already have semantic segmentation datasets on the Hub, I don't think we need any extended utilities to support this addition.
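
For a rough idea, here is a minimal sketch of what the feature schema could look like (the column names and feature types are assumptions for illustration, not a final design):

```python
from datasets import Features, Image

# Hypothetical schema for a depth estimation dataset: each example carries the
# RGB input, the depth map, and the validity mask, all stored as images.
features = Features(
    {
        "image": Image(),       # RGB input image
        "depth_map": Image(),   # per-pixel depth readings
        "depth_mask": Image(),  # validity mask (which pixels carry a valid reading)
    }
)
```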

Having this dataset would also allow us to author data preprocessing guides for depth estimation, similar to the ones we have for other tasks (example).

Ccing @osanseviero @nateraw @NielsRogge

Happy to work on adding it.

osanseviero commented 1 year ago

Also cc @mariosasko and @lhoestq

lhoestq commented 1 year ago

Cool ! Let us know if you have questions or if we can help :)

I guess we'll also have to create the NYU CS Department on the Hub ?

sayakpaul commented 1 year ago

I guess we'll also have to create the NYU CS Department on the Hub ?

Yes, you're right! Let me add it to my profile first, and then we can transfer. Meanwhile, if it's recommended to loop the dataset author in here, let me know.

Also, the NYU Depth dataset seems big. Are there any example scripts for creating image datasets that I could refer to?

lhoestq commented 1 year ago

You can check the imagenet-1k one.

PS: If the license allows it, it'd be nice to host the dataset as sharded TAR archives (like imagenet-1k) instead of the ZIP format they use.

if it's recommended to loop the dataset author in here, let me know.

It's recommended indeed; you can send them an email once you have the dataset ready and invite them to the org on the Hub.

sayakpaul commented 1 year ago

You can check the imagenet-1k one.

Where can I find the script? Are you referring to https://huggingface.co/docs/datasets/image_process ? Or is there anything more specific?

lhoestq commented 1 year ago

You can find it here: https://huggingface.co/datasets/imagenet-1k/blob/main/imagenet-1k.py
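
At a high level, the pattern in that script looks something like the sketch below (simplified to images only; the archive names, features, and file layout here are placeholders rather than the real NYU Depth v2 structure):

```python
import datasets

# Hypothetical shard names; a real script would list its actual archives.
_TRAIN_ARCHIVES = [f"data/train-{i:02d}.tar" for i in range(16)]
_VAL_ARCHIVES = [f"data/val-{i:02d}.tar" for i in range(4)]


class ShardedTarDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"image": datasets.Image()})
        )

    def _split_generators(self, dl_manager):
        train_paths = dl_manager.download(_TRAIN_ARCHIVES)
        val_paths = dl_manager.download(_VAL_ARCHIVES)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"archives": [dl_manager.iter_archive(p) for p in train_paths]},
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={"archives": [dl_manager.iter_archive(p) for p in val_paths]},
            ),
        ]

    def _generate_examples(self, archives):
        idx = 0
        for archive in archives:
            # iter_archive yields (path inside the TAR, file object) pairs and
            # keeps the script streamable.
            for path, file in archive:
                yield idx, {"image": {"path": path, "bytes": file.read()}}
                idx += 1
```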

sayakpaul commented 1 year ago

Update: started working on it here: https://huggingface.co/datasets/sayakpaul/nyu_depth_v2.

I am facing an issue and I have detailed it here: https://huggingface.co/datasets/sayakpaul/nyu_depth_v2/discussions/1

Edit: The issue is gone.

However, since the dataset is distributed as a single TAR archive (following the URL used in TensorFlow Datasets), loading is taking longer. How would you suggest sharding the single TAR archive?

@lhoestq

sayakpaul commented 1 year ago

A Colab Notebook demonstrating the dataset loading part:

https://colab.research.google.com/gist/sayakpaul/aa0958c8d4ad8518d52a78f28044d871/scratchpad.ipynb

@osanseviero @lhoestq

I will also work on a notebook for exploring the dataset, including data visualization.

sayakpaul commented 1 year ago

@osanseviero @lhoestq things seem to work fine with the current version of the dataset here. Here's a notebook I developed to help with visualization: https://colab.research.google.com/drive/1K3ZU8XUPRDOYD38MQS9nreQXJYitlKSW?usp=sharing.
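
In essence, the visualization boils down to something like this (column names are assumptions here; streaming is used just to avoid downloading the full dataset):

```python
import matplotlib.pyplot as plt
from datasets import load_dataset

# Stream a single example instead of downloading every archive.
ds = load_dataset("sayakpaul/nyu_depth_v2", split="train", streaming=True)
sample = next(iter(ds))

fig, (ax_img, ax_depth) = plt.subplots(1, 2, figsize=(10, 4))
ax_img.imshow(sample["image"])
ax_img.set_title("RGB image")
ax_depth.imshow(sample["depth_map"], cmap="viridis")
ax_depth.set_title("Depth map")
for ax in (ax_img, ax_depth):
    ax.axis("off")
plt.show()
```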

@lhoestq I need your help with the following:

However, since the dataset is distributed as a single TAR archive (following the URL used in TensorFlow Datasets), loading is taking longer. How would you suggest sharding the single TAR archive?

@osanseviero @lhoestq question for you:

Where should we host the dataset? I think hosting it under hf.co/datasets (that is, with HF as the org) is fine, as we have ImageNet-1k hosted similarly. We could then reach out to Diana Wofk (author of Fast Depth and owner of the repo on which the TFDS NYU Depth V2 builder is based) for a review. WDYT?

lhoestq commented 1 year ago

However, since the dataset is distributed as a single TAR archive (following the URL used in TensorFlow Datasets), loading is taking longer. How would you suggest sharding the single TAR archive?

First you can separate the train data and the validation data.

Then since the dataset is quite big, you can even shard the train split and the validation split in multiple TAR archives. Something around 16 archives for train and 4 for validation would be fine for example.

Also, no need to gzip the TAR archives; the images are already compressed as PNG or JPEG.

sayakpaul commented 1 year ago

Then since the dataset is quite big, you can even shard the train split and the validation split in multiple TAR archives. Something around 16 archives for train and 4 for validation would be fine for example.

Yes, I got you. But this process seems manual and has to be tailored to the given dataset. Do you have a script that you used to create the ImageNet-1k shards?

Also, no need to gzip the TAR archives; the images are already compressed as PNG or JPEG.

I was not going to do that. Not sure what brought it up.

lhoestq commented 1 year ago

Yes, I got you. But this process seems manual and has to be tailored to the given dataset. Do you have a script that you used to create the ImageNet-1k shards?

I don't, but I agree it'd be nice to have a script for that !
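
Something along these lines could be a starting point (an untested sketch; it assumes every file on disk is a self-contained sample, e.g. one .h5 holding both RGB and depth, and the paths in the usage comment are hypothetical):

```python
import math
import os
import tarfile


def shard_to_tars(src_dir: str, out_prefix: str, num_shards: int) -> None:
    """Re-pack all files under src_dir into num_shards uncompressed TAR archives."""
    files = sorted(
        os.path.join(root, name)
        for root, _, names in os.walk(src_dir)
        for name in names
    )
    shard_size = math.ceil(len(files) / num_shards)
    for i in range(num_shards):
        chunk = files[i * shard_size : (i + 1) * shard_size]
        with tarfile.open(f"{out_prefix}-{i:05d}-of-{num_shards:05d}.tar", "w") as tar:
            for path in chunk:
                tar.add(path, arcname=os.path.relpath(path, src_dir))


# e.g. 16 shards for train and 4 for validation, as suggested above:
# shard_to_tars("nyu_depth_v2/train", "train", 16)
# shard_to_tars("nyu_depth_v2/val", "val", 4)
```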

I was not going to do that. Not sure what brought it up.

The original dataset is gzipped for some reason

sayakpaul commented 1 year ago

Oh, I am using this URL for the download: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/datasets/nyu_depth_v2/nyu_depth_v2_dataset_builder.py#L24.

lhoestq commented 1 year ago

Where should we host the dataset? I think hosting it under hf.co/datasets (that is, with HF as the org) is fine, as we have ImageNet-1k hosted similarly.

Maybe you can create an org for NYU Courant (this is the institute of the main dataset author's lab, if I'm not mistaken) and invite the authors to join.

We don't add datasets without a namespace anymore.

sayakpaul commented 1 year ago

Updates: https://huggingface.co/datasets/sayakpaul/nyu_depth_v2/discussions/5

The entire process (preparing multiple archives, writing the data loading script, etc.) was fun and engaging, thanks to the documentation. I believe we could work on a small blog post that would serve as a reference for future contributors following this path. What say?

Cc: @lhoestq @osanseviero

lhoestq commented 1 year ago

I believe we could work on a small blog post that would serve as a reference for future contributors following this path. What say?

@polinaeterna already mentioned it would be nice to present this process for audio (it's exactly the same); I believe it can be useful to many people.

sayakpaul commented 1 year ago

Cool. Let's work on that after the NYU Depth Dataset is fully in on Hub (under the appropriate org). 🤗

sayakpaul commented 1 year ago

@lhoestq need to discuss something while I am adding the dataset card to https://huggingface.co/datasets/sayakpaul/nyu_depth_v2/.

As per Papers With Code, NYU Depth v2 is used for many different tasks:

So, while writing the supported-tasks part of the dataset card, should we focus on all of these? IMO, we could focus on just depth estimation and semantic segmentation for now, since we have supported models for these two. WDYT?

Also, I am getting:

remote: Your push was accepted, but with warnings:
remote: - Warning: The task_ids "depth-estimation" is not in the official list: acceptability-classification, entity-linking-classification, fact-checking, intent-classification, multi-class-classification, multi-label-classification, multi-input-text-classification, natural-language-inference, semantic-similarity-classification, sentiment-classification, topic-classification, semantic-similarity-scoring, sentiment-scoring, sentiment-analysis, hate-speech-detection, text-scoring, named-entity-recognition, part-of-speech, parsing, lemmatization, word-sense-disambiguation, coreference-resolution, extractive-qa, open-domain-qa, closed-domain-qa, news-articles-summarization, news-articles-headline-generation, dialogue-generation, dialogue-modeling, language-modeling, text-simplification, explanation-generation, abstractive-qa, open-domain-abstractive-qa, closed-domain-qa, open-book-qa, closed-book-qa, slot-filling, masked-language-modeling, keyword-spotting, speaker-identification, audio-intent-classification, audio-emotion-recognition, audio-language-identification, multi-label-image-classification, multi-class-image-classification, face-detection, vehicle-detection, instance-segmentation, semantic-segmentation, panoptic-segmentation, image-captioning, grasping, task-planning, tabular-multi-class-classification, tabular-multi-label-classification, tabular-single-column-regression, rdf-to-text, multiple-choice-qa, multiple-choice-coreference-resolution, document-retrieval, utterance-retrieval, entity-linking-retrieval, fact-checking-retrieval, univariate-time-series-forecasting, multivariate-time-series-forecasting, visual-question-answering, document-question-answering
remote: ----------------------------------------------------------
remote: Please find the documentation at:
remote: https://huggingface.co/docs/hub/model-cards#model-card-metadata

What should be the plan of action for this?

Cc: @osanseviero

osanseviero commented 1 year ago

What should be the plan of action for this?

When you merged https://github.com/huggingface/hub-docs/pull/488, a JS Interfaces GitHub Actions workflow ran: https://github.com/huggingface/hub-docs/actions/workflows/js-interfaces-tests.yml. It has a step called export-task scripts, which exports an interface you can use in datasets. If you look at the logs, it prints out a map. This map can replace https://github.com/huggingface/datasets/blob/main/src/datasets/utils/resources/tasks.json (tasks.json was generated with this script), which should add depth estimation.

sayakpaul commented 1 year ago

Thanks @osanseviero.

https://github.com/huggingface/datasets/pull/5335

sayakpaul commented 1 year ago

Closing the issue as the dataset has been successfully added: https://huggingface.co/datasets/sayakpaul/nyu_depth_v2
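
For anyone landing here later, loading it looks roughly like this (the split and column names may differ slightly; check the dataset card):

```python
from datasets import load_dataset

ds = load_dataset("sayakpaul/nyu_depth_v2", split="train")
print(ds)            # number of examples and column names
print(ds[0].keys())  # e.g. dict_keys(['image', 'depth_map'])
```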