Closed sayakpaul closed 1 year ago
Also cc @mariosasko and @lhoestq
Cool ! Let us know if you have questions or if we can help :)
I guess we'll also have to create the NYU CS Department on the Hub ?
> I guess we'll also have to create the NYU CS Department on the Hub ?
Yes, you're right! Let me add it to my profile first, and then we can transfer. Meanwhile, if it's recommended to loop the dataset author in here, let me know.
Also, the NYU Depth dataset seems big. Any example scripts for creating image datasets that I could refer to?
You can check the imagenet-1k one.
PS: If the license allows it, it'd be nice to host the dataset as sharded TAR archives (like imagenet-1k) instead of the ZIP format they use:
> if it's recommended to loop the dataset author in here, let me know.
It's recommended indeed, you can send them an email once you have the dataset ready and invite them to the org on the Hub
> You can check the imagenet-1k one.
Where can I find the script? Are you referring to https://huggingface.co/docs/datasets/image_process ? Or is there anything more specific?
You can find it here: https://huggingface.co/datasets/imagenet-1k/blob/main/imagenet-1k.py
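For reference, the core of such a loading script is a generator that walks the archive members and yields examples. Here is a minimal stdlib-only sketch of that pattern (the field layout and file names are illustrative, not the actual imagenet-1k.py):

```python
import io
import tarfile

def generate_examples(tar_path):
    """Yield (key, example) pairs from a TAR archive of images,
    mirroring the _generate_examples pattern of a datasets loading script."""
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            yield member.name, {"image": {"path": member.name, "bytes": data}}

# Build a tiny demo archive so the generator has something to read.
with tarfile.open("demo.tar", "w") as tar:
    payload = b"fake png bytes"
    info = tarfile.TarInfo("train/img_0.png")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

examples = dict(generate_examples("demo.tar"))
```

In a real script this generator would live inside a `GeneratorBasedBuilder` subclass and receive the archive iterator from the download manager.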
Update: started working on it here: https://huggingface.co/datasets/sayakpaul/nyu_depth_v2.
I am facing an issue and I have detailed it here: https://huggingface.co/datasets/sayakpaul/nyu_depth_v2/discussions/1
Edit: The issue is gone.
However, since the dataset is distributed as a single TAR archive (following the URL used in TensorFlow Datasets), loading takes longer. How would you suggest sharding the single TAR archive?
@lhoestq
A Colab Notebook demonstrating the dataset loading part:
https://colab.research.google.com/gist/sayakpaul/aa0958c8d4ad8518d52a78f28044d871/scratchpad.ipynb
@osanseviero @lhoestq
I will work on a notebook demonstrating how to use the dataset, including data visualization.
@osanseviero @lhoestq things seem to work fine with the current version of the dataset here. Here's a notebook I developed to help with visualization: https://colab.research.google.com/drive/1K3ZU8XUPRDOYD38MQS9nreQXJYitlKSW?usp=sharing.
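As background for the visualization step: raw depth maps store metric distances, so displaying one usually means clipping and rescaling to 8-bit grayscale. A pure-Python sketch (nested lists stand in for arrays; the `max_depth=10.0` cap is an assumption for indoor Kinect-range scenes, not a value from the dataset card):

```python
def depth_to_uint8(depth, max_depth=10.0):
    """Clip metric depth values to [0, max_depth] and rescale to 0..255
    so the map can be shown as a grayscale image."""
    scale = 255.0 / max_depth
    return [
        [int(min(max(d, 0.0), max_depth) * scale) for d in row]
        for row in depth
    ]

vis = depth_to_uint8([[0.0, 5.0], [10.0, 12.5]])  # 12.5 m is clipped to the cap
```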
@lhoestq I need your help with the following:
> However, since the dataset is distributed as a single TAR archive (following the URL used in TensorFlow Datasets), loading takes longer. How would you suggest sharding the single TAR archive?
@osanseviero @lhoestq question for you:
Where should we host the dataset? I think hosting it under hf.co/datasets (i.e., with HF as the org) is fine, as we have ImageNet-1k hosted similarly. We could then reach out to Diana Wofk (author of Fast Depth and owner of the repo on which the TFDS NYU Depth V2 is based) for a review. WDYT?
> However, since the dataset is distributed as a single TAR archive (following the URL used in TensorFlow Datasets), loading takes longer. How would you suggest sharding the single TAR archive?
First you can separate the train data and the validation data.
Then since the dataset is quite big, you can even shard the train split and the validation split in multiple TAR archives. Something around 16 archives for train and 4 for validation would be fine for example.
Also no need to gzip the TAR archives, the images are already compressed in png or jpeg.
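The splitting itself can be done with the stdlib alone. A rough sketch (the `-00000-of-00016`-style shard naming and the round-robin assignment are illustrative choices, not a `datasets` requirement):

```python
import io
import tarfile

def shard_tar(src_path, out_prefix, num_shards):
    """Distribute the files of one TAR archive round-robin into
    num_shards uncompressed TAR shards named like train-00000-of-00004.tar."""
    shards = [
        tarfile.open(f"{out_prefix}-{i:05d}-of-{num_shards:05d}.tar", "w")
        for i in range(num_shards)
    ]
    try:
        with tarfile.open(src_path) as src:
            members = [m for m in src.getmembers() if m.isfile()]
            for idx, member in enumerate(members):
                shards[idx % num_shards].addfile(member, src.extractfile(member))
    finally:
        for shard in shards:
            shard.close()

# Demo: build a toy 8-file archive, then split it into 4 shards of 2 files each.
with tarfile.open("full.tar", "w") as tar:
    for i in range(8):
        payload = f"pixels {i}".encode()
        info = tarfile.TarInfo(f"train/{i:04d}.png")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

shard_tar("full.tar", "train", 4)
shard_sizes = [
    len(tarfile.open(f"train-{i:05d}-of-00004.tar").getmembers())
    for i in range(4)
]
```

For a paired dataset like this one, you would want related files (RGB image and its depth map) to land in the same shard, e.g. by grouping members per scene before assigning them.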
> Then since the dataset is quite big, you can even shard the train split and the validation split in multiple TAR archives. Something around 16 archives for train and 4 for validation would be fine for example.
Yes, I got you. But this process seems to be manual and should be tailored for the given dataset. Do you have any script that you used to create the ImageNet-1k shards?
> Also no need to gzip the TAR archives, the images are already compressed in png or jpeg.
I was not going to do that. Not sure what brought it up.
> Yes, I got you. But this process seems to be manual and should be tailored for the given dataset. Do you have any script that you used to create the ImageNet-1k shards?
I don't, but I agree it'd be nice to have a script for that !
> I was not going to do that. Not sure what brought it up.
The original dataset is gzipped for some reason
Oh, I am using this URL for the download: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/datasets/nyu_depth_v2/nyu_depth_v2_dataset_builder.py#L24.
> Where should we host the dataset? I think hosting it under hf.co/datasets (i.e., with HF as the org) is fine, as we have ImageNet-1k hosted similarly.
Maybe you can create an org for NYU Courant (this is the institute of the lab of the main author of the dataset if I'm not mistaken), and invite the authors to join.
We don't add datasets without a namespace anymore
Updates: https://huggingface.co/datasets/sayakpaul/nyu_depth_v2/discussions/5
The entire process (preparing multiple archives, preparing the data loading script, etc.) was fun and engaging, thanks to the documentation. I believe we could work on a small blog post that would serve as a reference for future contributors following this path. What do you say?
Cc: @lhoestq @osanseviero
> I believe we could work on a small blog post that would serve as a reference for future contributors following this path. What do you say?
@polinaeterna already mentioned it would be nice to present this process for audio (it's exactly the same), I believe it can be useful to many people
Cool. Let's work on that after the NYU Depth dataset is fully on the Hub (under the appropriate org). 🤗
@lhoestq need to discuss something while I am adding the dataset card to https://huggingface.co/datasets/sayakpaul/nyu_depth_v2/.
As per Papers With Code, NYU Depth v2 is used for many different tasks:
So, while writing the supported tasks section of the dataset card, should we cover all of these? IMO, we could focus on just depth estimation and semantic segmentation for now, since we have supported models for these two. WDYT?
Also, I am getting:

```
remote: Your push was accepted, but with warnings:
remote: - Warning: The task_ids "depth-estimation" is not in the official list: acceptability-classification, entity-linking-classification, fact-checking, intent-classification, multi-class-classification, multi-label-classification, multi-input-text-classification, natural-language-inference, semantic-similarity-classification, sentiment-classification, topic-classification, semantic-similarity-scoring, sentiment-scoring, sentiment-analysis, hate-speech-detection, text-scoring, named-entity-recognition, part-of-speech, parsing, lemmatization, word-sense-disambiguation, coreference-resolution, extractive-qa, open-domain-qa, closed-domain-qa, news-articles-summarization, news-articles-headline-generation, dialogue-generation, dialogue-modeling, language-modeling, text-simplification, explanation-generation, abstractive-qa, open-domain-abstractive-qa, closed-domain-qa, open-book-qa, closed-book-qa, slot-filling, masked-language-modeling, keyword-spotting, speaker-identification, audio-intent-classification, audio-emotion-recognition, audio-language-identification, multi-label-image-classification, multi-class-image-classification, face-detection, vehicle-detection, instance-segmentation, semantic-segmentation, panoptic-segmentation, image-captioning, grasping, task-planning, tabular-multi-class-classification, tabular-multi-label-classification, tabular-single-column-regression, rdf-to-text, multiple-choice-qa, multiple-choice-coreference-resolution, document-retrieval, utterance-retrieval, entity-linking-retrieval, fact-checking-retrieval, univariate-time-series-forecasting, multivariate-time-series-forecasting, visual-question-answering, document-question-answering
remote: ----------------------------------------------------------
remote: Please find the documentation at:
remote: https://huggingface.co/docs/hub/model-cards#model-card-metadata
```
What should be the plan of action for this?
Cc: @osanseviero
> What should be the plan of action for this?
When https://github.com/huggingface/hub-docs/pull/488 was merged, a JS Interfaces GitHub Actions workflow ran: https://github.com/huggingface/hub-docs/actions/workflows/js-interfaces-tests.yml. It has a step called export-task scripts which exports an interface you can use in datasets. If you look at the logs, it prints out a map. This map can replace https://github.com/huggingface/datasets/blob/main/src/datasets/utils/resources/tasks.json (tasks.json was generated with this script), which should add depth-estimation.
Thanks @osanseviero.
Closing the issue as the dataset has been successfully added: https://huggingface.co/datasets/sayakpaul/nyu_depth_v2
**Name:** NYUDepth
**Paper:** http://cs.nyu.edu/~silberman/papers/indoor_seg_support.pdf
**Data:** https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html

**Motivation**
Depth estimation is an important problem in computer vision. We already have a couple of depth estimation models on the Hub as well:
It would be nice to have a dataset for depth estimation. These datasets usually contain three things: an input image, a depth map image, and a depth mask (a validity mask indicating whether the reading for a given pixel is valid or not). Since we already have semantic segmentation datasets on the Hub, I don't think we need any extended utilities to support this addition.
Having this dataset would also allow us to author data preprocessing guides for depth estimation, like the ones we have for other tasks (example).
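To illustrate the role of the validity mask: downstream metrics are computed over valid pixels only, with invalid readings excluded. A pure-Python sketch (nested lists stand in for arrays; the function name is mine):

```python
def masked_mean_depth(depth, mask):
    """Average depth over pixels whose mask value is 1;
    invalid readings (mask 0) are excluded from the mean."""
    valid = [
        d
        for depth_row, mask_row in zip(depth, mask)
        for d, m in zip(depth_row, mask_row)
        if m == 1
    ]
    return sum(valid) / len(valid) if valid else 0.0

mean_depth = masked_mean_depth(
    [[2.0, 4.0], [6.0, 0.0]],  # raw depth readings (meters)
    [[1, 1], [1, 0]],          # bottom-right reading is invalid
)
```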
Ccing @osanseviero @nateraw @NielsRogge
Happy to work on adding it.