sign-language-processing / datasets

TFDS data loaders for sign language datasets.
https://sign-language-processing.github.io/#existing-datasets

Requesting permission to add Indian Sign Language dataset #75

Open professorcode1 opened 2 months ago

professorcode1 commented 2 months ago

The Indian Sign Language Research and Training Center has 11993 very high quality videos, each labeled with the word it corresponds to. I would like to add this dataset to the repository.

The videos can be accessed either through their website (https://divyangjan.depwd.gov.in/islrtc/) or through this Google Drive link (https://drive.google.com/drive/folders/1U-Pr4r1-cupgNOOq9NH_uTsQnPSVEKco).

Challenges:

  1. The website shows the videos via YouTube embeds, so they cannot be downloaded from the website itself, which means the Drive folder will have to be used.
  2. The 11993 videos total 120 GB of data.
  3. Regular Google Drive downloaders that don't use the Google Drive API can only download the first 100 files of a folder. To access the Google Drive API you need to register a project with Google's developer tools.

Proposed solution:

The user will have 2 options:

  1. Provide their API keys if they wish to access the entire dataset.
  2. Just use 100 videos per letter of the alphabet (that's still 2600 videos and ~26 GB of data).
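The size estimate for the second option can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes one Drive folder per letter with at least 100 videos in each, and that all videos are close to the average size. The constant names are made up for illustration.

```python
# Figures quoted in the issue.
TOTAL_VIDEOS = 11993
TOTAL_GB = 120

# Assumptions: 26 letter folders, each capped at the first 100 files.
LETTERS = 26
FILES_PER_FOLDER = 100

avg_mb_per_video = TOTAL_GB * 1000 / TOTAL_VIDEOS
subset_videos = LETTERS * FILES_PER_FOLDER
subset_gb = subset_videos * avg_mb_per_video / 1000

print(f"average size: {avg_mb_per_video:.1f} MB per video")
print(f"subset: {subset_videos} videos, about {subset_gb:.0f} GB")
```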

In either case the loader will not synchronously download the entire dataset, since Drive download speeds tend to be limited and making too many requests too quickly can get your API keys banned. Rather, it will maintain a buffer of videos (say 100 videos), and once the consumer has read enough of them (say 66%), it will asynchronously dispatch a request to fetch more.
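A minimal sketch of that buffering idea, assuming the download step can be wrapped in a callable. Here `fetch_batch(start, count)` is a hypothetical function (not part of this repository or of any Drive client) that returns up to `count` items starting at index `start`, and `buffered_videos` is an illustrative name.

```python
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Iterator, List, Optional


def buffered_videos(
    fetch_batch: Callable[[int, int], List[str]],  # hypothetical downloader
    buffer_size: int = 100,
    refill_fraction: float = 0.66,
) -> Iterator[str]:
    """Yield items in order, topping up the buffer in a background thread."""
    buffer = deque(fetch_batch(0, buffer_size))  # the first fill is synchronous
    next_index = len(buffer)
    exhausted = len(buffer) < buffer_size
    consumed = 0    # items yielded since the last refill request
    requested = 0   # size of the refill currently in flight
    pending: Optional[Future] = None

    with ThreadPoolExecutor(max_workers=1) as pool:
        while True:
            # Collect a finished refill; block only if the buffer ran dry.
            if pending is not None and (pending.done() or not buffer):
                batch = pending.result()
                pending = None
                buffer.extend(batch)
                next_index += len(batch)
                exhausted = len(batch) < requested

            # Ask for more once enough of the buffer has been consumed.
            threshold_hit = consumed >= refill_fraction * buffer_size
            if pending is None and not exhausted and (threshold_hit or not buffer):
                requested = buffer_size - len(buffer)
                pending = pool.submit(fetch_batch, next_index, requested)
                consumed = 0

            if buffer:
                consumed += 1
                yield buffer.popleft()
            elif pending is None:
                return  # nothing buffered, nothing in flight, source exhausted
```

Only one refill is in flight at a time, which keeps the request rate low; a real loader would likely also need retries and rate limiting.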

Please let me know your thoughts. Thanks!

AmitMY commented 2 months ago

Hi @professorcode1. My thoughts are as follows: perhaps, similar to MS-ASL, YouTube-ASL and YouTube-SL-25, there should be a base dataset called YouTube. Then, every implementation should specify the data (text, id, gloss (if any), signwriting (if any), and video link to YouTube). The base dataset will be in charge of downloading from YouTube directly.
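One way to picture this split, purely as an illustration in plain Python rather than the TFDS builder API: every name below (`YouTubeExample`, `YouTubeBaseDataset`, `IslrtcWords`, `download_video`) is hypothetical, and the URLs are placeholders, not real video links.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Iterator, Optional, Tuple


@dataclass
class YouTubeExample:
    """Metadata each implementation specifies for one video."""
    id: str
    text: str
    video_url: str
    gloss: Optional[str] = None        # only if the dataset has glosses
    signwriting: Optional[str] = None  # only if the dataset has SignWriting


class YouTubeBaseDataset(ABC):
    """Base class that owns downloading; subclasses only list metadata."""

    def __init__(self, download_video: Callable[[str], str]):
        # download_video maps a video URL to a local file path (assumed helper).
        self._download_video = download_video

    @abstractmethod
    def examples(self) -> Iterator[YouTubeExample]:
        """Yield the metadata records for this dataset."""

    def generate(self) -> Iterator[Tuple[YouTubeExample, str]]:
        # The base class, not the subclass, triggers each download.
        for example in self.examples():
            yield example, self._download_video(example.video_url)


class IslrtcWords(YouTubeBaseDataset):
    """Hypothetical subclass with placeholder records."""

    def examples(self) -> Iterator[YouTubeExample]:
        yield YouTubeExample("w1", "hello", "https://www.youtube.com/watch?v=PLACEHOLDER_1")
        yield YouTubeExample("w2", "thanks", "https://www.youtube.com/watch?v=PLACEHOLDER_2")
```

Under this layout a new dataset would only need to implement `examples()`, while download logic lives in one place.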

What do you think?

professorcode1 commented 2 months ago

Hey @AmitMY.

Please tell me what functionality the base YouTube dataset class should have. It might add unnecessary complexity if all it does is call a download_youtube function on behalf of its derived classes.

AmitMY commented 2 months ago

I can't tell you all the functionality, since I did not build it. I can just imagine that there needs to be a unified way to download videos from YouTube (using youtube-dl or something similar).
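As a rough sketch of what such a unified download helper might look like, the snippet below shells out to the `youtube-dl` command-line tool using its `--output` template option. The function names and the choice to name files by video id are assumptions for illustration, not the repository's existing code, and the tool must be installed separately.

```python
import subprocess
from typing import List


def youtube_dl_command(video_url: str, output_dir: str) -> List[str]:
    """Build the youtube-dl invocation; files are named after the video id."""
    output_template = f"{output_dir}/%(id)s.%(ext)s"
    return ["youtube-dl", "--output", output_template, video_url]


def download_youtube(video_url: str, output_dir: str) -> None:
    """Run the download; raises CalledProcessError if youtube-dl fails."""
    subprocess.run(youtube_dl_command(video_url, output_dir), check=True)
```

Keeping command construction separate from execution makes the helper easy to test without network access.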

You could start, make a PR, and I'll be happy to give feedback.