sign-language-processing / datasets

TFDS data loaders for sign language datasets.
https://sign-language-processing.github.io/#existing-datasets

Jehovah Witness Sign Language Resources #29

Open AmitMY opened 1 year ago

AmitMY commented 1 year ago

We should add resources from JW, like the Bible.

bricksdont commented 1 year ago

@ShesterG this is something you could perhaps take on ;)

cleong110 commented 7 months ago

Chipping in here to provide a bit of assistance with Shester's dataset. Asked his permission to help!

It seems he got started with this at https://github.com/ShesterG/datasets/tree/shester/sign_language_datasets/datasets/jw_sign

cleong110 commented 7 months ago

Shester informs me that the annotations can be recreated via https://github.com/ShesterG/datasets/blob/shester/sign_language_datasets/datasets/jw_sign/create_index.py; it just takes a long time to run.

They've been precomputed and saved off; they just need to be hosted somewhere.

cleong110 commented 6 months ago

OK, for now the files are uploaded to https://drive.google.com/drive/folders/1QFmq5Byg0xTLgJ7sBdVuQBlgxrkzp9vV, thank you Shester! Now we need to...

cleong110 commented 6 months ago

One thing for us to consider: Google Drive will sometimes cause issues if too many people download the same files. See https://github.com/tensorflow/datasets/issues/1482

and

https://www.tensorflow.org/datasets/overview#manual_download_if_download_fails

cleong110 commented 6 months ago

OK, the following actually manages to download newindex.list.gz, but saves it with a weird name. However, when I manually rename it, I can open the file with 7-Zip and see it's the right file.

import tensorflow_datasets as tfds

if __name__ == "__main__":
    ####################################
    # Try to download newindex.list.gz
    ####################################

    # The plain share link downloads a 0 MB empty file:
    google_drive_link_to_newindex = "https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link"

    # Extracting the ID from the link above and appending it to
    # "https://drive.google.com/uc?id=" downloads an actual file:
    google_drive_link_to_newindex_take2 = "https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf"

    dl_manager = tfds.download.DownloadManager(download_dir="./foo")
    downloaded_path = dl_manager.download(google_drive_link_to_newindex_take2)

    # Ends up printing "foo\ucid_1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk",
    # which is the ID from above followed by "2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk",
    # apparently a hash that TFDS appends to the sanitized URL.
    print(downloaded_path)

cleong110 commented 6 months ago

The implication here is that rewriting Google Drive share links into the "https://drive.google.com/uc?id=<ID>" form seems to work.
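
For reference, here's a small sketch of that rewrite; the helper name and regex are my own, not part of tfds:

import re

def drive_share_link_to_direct(url: str) -> str:
    """Rewrite a Google Drive share link into a direct-download URL (illustrative helper)."""
    match = re.search(r"/file/d/([\w-]+)", url)
    if match is None:
        raise ValueError(f"No Google Drive file ID found in {url}")
    return "https://drive.google.com/uc?id=" + match.group(1)

print(drive_share_link_to_direct("https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link"))
# https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf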

cleong110 commented 6 months ago

OK, the next thing I want to figure out is how to actually download and load files.

The DGS Corpus loader actually includes a "dgs.json" on GitHub, about 440 kB in size.

https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/dgs_corpus/dgs.json

When I open it up with Firefox, it looks like there are links to files in there.

This is most similar, I think, to our "newindex.list.gz", in that there's a list of unique data items, with URL links to videos.

cleong110 commented 6 months ago

Here are my notes on newindex.list.gz (Drive link):

Keys: ['video_url', 'video_name', 'verse_lang', 'verse_name', 'verse_start', 'verse_end', 'duration', 'verse_unique', 'verseID']

First 10:

{'video_url': 'https://download-a.akamaihd.net/files/media_publication/f3/nwt_01_Ge_ALS_03_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_03_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 3:15', 'verse_start': '0.000000', 'verse_end': '31.198000', 'duration': 31.198, 'verse_unique': 'ALS Zan. 3:15', 'verseID': 'v1003015'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/59/nwt_01_Ge_ALS_39_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_39_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 39:2', 'verse_start': '0.000000', 'verse_end': '26.760000', 'duration': 26.76, 'verse_unique': 'ALS Zan. 39:2', 'verseID': 'v1039002'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/59/nwt_01_Ge_ALS_39_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_39_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 39:3', 'verse_start': '26.760000', 'verse_end': '47.848000', 'duration': 21.087999999999997, 'verse_unique': 'ALS Zan. 39:3', 'verseID': 'v1039003'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/10/nwt_03_Le_ALS_19_r720P.mp4', 'video_name': 'nwt_03_Le_ALS_19_r720P', 'verse_lang': 'ALS', 'verse_name': 'Lev. 19:18', 'verse_start': '0.000000', 'verse_end': '32.399000', 'duration': 32.399, 'verse_unique': 'ALS Lev. 19:18', 'verseID': 'v3019018'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/0c/nwt_03_Le_ALS_25_r720P.mp4', 'video_name': 'nwt_03_Le_ALS_25_r720P', 'verse_lang': 'ALS', 'verse_name': 'Lev. 25:10', 'verse_start': '0.000000', 'verse_end': '8.320000', 'duration': 8.32, 'verse_unique': 'ALS Lev. 25:10', 'verseID': 'v3025010'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/64/nwt_05_De_ALS_06_r720P.mp4', 'video_name': 'nwt_05_De_ALS_06_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 6:6', 'verse_start': '0.000000', 'verse_end': '7.341000', 'duration': 7.341, 'verse_unique': 'ALS Ligj. 6:6', 'verseID': 'v5006006'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/64/nwt_05_De_ALS_06_r720P.mp4', 'video_name': 'nwt_05_De_ALS_06_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 6:7', 'verse_start': '7.341000', 'verse_end': '24.024000', 'duration': 16.683, 'verse_unique': 'ALS Ligj. 6:7', 'verseID': 'v5006007'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/3d/nwt_05_De_ALS_10_r720P.mp4', 'video_name': 'nwt_05_De_ALS_10_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 10:20', 'verse_start': '0.000000', 'verse_end': '10.644000', 'duration': 10.644, 'verse_unique': 'ALS Ligj. 10:20', 'verseID': 'v5010020'}       
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/34/nwt_05_De_ALS_32_r720P.mp4', 'video_name': 'nwt_05_De_ALS_32_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 32:4', 'verse_start': '0.000000', 'verse_end': '43.844000', 'duration': 43.844, 'verse_unique': 'ALS Ligj. 32:4', 'verseID': 'v5032004'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/1e/nwt_09_1Sa_ALS_01_r720P.mp4', 'video_name': 'nwt_09_1Sa_ALS_01_r720P', 'verse_lang': 'ALS', 'verse_name': '1 Sam. 1:15', 'verse_start': '0.000000', 'verse_end': '23.557000', 'duration': 23.557, 'verse_unique': 'ALS 1 Sam. 1:15', 'verseID': 'v9001015'}

The file itself is a pickled list, compressed with gzip.

Compressed, it's about 19 MB; decompressed, it's closer to 100 MB.
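
For anyone following along, a minimal sketch of loading it, assuming the gzipped-pickle format described above:

import gzip
import pickle

# Minimal sketch: load the gzipped, pickled verse index described above.
with gzip.open("newindex.list.gz", "rb") as f:
    verses = pickle.load(f)

print(len(verses), "entries")
print(verses[0]["verse_unique"], verses[0]["video_url"])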

cleong110 commented 6 months ago

(Side note: investigate Parquet data format?)

cleong110 commented 6 months ago

(or Arrow?)

cleong110 commented 6 months ago

Another TODO at some point: Upload files to a better hosting platform, e.g. Zenodo or Zindi, to prevent issues with "too many downloads".

cleong110 commented 6 months ago

What are the numbers in DGS? Unique IDs? Should we generate some for our dataset?

cleong110 commented 6 months ago

JSON for DGS is parsed here: https://github.com/sign-language-processing/datasets/blob/e864f36ddc452587a80f7622630a0871cd406a0d/sign_language_datasets/datasets/dgs_corpus/dgs_corpus.py#L278

cleong110 commented 6 months ago

And the JSON is created here: https://github.com/sign-language-processing/datasets/blob/e864f36ddc452587a80f7622630a0871cd406a0d/sign_language_datasets/datasets/dgs_corpus/create_index.py#L17, which calls the numbers "tr_id"

cleong110 commented 6 months ago

Ah... "transcript ID". And they're not generated in the Python code; they're parsed from the source page via regex.
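
Purely as a hypothetical illustration of that kind of scraping (the actual pattern lives in dgs_corpus/create_index.py and may look different):

import re

# Hypothetical: pull transcript IDs out of page HTML.
# The real regex in create_index.py may differ.
html = '<a href="transcript.html?tr_id=1413451">Transcript</a>'
tr_ids = re.findall(r"tr_id=(\d+)", html)
print(tr_ids)  # ['1413451']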

cleong110 commented 6 months ago

https://github.com/tensorflow/datasets/blob/master/docs/add_dataset.md is a helpful guide for TFDS datasets.

Also, apparently tfds.testing is a thing, used for example in dgs_corpus_test.py: https://tensorflow.google.cn/datasets/api_docs/python/tfds/testing

cleong110 commented 6 months ago

Actually, the helpful guide above is the source for this rendered page: https://tensorflow.google.cn/datasets/add_dataset?hl=en

cleong110 commented 6 months ago

Went and figured out how the index was created, and pushed an updated version of create_index.py: https://github.com/ShesterG/datasets/pull/1

Now that I know the data a bit better, gonna move on to filling out the Builder class for JWSign.

cleong110 commented 5 months ago

From the presented slides for JWSign, this is what we're going for. [slide screenshot]

cleong110 commented 5 months ago

Not being familiar with tfds or sign_language_datasets, I am attempting a "get basic functionality working and then test it" approach. But then I ran into the issue of not knowing how to test a dataset locally. #53 documents part of this, but the basic guide to testing is:

  1. Make sure you install from source.
  2. Run pip install pytest pytest-cov dill to get the testing dependencies.
  3. Run pytest . in whatever folder you want to run tests for, including the top level.

Of course the next question is how to make tests!

cleong110 commented 5 months ago

OK, so even if you just follow https://tensorflow.google.cn/datasets/add_dataset?hl=en#test_your_dataset and add nothing, you'll still get some basic unit tests.
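
To make that concrete, here's a minimal sketch of such a test file (placed next to the builder), modeled on that guide; JwSign, the module name, and the split counts are placeholders for whatever the builder ends up being:

"""Minimal sketch of a TFDS dataset test, modeled on the add_dataset guide."""
import tensorflow_datasets as tfds

from . import jw_sign  # placeholder module name for the JWSign builder


class JwSignTest(tfds.testing.DatasetBuilderTestCase):
    # Placeholder values; the real builder class and splits may differ.
    DATASET_CLASS = jw_sign.JwSign
    SPLITS = {"train": 3}  # number of fake examples in the dummy data


if __name__ == "__main__":
    tfds.testing.test_main()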

cleong110 commented 4 months ago

OK, testing procedure:

conda create -n sign_language_datasets_source pip python=3.10  # with 3.11 on Windows there's no compatible tensorflow
conda activate sign_language_datasets_source
# navigate to the repo
git pull  # make sure it's up to date
python -m pip install .  # "python -m pip" ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .

cleong110 commented 4 months ago

All right, after messing around with #56, it seems that by deleting this file I am then able to run pytest.

I had also run into another weird issue in #57, where pytest hit an error while telling me what a different error was.

Now I can finally proceed with JW Sign dataset some more. Let's see if I can make a version which at least downloads the spoken-language text, and maybe make a much simplified index for testing purposes.

cleong110 commented 4 months ago

In order to iterate/test the dataset I will need to:

# in the top-level directory of the repo, with __init__.py removed
# make some change to the builder script
pytest ./sign_language_datasets/datasets/new_dataset/
# pip install . is not actually necessary; you can simply run the test

cleong110 commented 4 months ago

OK, I did

# navigate to sign_language_datasets/datasets/
tfds new new_dataset # create a new directory

And then I repeatedly edited and re-ran pytest, using the rwth_phoenix2014_t code as a base, until the VideoTest passed. Excellent.

cleong110 commented 4 months ago

OK, I'm starting from scratch. I made a new fork, this time forking off of https://github.com/sign-language-processing/datasets so that I'm up to date.

cleong110 commented 4 months ago

https://github.com/cleong110/datasets/tree/jw_sign

cleong110 commented 4 months ago

I want to see if I can make a completely basic text-only dataset to start.

cleong110 commented 4 months ago

Apparently Google Drive doesn't play nice. When I try to use tfds' download_and_extract method on the text51.dict.gz file, I get a .html file instead.

Turns out Google likes to pop up a "can't scan this for viruses" message, and that's what gets downloaded.

The gdown library works, but that doesn't plug into tfds' download manager.

Here's my Colab notebook playing with it: https://colab.research.google.com/drive/1EMKnpKrDUHxq5COFM6Acm7PqAQmkdvTS?usp=sharing
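
For completeness, the gdown call that works is roughly this, assuming a recent gdown that accepts the id= keyword; the ID here is the newindex.list.gz one from earlier, and the output filename is my own choice:

import gdown

# Downloads a Drive file even when the "can't scan this for viruses" page
# would trip up a plain HTTP download. Bypasses the tfds download manager.
gdown.download(id="1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf", output="newindex.list.gz", quiet=False)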

cleong110 commented 4 months ago

Workaround: split the text dict into one file for every spoken language.

cleong110 commented 4 months ago

To get the links for all 51 .json files:

Go to the folder: https://drive.google.com/drive/folders/1r-ftcljPRm1kLasqCK_cYL9zc4mxE6_o

Select all the files; you have to scroll a bit, because it only shows 50 by default.

Right-click -> Share -> Copy links:

https://drive.google.com/file/d/122Fs-e5O9SPELpE83FohdXky9r1QIpDB/view?usp=drive_link, https://drive.google.com/file/d/1-KptAhyZCfnxGG4OMOVHXfgFqfjhmduu/view?usp=drive_link, https://drive.google.com/file/d/1-QQKmrW0iI9lBLxihtnxOgT_xAQSWCt7/view?usp=drive_link, https://drive.google.com/file/d/12Ahgjl3wbho9ShwlGVdtUx-uDnGacPFn/view?usp=drive_link, https://drive.google.com/file/d/1DMTgyGq9td8XpWqMeez7Hv1Cy-UyIAmd/view?usp=drive_link, https://drive.google.com/file/d/1Dze5_WZyXkAq8gca9eiPnofDTbXiFCYy/view?usp=drive_link, https://drive.google.com/file/d/1FXZF2SMv4GZ4visJrUwxY9vnHIs2Gzli/view?usp=drive_link, https://drive.google.com/file/d/1PyUgjqq6gt5kf4Pa4fRJok8b5eech2An/view?usp=drive_link, https://drive.google.com/file/d/1jOInwyi6XAAkwJ5iMLc90NI9W8MlbARF/view?usp=drive_link, https://drive.google.com/file/d/10peSKPs99feSfsyWmkSmaONM8H2pGWEX/view?usp=drive_link, https://drive.google.com/file/d/1NoCcaKy_BPrlboXP5kx4_iP3DrOq0ITq/view?usp=drive_link, https://drive.google.com/file/d/1dNnGpaoMGR4IPhWuH-TIobDyeL3tcdHx/view?usp=drive_link, https://drive.google.com/file/d/1dqO8p1tD_UUXUw3pDMCYwOIAS2niKHgm/view?usp=drive_link, https://drive.google.com/file/d/1iwF3OKvo4WmWvlqXjoDqZ1j41C4qthjA/view?usp=drive_link, https://drive.google.com/file/d/1tXHY2m4-P_jD7I4xBvkbB3lUX8FXVNmi/view?usp=drive_link, https://drive.google.com/file/d/139UErDv_QeaAmm5n7l5IqsktC5b3hO5A/view?usp=drive_link, https://drive.google.com/file/d/1Z5iTuQGTl15oh_xm9cSrtJkkmfu9s7qz/view?usp=drive_link, https://drive.google.com/file/d/1kccYftVcapjNLXxE-VYZpIOVNcRoa2mI/view?usp=drive_link, https://drive.google.com/file/d/1r8ao3bUf4xcsyTqJp0AQBiM29Y2wXBcO/view?usp=drive_link, https://drive.google.com/file/d/1rY6VjXhQXL330uNpxmekrpi_xs3T2JeK/view?usp=drive_link, https://drive.google.com/file/d/1tOMJNzNYo-Bpo6rxDZW94tpBnH9lZadv/view?usp=drive_link, https://drive.google.com/file/d/13-C5Z3YFjEE4dpstt3hDbNORiGgB4BgP/view?usp=drive_link, https://drive.google.com/file/d/19-zA-4dsfB-LZcDWXiOKNnniNuEHCZh2/view?usp=drive_link, https://drive.google.com/file/d/1GR3A6NXnsoItIwaxQvflCXufhLV-xCz2/view?usp=drive_link, https://drive.google.com/file/d/1KApiflPkVm6Jn0sGw2OyRT__VAec_bFd/view?usp=drive_link, https://drive.google.com/file/d/1NibQFTL0gGUL9NYnYFjlk_uCIZSM-RqA/view?usp=drive_link, https://drive.google.com/file/d/1SINxYL1u2T-dG79TjQmj2AZsfQB8rTaC/view?usp=drive_link, https://drive.google.com/file/d/1wE9Po5-nrr9PS-xdT_F8WK-kDyYCDZAp/view?usp=drive_link, https://drive.google.com/file/d/1Df5j9YsEMdvNx9NR7zl1gE58mnIZkS06/view?usp=drive_link, https://drive.google.com/file/d/1K8HCwsEtdKba248wPxbZJcRDyP4NlfLp/view?usp=drive_link, https://drive.google.com/file/d/1hT0dqllsIUL5G6AKP_vA1bkzir1ZYT0q/view?usp=drive_link, https://drive.google.com/file/d/1mZLUo9k8VTRyPdvrJEduUxMQv4Cxdotk/view?usp=drive_link, https://drive.google.com/file/d/1sioyZKRvTfujJ0aeJYPOYAEXZz5pIpTp/view?usp=drive_link, https://drive.google.com/file/d/1ygGnPbz4ssjwXNyZRmOGgkbcFfOzHu1b/view?usp=drive_link, https://drive.google.com/file/d/1LOsj8qvhmyRtfmaimELeBIiD4xl1vFeW/view?usp=drive_link, https://drive.google.com/file/d/1NjK0YIAowCv5uMv4yEcKdi-dU6LoVwD6/view?usp=drive_link, https://drive.google.com/file/d/1Nv8ecYBPdogebdtGT4HcAGdRxI_hf-yI/view?usp=drive_link, https://drive.google.com/file/d/1_6nC34lGBDRSZVAM5msW4Ol-BbyvoMcK/view?usp=drive_link, https://drive.google.com/file/d/1eTHQKEotMJm20BKe--CLQfBqUUvlFZpo/view?usp=drive_link, https://drive.google.com/file/d/1jizBtuPzBA8Bcy5IMs-EeF2_q41A68zr/view?usp=drive_link, 
https://drive.google.com/file/d/1sJhcz_mwCGafr9hi0aLQkr1U91_cq6Qx/view?usp=drive_link, https://drive.google.com/file/d/1EOaMjlUVy-hNGLqH3zLfGtxSVRN0X9O1/view?usp=drive_link, https://drive.google.com/file/d/1HPz-ZDjJeomlqNpxc5cWsEO4P7liqiiU/view?usp=drive_link, https://drive.google.com/file/d/1_3H2N92wLAEIqi9VF735KeSPGqQtJE2Q/view?usp=drive_link, https://drive.google.com/file/d/1_C-cqwzEJI89tNjLlSsvHQoDgnBMELGr/view?usp=drive_link, https://drive.google.com/file/d/1gSlQrYvfB1m26npbNRYXP14idcn-_2aA/view?usp=drive_link, https://drive.google.com/file/d/1rrrn73YFhC4yjUwbPKxcSzaWQ8FzKXY2/view?usp=drive_link, https://drive.google.com/file/d/1z_KYeV2u0KgNOjZ_9ARnp4PnF_bJFkf5/view?usp=drive_link, https://drive.google.com/file/d/13wKXt6R4h_trTlDVtYXZRRPOFJ9omSnz/view?usp=drive_link, https://drive.google.com/file/d/1BQaj9_RC_lnsc3kJVSFKWw2o_hWp_X4c/view?usp=drive_link, https://drive.google.com/file/d/1P5fzxscp5uoq3AtxKthu29D7BYfV4nKe/view?usp=drive_link

cleong110 commented 4 months ago

And of course I can split those one by one and get the link in a format that tfds can download...

...except, how do I re-associate the filename with the link?

cleong110 commented 4 months ago

I suppose I could just add the key back in? Then I do still have to download all 51 files, but at least the relevant info will still be inside each one.


{"spoken_language": "de", 
"data": {"v1001001": "1 \u00a0Am Anfang erschuf Gott Himmel und Erde.+", "v1001002": "2\u00a0\u00a0Die Erde nun war formlos und \u00f6de*. \u00dcber dem tief
}
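
A minimal sketch of reading one of those per-language files back; the filename is just an example, and the verse key comes from the snippet above:

import json

# Load one per-language text file in the format shown above.
with open("de.json", encoding="utf-8") as f:
    record = json.load(f)

print(record["spoken_language"])   # "de"
print(record["data"]["v1001001"])  # Genesis 1:1 text
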
cleong110 commented 4 months ago

OK, going with that for now; we can compress them later. I just want to get something running.

cleong110 commented 4 months ago

With a bit of munging I was able to download all the files, read the language code out of each one, and then create a dictionary of download URLs, which I saved to spoken_lang_text_file_download_urls.json.
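
Roughly what that looks like; the mapping values are placeholders, and the real IDs come from the Drive links above:

import json

# Hypothetical reconstruction: given a mapping from language code to Drive
# file ID (gathered by downloading each file and reading its
# "spoken_language" key), save a dict of direct-download URLs.
lang_to_file_id = {"de": "FILE_ID_DE", "es": "FILE_ID_ES"}  # placeholders

download_urls = {
    lang: f"https://drive.google.com/uc?id={file_id}"
    for lang, file_id in lang_to_file_id.items()
}

with open("spoken_lang_text_file_download_urls.json", "w", encoding="utf-8") as f:
    json.dump(download_urls, f, indent=2)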

cleong110 commented 4 months ago

Gonna have to call it for today, but I added some notes to jw_sign.py for next time.

cleong110 commented 4 months ago

TODO: code to generate the .json files containing text for each spoken language, on demand. Those need to be re-scraped each time.