sign-language-processing / datasets

TFDS data loaders for sign language datasets.
https://sign-language-processing.github.io/#existing-datasets

Jehovah Witness Sign Language Resources #29

AmitMY opened this issue 1 year ago

AmitMY commented 1 year ago

We should add resources from JW, like the bible.

bricksdont commented 1 year ago

@ShesterG this is something you could perhaps take on ;)

cleong110 commented 11 months ago

Chipping in here to provide a bit of assistance with Shester's dataset. Asked his permission to help!

It seems he got started with this at https://github.com/ShesterG/datasets/tree/shester/sign_language_datasets/datasets/jw_sign

cleong110 commented 11 months ago

Shester informs me that the annotations can be recreated via https://github.com/ShesterG/datasets/blob/shester/sign_language_datasets/datasets/jw_sign/create_index.py; it just takes a long time to run.

They've been precomputed/saved off; they just need to be hosted somewhere.

cleong110 commented 11 months ago

OK, for now the files are uploaded to https://drive.google.com/drive/folders/1QFmq5Byg0xTLgJ7sBdVuQBlgxrkzp9vV , thank you Shester! Now we need to...

cleong110 commented 11 months ago

One thing for us to consider: Google Drive will sometimes cause issues if too many people download the same files. See https://github.com/tensorflow/datasets/issues/1482

and

https://www.tensorflow.org/datasets/overview#manual_download_if_download_fails

cleong110 commented 11 months ago

OK, the following actually manages to download newindex.list.gz, but saves it with a weird name. However, when I manually rename it, I can open the file with 7-Zip and see it's the right file.

```python
import tensorflow_datasets as tfds

if __name__ == "__main__":
    ####################################
    # Try to download newindex.list.gz
    ####################################

    # The plain "share" link downloads a 0 MB empty file:
    google_drive_link_to_newindex = "https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link"

    # Extracting the ID from the link above and appending it to
    # "https://drive.google.com/uc?id=" downloads an actual file:
    google_drive_link_to_newindex_take2 = "https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf"

    dl_manager = tfds.download.DownloadManager(download_dir="./foo")
    downloaded_path = dl_manager.download(google_drive_link_to_newindex_take2)

    # Ends up printing "foo\ucid_1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk":
    # the ID from above, followed by "2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk",
    # which I don't recognize.
    print(downloaded_path)
```


cleong110 commented 11 months ago

The implication here is that the strategy of building URLs like "https://drive.google.com/uc?id=<FILE_ID>" seems to work.
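
A small helper capturing that observation (a sketch; the function name is mine, and it just rewrites share links into uc?id= form):

```python
import re


def drive_share_link_to_direct_url(share_link: str) -> str:
    """Convert a Google Drive 'share' link into a direct-download URL.

    Assumes share links of the form
    https://drive.google.com/file/d/<FILE_ID>/view?usp=...
    """
    match = re.search(r"/file/d/([^/]+)/", share_link)
    if match is None:
        raise ValueError(f"Could not find a file ID in: {share_link}")
    return f"https://drive.google.com/uc?id={match.group(1)}"


# Example with the newindex.list.gz link from above:
share = "https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link"
print(drive_share_link_to_direct_url(share))
# -> https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf
```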

cleong110 commented 11 months ago

OK, the next thing I want to figure out is how to actually download and load files.

DGS Corpus actually includes a "dgs.json" in GitHub, about 440 kB in size.

https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/dgs_corpus/dgs.json

When I open it up with Firefox, it looks like there are links to files in there. [screenshot of dgs.json in Firefox's JSON viewer]

This is most similar, I think, to our "newindex.list.gz", in that there's a list of unique data items, with URL links to videos.
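
To compare the two, a small sketch for peeking at dgs.json (the exact schema is whatever dgs_corpus/create_index.py wrote; this just prints one entry):

```python
import json

# Peek at the DGS index to compare its structure with newindex.list.gz.
with open("dgs.json", "r", encoding="utf-8") as f:
    dgs_index = json.load(f)

# Handle either a list of items or a dict keyed by ID.
entries = dgs_index if isinstance(dgs_index, list) else list(dgs_index.items())
print(f"{len(entries)} entries; first one:")
print(entries[0])
```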

cleong110 commented 11 months ago

Here are my notes on newindex.list.gz (Drive link):

Keys: ['video_url', 'video_name', 'verse_lang', 'verse_name', 'verse_start', 'verse_end', 'duration', 'verse_unique', 'verseID']

First 10:

{'video_url': 'https://download-a.akamaihd.net/files/media_publication/f3/nwt_01_Ge_ALS_03_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_03_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 3:15', 'verse_start': '0.000000', 'verse_end': '31.198000', 'duration': 31.198, 'verse_unique': 'ALS Zan. 3:15', 'verseID': 'v1003015'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/59/nwt_01_Ge_ALS_39_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_39_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 39:2', 'verse_start': '0.000000', 'verse_end': '26.760000', 'duration': 26.76, 'verse_unique': 'ALS Zan. 39:2', 'verseID': 'v1039002'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/59/nwt_01_Ge_ALS_39_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_39_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 39:3', 'verse_start': '26.760000', 'verse_end': '47.848000', 'duration': 21.087999999999997, 'verse_unique': 'ALS Zan. 39:3', 'verseID': 'v1039003'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/10/nwt_03_Le_ALS_19_r720P.mp4', 'video_name': 'nwt_03_Le_ALS_19_r720P', 'verse_lang': 'ALS', 'verse_name': 'Lev. 19:18', 'verse_start': '0.000000', 'verse_end': '32.399000', 'duration': 32.399, 'verse_unique': 'ALS Lev. 19:18', 'verseID': 'v3019018'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/0c/nwt_03_Le_ALS_25_r720P.mp4', 'video_name': 'nwt_03_Le_ALS_25_r720P', 'verse_lang': 'ALS', 'verse_name': 'Lev. 25:10', 'verse_start': '0.000000', 'verse_end': '8.320000', 'duration': 8.32, 'verse_unique': 'ALS Lev. 25:10', 'verseID': 'v3025010'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/64/nwt_05_De_ALS_06_r720P.mp4', 'video_name': 'nwt_05_De_ALS_06_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 6:6', 'verse_start': '0.000000', 'verse_end': '7.341000', 'duration': 7.341, 'verse_unique': 'ALS Ligj. 6:6', 'verseID': 'v5006006'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/64/nwt_05_De_ALS_06_r720P.mp4', 'video_name': 'nwt_05_De_ALS_06_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 6:7', 'verse_start': '7.341000', 'verse_end': '24.024000', 'duration': 16.683, 'verse_unique': 'ALS Ligj. 6:7', 'verseID': 'v5006007'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/3d/nwt_05_De_ALS_10_r720P.mp4', 'video_name': 'nwt_05_De_ALS_10_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 10:20', 'verse_start': '0.000000', 'verse_end': '10.644000', 'duration': 10.644, 'verse_unique': 'ALS Ligj. 10:20', 'verseID': 'v5010020'}       
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/34/nwt_05_De_ALS_32_r720P.mp4', 'video_name': 'nwt_05_De_ALS_32_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 32:4', 'verse_start': '0.000000', 'verse_end': '43.844000', 'duration': 43.844, 'verse_unique': 'ALS Ligj. 32:4', 'verseID': 'v5032004'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/1e/nwt_09_1Sa_ALS_01_r720P.mp4', 'video_name': 'nwt_09_1Sa_ALS_01_r720P', 'verse_lang': 'ALS', 'verse_name': '1 Sam. 1:15', 'verse_start': '0.000000', 'verse_end': '23.557000', 'duration': 23.557, 'verse_unique': 'ALS 1 Sam. 1:15', 'verseID': 'v9001015'}

The file is a pickled list of dicts, compressed with gzip.

In compressed form it is about 19,000 KB, or about 19 MB; decompressed, it's closer to 100 MB.
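
Since it's just gzip plus pickle, loading it takes a couple of lines (a sketch, assuming the file sits in the working directory):

```python
import gzip
import pickle

# newindex.list.gz: a gzip-compressed pickled list of dicts with the
# keys listed above (video_url, verse_name, verse_start, ...).
with gzip.open("newindex.list.gz", "rb") as f:
    index = pickle.load(f)

print(f"{len(index)} entries")
for entry in index[:3]:
    print(entry["verse_unique"], entry["video_url"])
```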

cleong110 commented 10 months ago

(Side note: investigate Parquet data format?)

cleong110 commented 10 months ago

(or Arrow?)
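
Either could work. For instance, a hypothetical conversion with pandas (assuming pyarrow, pandas' default Parquet engine, is installed):

```python
import gzip
import pickle

import pandas as pd

# Load the existing index (see the loader sketch above)...
with gzip.open("newindex.list.gz", "rb") as f:
    index = pickle.load(f)

# ...and write it as Parquet: columnar, compressed, no pickle needed.
pd.DataFrame(index).to_parquet("newindex.parquet")

# Reading back is one line, and columns can be loaded selectively:
verses = pd.read_parquet("newindex.parquet", columns=["verse_unique", "video_url"])
```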

cleong110 commented 10 months ago

Another TODO at some point: Upload files to a better hosting platform, e.g. Zenodo or Zindi, to prevent issues with "too many downloads".

cleong110 commented 10 months ago

What are the numbers in DGS? Unique IDs? Should we generate some for our dataset? [screenshot: numeric IDs in dgs.json]

cleong110 commented 10 months ago

JSON for DGS is parsed here: https://github.com/sign-language-processing/datasets/blob/e864f36ddc452587a80f7622630a0871cd406a0d/sign_language_datasets/datasets/dgs_corpus/dgs_corpus.py#L278

cleong110 commented 10 months ago

And the JSON is created here: https://github.com/sign-language-processing/datasets/blob/e864f36ddc452587a80f7622630a0871cd406a0d/sign_language_datasets/datasets/dgs_corpus/create_index.py#L17, which calls the numbers "tr_id"

cleong110 commented 10 months ago

Ah... "transcript ID". And they're not generated in the Python code; they're parsed from the source page via regex.
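
Purely for illustration, the shape of that approach looks like this; the HTML snippet and pattern below are made up, not the actual ones from create_index.py:

```python
import re

# Made-up example HTML; the real page and pattern live in
# dgs_corpus/create_index.py.
html = '<a href="transcript.html?tr_id=1413451">Transcript</a>'
tr_ids = re.findall(r"tr_id=(\d+)", html)
print(tr_ids)  # ['1413451']
```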

cleong110 commented 10 months ago

https://github.com/tensorflow/datasets/blob/master/docs/add_dataset.md is a helpful guide for adding TFDS datasets.

Also, apparently tfds.testing is a thing, used for example in dgs_corpus_test.py: https://tensorflow.google.cn/datasets/api_docs/python/tfds/testing
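
For context, the tfds.testing pattern looks roughly like this; the module path and class name below are placeholders for whatever the JWSign builder ends up being called:

```python
import tensorflow_datasets as tfds

# Placeholder import; the real builder module/class don't exist yet.
from sign_language_datasets.datasets.jw_sign import jw_sign


class JwSignTest(tfds.testing.DatasetBuilderTestCase):
    DATASET_CLASS = jw_sign.JwSign
    SPLITS = {"train": 3}  # expected number of fake examples


if __name__ == "__main__":
    tfds.testing.test_main()
```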

cleong110 commented 10 months ago

Actually, the helpful guide above is the source for https://tensorflow.google.cn/datasets/add_dataset?hl=en.

cleong110 commented 10 months ago

Went and figured out how the index was created, and pushed an updated version of the create_index.py https://github.com/ShesterG/datasets/pull/1

Now that I know the data a bit better, gonna move on to filling out the Builder class for JWSign

cleong110 commented 9 months ago

From the presented slides for JWSign, this is what we're going for: [slide from the JWSign presentation]

cleong110 commented 9 months ago

Not being familiar with tfds or sign_language_datasets, I am attempting to go with a "get basic functionality working and then test it" approach. But then I ran into the issue of not knowing how to test a dataset locally. #53 documents part of this, but the basic guide to testing is:

  1. Make sure you install from source.
  2. pip install pytest pytest-cov dill to get the testing dependencies.
  3. Run pytest . in whatever folder you want to run tests for, including the top level.

Of course the next question is how to make tests!

cleong110 commented 9 months ago

OK, so even if you just follow https://tensorflow.google.cn/datasets/add_dataset?hl=en#test_your_dataset and add nothing, you'll still get some basic unit tests: [screenshot of the default tests passing]

cleong110 commented 9 months ago

OK, testing procedure:

```bash
conda create -n sign_language_datasets_source pip python=3.10  # with 3.11 on Windows there's no compatible tensorflow
conda activate sign_language_datasets_source
# navigate to the repo
git pull  # make sure it's up to date
python -m pip install .  # "python -m pip" ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .
```
cleong110 commented 9 months ago

All right, after messing around with #56, it seems that by deleting this file I am then able to run pytest.

I had also run into another weird issue in #57, where pytest hit an error while telling me what a different error was.

Now I can finally proceed with JW Sign dataset some more. Let's see if I can make a version which at least downloads the spoken-language text, and maybe make a much simplified index for testing purposes.
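
As a rough, untested sketch of where that could go, assuming per-language JSON files of the {"spoken_language": ..., "data": {verseID: text}} shape hosted on Drive; the class name, features, and URL placeholder are all mine, not the actual JWSign builder:

```python
import json

import tensorflow_datasets as tfds

# Hypothetical mapping of language code to direct-download URL;
# <FILE_ID_DE> is a placeholder, not a real Drive ID.
_TEXT_URLS = {
    "de": "https://drive.google.com/uc?id=<FILE_ID_DE>",
}


class JwSign(tfds.core.GeneratorBasedBuilder):
    """Text-only first pass at a JWSign builder (sketch)."""

    VERSION = tfds.core.Version("0.1.0")

    def _info(self) -> tfds.core.DatasetInfo:
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                "verse_id": tfds.features.Text(),
                "language": tfds.features.Text(),
                "text": tfds.features.Text(),
            }),
        )

    def _split_generators(self, dl_manager: tfds.download.DownloadManager):
        # download() accepts a dict and returns local paths in the same shape.
        paths = dl_manager.download(_TEXT_URLS)
        return {"train": self._generate_examples(paths)}

    def _generate_examples(self, paths):
        for lang, path in paths.items():
            with open(path, "r", encoding="utf-8") as f:
                payload = json.load(f)
            for verse_id, text in payload["data"].items():
                yield f"{lang}/{verse_id}", {
                    "verse_id": verse_id,
                    "language": payload["spoken_language"],
                    "text": text,
                }
```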

cleong110 commented 9 months ago

In order to iterate/test the dataset I will need to:

```bash
# in the top-level directory of the repo, with __init__.py removed
# make some change to the builder script
pytest ./sign_language_datasets/datasets/new_dataset/
# pip install . is not actually necessary; you can simply run the test
```
cleong110 commented 9 months ago

OK, I did

```bash
# navigate to sign_language_datasets/datasets/
tfds new new_dataset  # create a new directory
```

And then I repeatedly edited and re-ran pytest, using the rwth_phoenix2014_t code as a base, until the VideoTest passed. Excellent.

cleong110 commented 9 months ago

OK, I'm starting from scratch. I made a new fork, this time forking off of https://github.com/sign-language-processing/datasets so that I'm up to date.

cleong110 commented 9 months ago

https://github.com/cleong110/datasets/tree/jw_sign

cleong110 commented 9 months ago

I want to see if I can make a completely basic text-only dataset to start.

cleong110 commented 9 months ago

Apparently Google Drive doesn't play nice. When I try to use tfds' download_and_extract method on the text51.dict.gz file, I get an .html file instead.

It turns out Google likes to pop up a "can't scan this for viruses" page, and that's what gets downloaded.

The gdown library works, but it doesn't play nicely with tfds.

Here's my Colab notebook playing with it: https://colab.research.google.com/drive/1EMKnpKrDUHxq5COFM6Acm7PqAQmkdvTS?usp=sharing
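
For reference, a minimal sketch of the gdown path (the file ID below is a placeholder, not the real one):

```python
import gdown

# gdown handles Google Drive's "can't scan this for viruses"
# interstitial, unlike a plain tfds download.
url = "https://drive.google.com/uc?id=<FILE_ID>"  # placeholder ID
gdown.download(url, "text51.dict.gz", quiet=False)
```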

cleong110 commented 9 months ago

Workaround: split the text dict into one file per spoken language.
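
A sketch of that split, assuming text51.dict.gz is a gzip-pickled dict of {language_code: {verseID: verse_text}}; the structure is inferred from the per-language JSON shown a few comments below, so treat the names here as assumptions:

```python
import gzip
import json
import pickle

# Assumed structure: {language_code: {verseID: verse_text}}.
with gzip.open("text51.dict.gz", "rb") as f:
    texts = pickle.load(f)

# Write one JSON file per spoken language, keeping the language
# code inside each file so it stays self-describing.
for lang, verses in texts.items():
    with open(f"text_{lang}.json", "w", encoding="utf-8") as out:
        json.dump({"spoken_language": lang, "data": verses}, out)
```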

cleong110 commented 9 months ago

To get the links for all 51 .json files:

Go to the folder: https://drive.google.com/drive/folders/1r-ftcljPRm1kLasqCK_cYL9zc4mxE6_o

Select all the files (you have to scroll a bit, because it only shows 50 by default).

Right-click -> Share -> Copy links:

https://drive.google.com/file/d/122Fs-e5O9SPELpE83FohdXky9r1QIpDB/view?usp=drive_link, https://drive.google.com/file/d/1-KptAhyZCfnxGG4OMOVHXfgFqfjhmduu/view?usp=drive_link, https://drive.google.com/file/d/1-QQKmrW0iI9lBLxihtnxOgT_xAQSWCt7/view?usp=drive_link, https://drive.google.com/file/d/12Ahgjl3wbho9ShwlGVdtUx-uDnGacPFn/view?usp=drive_link, https://drive.google.com/file/d/1DMTgyGq9td8XpWqMeez7Hv1Cy-UyIAmd/view?usp=drive_link, https://drive.google.com/file/d/1Dze5_WZyXkAq8gca9eiPnofDTbXiFCYy/view?usp=drive_link, https://drive.google.com/file/d/1FXZF2SMv4GZ4visJrUwxY9vnHIs2Gzli/view?usp=drive_link, https://drive.google.com/file/d/1PyUgjqq6gt5kf4Pa4fRJok8b5eech2An/view?usp=drive_link, https://drive.google.com/file/d/1jOInwyi6XAAkwJ5iMLc90NI9W8MlbARF/view?usp=drive_link, https://drive.google.com/file/d/10peSKPs99feSfsyWmkSmaONM8H2pGWEX/view?usp=drive_link, https://drive.google.com/file/d/1NoCcaKy_BPrlboXP5kx4_iP3DrOq0ITq/view?usp=drive_link, https://drive.google.com/file/d/1dNnGpaoMGR4IPhWuH-TIobDyeL3tcdHx/view?usp=drive_link, https://drive.google.com/file/d/1dqO8p1tD_UUXUw3pDMCYwOIAS2niKHgm/view?usp=drive_link, https://drive.google.com/file/d/1iwF3OKvo4WmWvlqXjoDqZ1j41C4qthjA/view?usp=drive_link, https://drive.google.com/file/d/1tXHY2m4-P_jD7I4xBvkbB3lUX8FXVNmi/view?usp=drive_link, https://drive.google.com/file/d/139UErDv_QeaAmm5n7l5IqsktC5b3hO5A/view?usp=drive_link, https://drive.google.com/file/d/1Z5iTuQGTl15oh_xm9cSrtJkkmfu9s7qz/view?usp=drive_link, https://drive.google.com/file/d/1kccYftVcapjNLXxE-VYZpIOVNcRoa2mI/view?usp=drive_link, https://drive.google.com/file/d/1r8ao3bUf4xcsyTqJp0AQBiM29Y2wXBcO/view?usp=drive_link, https://drive.google.com/file/d/1rY6VjXhQXL330uNpxmekrpi_xs3T2JeK/view?usp=drive_link, https://drive.google.com/file/d/1tOMJNzNYo-Bpo6rxDZW94tpBnH9lZadv/view?usp=drive_link, https://drive.google.com/file/d/13-C5Z3YFjEE4dpstt3hDbNORiGgB4BgP/view?usp=drive_link, https://drive.google.com/file/d/19-zA-4dsfB-LZcDWXiOKNnniNuEHCZh2/view?usp=drive_link, https://drive.google.com/file/d/1GR3A6NXnsoItIwaxQvflCXufhLV-xCz2/view?usp=drive_link, https://drive.google.com/file/d/1KApiflPkVm6Jn0sGw2OyRT__VAec_bFd/view?usp=drive_link, https://drive.google.com/file/d/1NibQFTL0gGUL9NYnYFjlk_uCIZSM-RqA/view?usp=drive_link, https://drive.google.com/file/d/1SINxYL1u2T-dG79TjQmj2AZsfQB8rTaC/view?usp=drive_link, https://drive.google.com/file/d/1wE9Po5-nrr9PS-xdT_F8WK-kDyYCDZAp/view?usp=drive_link, https://drive.google.com/file/d/1Df5j9YsEMdvNx9NR7zl1gE58mnIZkS06/view?usp=drive_link, https://drive.google.com/file/d/1K8HCwsEtdKba248wPxbZJcRDyP4NlfLp/view?usp=drive_link, https://drive.google.com/file/d/1hT0dqllsIUL5G6AKP_vA1bkzir1ZYT0q/view?usp=drive_link, https://drive.google.com/file/d/1mZLUo9k8VTRyPdvrJEduUxMQv4Cxdotk/view?usp=drive_link, https://drive.google.com/file/d/1sioyZKRvTfujJ0aeJYPOYAEXZz5pIpTp/view?usp=drive_link, https://drive.google.com/file/d/1ygGnPbz4ssjwXNyZRmOGgkbcFfOzHu1b/view?usp=drive_link, https://drive.google.com/file/d/1LOsj8qvhmyRtfmaimELeBIiD4xl1vFeW/view?usp=drive_link, https://drive.google.com/file/d/1NjK0YIAowCv5uMv4yEcKdi-dU6LoVwD6/view?usp=drive_link, https://drive.google.com/file/d/1Nv8ecYBPdogebdtGT4HcAGdRxI_hf-yI/view?usp=drive_link, https://drive.google.com/file/d/1_6nC34lGBDRSZVAM5msW4Ol-BbyvoMcK/view?usp=drive_link, https://drive.google.com/file/d/1eTHQKEotMJm20BKe--CLQfBqUUvlFZpo/view?usp=drive_link, https://drive.google.com/file/d/1jizBtuPzBA8Bcy5IMs-EeF2_q41A68zr/view?usp=drive_link, 
https://drive.google.com/file/d/1sJhcz_mwCGafr9hi0aLQkr1U91_cq6Qx/view?usp=drive_link, https://drive.google.com/file/d/1EOaMjlUVy-hNGLqH3zLfGtxSVRN0X9O1/view?usp=drive_link, https://drive.google.com/file/d/1HPz-ZDjJeomlqNpxc5cWsEO4P7liqiiU/view?usp=drive_link, https://drive.google.com/file/d/1_3H2N92wLAEIqi9VF735KeSPGqQtJE2Q/view?usp=drive_link, https://drive.google.com/file/d/1_C-cqwzEJI89tNjLlSsvHQoDgnBMELGr/view?usp=drive_link, https://drive.google.com/file/d/1gSlQrYvfB1m26npbNRYXP14idcn-_2aA/view?usp=drive_link, https://drive.google.com/file/d/1rrrn73YFhC4yjUwbPKxcSzaWQ8FzKXY2/view?usp=drive_link, https://drive.google.com/file/d/1z_KYeV2u0KgNOjZ_9ARnp4PnF_bJFkf5/view?usp=drive_link, https://drive.google.com/file/d/13wKXt6R4h_trTlDVtYXZRRPOFJ9omSnz/view?usp=drive_link, https://drive.google.com/file/d/1BQaj9_RC_lnsc3kJVSFKWw2o_hWp_X4c/view?usp=drive_link, https://drive.google.com/file/d/1P5fzxscp5uoq3AtxKthu29D7BYfV4nKe/view?usp=drive_link
cleong110 commented 9 months ago

And of course I can split those one by one and get each link into a format that tfds can download...

...except, how do I re-associate the filename with the link?

cleong110 commented 9 months ago

I suppose I could just add the key back in? Then I do still have to download all 51 files, but at least the relevant info will still be inside each one.


{"spoken_language": "de", 
"data": {"v1001001": "1 \u00a0Am Anfang erschuf Gott Himmel und Erde.+", "v1001002": "2\u00a0\u00a0Die Erde nun war formlos und \u00f6de*. \u00dcber dem tief
}
cleong110 commented 9 months ago

OK, going with that for now; we can compress them later. I just want to get something running.

cleong110 commented 9 months ago

With a bit of munging I was able to download all the files, read the language code out of each one, and then create a spoken_lang_text_file_download_urls.json dictionary mapping each language to its download URL, which I saved to a .json file.
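
Something like this sketch, hypothetically (the downloads list is a stand-in for however the 51 files actually got fetched):

```python
import json

# Hypothetical: pair each (already-converted) uc?id= URL with its
# downloaded local path.
downloads = [
    ("https://drive.google.com/uc?id=<FILE_ID>", "text_de.json"),  # placeholder
]

url_mapping = {}
for url, path in downloads:
    # Read the language code back out of each file...
    with open(path, "r", encoding="utf-8") as f:
        lang = json.load(f)["spoken_language"]
    # ...and map it to the file's download URL.
    url_mapping[lang] = url

with open("spoken_lang_text_file_download_urls.json", "w", encoding="utf-8") as f:
    json.dump(url_mapping, f, indent=2)
```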

cleong110 commented 9 months ago

Gonna have to call it for today, but I added some notes to jw_sign.py for next time.

cleong110 commented 9 months ago

TODO: code to generate the .json files containing the text for each spoken language, on demand. Those need to be re-scraped each time.