AmitMY opened this issue 1 year ago
@ShesterG this is something you could perhaps take on ;)
Chipping in here to provide a bit of assistance with Shester's dataset. Asked his permission to help!
It seems he got started with this at https://github.com/ShesterG/datasets/tree/shester/sign_language_datasets/datasets/jw_sign
Shester informs me that the annotations can be recreated via https://github.com/ShesterG/datasets/blob/shester/sign_language_datasets/datasets/jw_sign/create_index.py, it just takes a long time to run.
They've been precomputed/saved off, they just need to be hosted somewhere.
OK, for now the files are uploaded to https://drive.google.com/drive/folders/1QFmq5Byg0xTLgJ7sBdVuQBlgxrkzp9vV , thank you Shester! Now we need to...
One thing for us to consider: Google Drive will sometimes cause issues if too many people download the same files. See https://github.com/tensorflow/datasets/issues/1482 and https://www.tensorflow.org/datasets/overview#manual_download_if_download_fails
OK, the following actually manages to download newindex.list.gz, but saves it with a weird name. However, when I manually rename it, I can open the file with 7-Zip and see it's the right file.
import tensorflow_datasets as tfds

if __name__ == "__main__":
    ####################################
    # try to download newindex.list.gz
    ####################################

    # downloads a 0 MB empty file
    google_drive_link_to_newindex = "https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link"

    # extract the ID from above and append it to "https://drive.google.com/uc?id="
    # this downloads an actual file.
    google_drive_link_to_newindex_take2 = "https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf"

    dl_manager = tfds.download.DownloadManager(download_dir="./foo")
    extracted_path = dl_manager.download(google_drive_link_to_newindex_take2)

    # ends up printing "foo\ucid_1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk",
    # which is 1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf (the ID from above) followed by
    # "2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk", which I don't understand.
    print(extracted_path)
Implication here is that the strategy of doing URLs like "https://drive.google.com/uc?id=<FILE_ID>" (rather than the share links) is the way to get the tfds download manager to fetch these files.
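For reference, the link rewriting is just extracting the file ID; a minimal sketch (the helper name is mine, not anything in tfds):

```python
import re

def drive_share_link_to_direct_url(share_link: str) -> str:
    """Turn a Drive share link into a direct "uc?id=" URL (hypothetical helper)."""
    match = re.search(r"/file/d/([^/]+)", share_link)
    if match is None:
        raise ValueError(f"No Drive file ID found in {share_link!r}")
    return f"https://drive.google.com/uc?id={match.group(1)}"

# e.g. the newindex.list.gz share link from above:
print(drive_share_link_to_direct_url(
    "https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link"
))
# -> https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf
```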
OK, the next thing I want to figure out is how to actually download and load files.
DGS Corpus actually includes a "dgs.json" in github, about 440 kB in size.
When I open it up with Firefox, I can see the format; it looks like there are links to files in there.
This is most similar, I think, to our "newindex.list.gz", in that there's a list of unique data items, with URL links to videos.
Here are my notes on newindex.list.gz (drive link):
Keys: ['video_url', 'video_name', 'verse_lang', 'verse_name', 'verse_start', 'verse_end', 'duration', 'verse_unique', 'verseID']
First 10:
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/f3/nwt_01_Ge_ALS_03_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_03_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 3:15', 'verse_start': '0.000000', 'verse_end': '31.198000', 'duration': 31.198, 'verse_unique': 'ALS Zan. 3:15', 'verseID': 'v1003015'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/59/nwt_01_Ge_ALS_39_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_39_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 39:2', 'verse_start': '0.000000', 'verse_end': '26.760000', 'duration': 26.76, 'verse_unique': 'ALS Zan. 39:2', 'verseID': 'v1039002'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/59/nwt_01_Ge_ALS_39_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_39_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 39:3', 'verse_start': '26.760000', 'verse_end': '47.848000', 'duration': 21.087999999999997, 'verse_unique': 'ALS Zan. 39:3', 'verseID': 'v1039003'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/10/nwt_03_Le_ALS_19_r720P.mp4', 'video_name': 'nwt_03_Le_ALS_19_r720P', 'verse_lang': 'ALS', 'verse_name': 'Lev. 19:18', 'verse_start': '0.000000', 'verse_end': '32.399000', 'duration': 32.399, 'verse_unique': 'ALS Lev. 19:18', 'verseID': 'v3019018'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/0c/nwt_03_Le_ALS_25_r720P.mp4', 'video_name': 'nwt_03_Le_ALS_25_r720P', 'verse_lang': 'ALS', 'verse_name': 'Lev. 25:10', 'verse_start': '0.000000', 'verse_end': '8.320000', 'duration': 8.32, 'verse_unique': 'ALS Lev. 25:10', 'verseID': 'v3025010'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/64/nwt_05_De_ALS_06_r720P.mp4', 'video_name': 'nwt_05_De_ALS_06_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 6:6', 'verse_start': '0.000000', 'verse_end': '7.341000', 'duration': 7.341, 'verse_unique': 'ALS Ligj. 6:6', 'verseID': 'v5006006'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/64/nwt_05_De_ALS_06_r720P.mp4', 'video_name': 'nwt_05_De_ALS_06_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 6:7', 'verse_start': '7.341000', 'verse_end': '24.024000', 'duration': 16.683, 'verse_unique': 'ALS Ligj. 6:7', 'verseID': 'v5006007'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/3d/nwt_05_De_ALS_10_r720P.mp4', 'video_name': 'nwt_05_De_ALS_10_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 10:20', 'verse_start': '0.000000', 'verse_end': '10.644000', 'duration': 10.644, 'verse_unique': 'ALS Ligj. 10:20', 'verseID': 'v5010020'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/34/nwt_05_De_ALS_32_r720P.mp4', 'video_name': 'nwt_05_De_ALS_32_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 32:4', 'verse_start': '0.000000', 'verse_end': '43.844000', 'duration': 43.844, 'verse_unique': 'ALS Ligj. 32:4', 'verseID': 'v5032004'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/1e/nwt_09_1Sa_ALS_01_r720P.mp4', 'video_name': 'nwt_09_1Sa_ALS_01_r720P', 'verse_lang': 'ALS', 'verse_name': '1 Sam. 1:15', 'verse_start': '0.000000', 'verse_end': '23.557000', 'duration': 23.557, 'verse_unique': 'ALS 1 Sam. 1:15', 'verseID': 'v9001015'}
So newindex.list.gz is a pickled list, compressed with gzip. In compressed form it is about 19,000 KB, i.e. roughly 19 MB; decompressed it's closer to 100 MB.
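For context, loading it is simple; here's a sketch, assuming the file really is just a gzip-compressed pickle of dicts with the keys listed above:

```python
import gzip
import pickle
from collections import defaultdict

# Assumes newindex.list.gz has been downloaded to the current directory.
with gzip.open("newindex.list.gz", "rb") as f:
    index = pickle.load(f)  # a list of dicts like the entries shown above

print(len(index), "entries")

# Group by verseID to see which verses are available in multiple sign languages.
languages_per_verse = defaultdict(set)
for entry in index:
    languages_per_verse[entry["verseID"]].add(entry["verse_lang"])

print(languages_per_verse["v1003015"])  # e.g. {'ALS', ...}
```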
(Side note: investigate Parquet data format?)
Another TODO at some point: Upload files to a better hosting platform, e.g. Zenodo or Zindi, to prevent issues with "too many downloads".
What are the numbers in DGS? Unique IDs? Should we generate some for our dataset?
And the JSON is created here: https://github.com/sign-language-processing/datasets/blob/e864f36ddc452587a80f7622630a0871cd406a0d/sign_language_datasets/datasets/dgs_corpus/create_index.py#L17, which calls the numbers "tr_id"
Ah... "transcript ID". And they're not generated in the Python code, they're parsed from the source page via regex.
https://github.com/tensorflow/datasets/blob/master/docs/add_dataset.md is a helpful guide for adding TFDS datasets.
Also, apparently tfds.testing is a thing, used for example in dgs_corpus_test.py: https://tensorflow.google.cn/datasets/api_docs/python/tfds/testing
https://tensorflow.google.cn/datasets/add_dataset?hl=en (actually, the helpful guide above is the source for this page).
Went and figured out how the index was created, and pushed an updated version of the create_index.py https://github.com/ShesterG/datasets/pull/1
Now that I know the data a bit better, gonna move on to filling out the Builder class for JWSign
From the presented slides for JWSign, this is what we're going for
Not being familiar with tfds or sign_language_datasets, I am attempting to go with a "get basic functionality working and then test it" approach. But then I ran into the issue of not knowing how to test a dataset locally. #53 documents part of this, but the basic guide to testing is:
pip install pytest pytest-cov dill to get the testing deps
pytest . in whatever folder you want to run tests for, incl. the top-level.
Of course the next question is how to make tests!
OK, so even if you just follow https://tensorflow.google.cn/datasets/add_dataset?hl=en#test_your_dataset and add nothing, you'll still get some basic unit tests:
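For the record, the minimal test file from that guide looks roughly like this; a sketch only, since the jw_sign module / JWSign class names are my guess at what ours will be called, and SPLITS has to match whatever dummy data we put under dummy_data/:

```python
"""Minimal jw_sign dataset test, following the TFDS add_dataset guide."""
import tensorflow_datasets as tfds

from . import jw_sign  # assumed module name for our builder


class JWSignTest(tfds.testing.DatasetBuilderTestCase):
    """Runs the builder against tiny dummy data and checks the expected splits."""
    DATASET_CLASS = jw_sign.JWSign  # assumed class name
    SPLITS = {
        "train": 3,  # number of fake examples in dummy_data/
    }


if __name__ == "__main__":
    tfds.testing.test_main()
```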
OK, testing procedure:
conda create -n sign_language_datasets_source pip python=3.10 # if I do 3.11 on Windows then there's no compatible tensorflow
conda activate sign_language_datasets_source
# navigate to the repo
git pull # to make sure it's up to date
python -m pip install .  # "python -m pip" ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .
All right, after messing around with #56, it seems that by deleting this file I am then able to run pytest.
I had also run into another weird issue in #57, where pytest ran into an error while telling me what a different error was.
Now I can finally proceed with JW Sign dataset some more. Let's see if I can make a version which at least downloads the spoken-language text, and maybe make a much simplified index for testing purposes.
In order to iterate/test the dataset I will need to:
# in the top-level directory of the repo, with __init__.py removed
# make some change to the builder script
pytest ./sign_language_datasets/datasets/new_dataset/
# pip install . # not necessary actually, you can simply run the test
OK, I did
# navigate to sign_language_datasets/datasets/
tfds new new_dataset # create a new directory
And then I repeatedly edited and re-ran pytest, using the rwth_phoenix2014_t code as a base, until the VideoTest passed. Excellent.
OK, I'm starting from scratch. I made a new fork, this time forking off of https://github.com/sign-language-processing/datasets so that I'm up to date.
I want to see if I can make a completely basic text-only dataset to start.
Apparently Google Drive doesn't play nice. When I try to use tfds' download_and_extract method on the text51.dict.gz file, I get a .html instead.
Turns out Google likes to pop up a "can't scan this for viruses" message, and that's what gets downloaded.
The gdown library works, but then that doesn't play with tfds.
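For reference, the gdown call that works is roughly this (a sketch; the file ID is a placeholder, since I haven't pasted the text51.dict.gz ID here):

```python
import gdown

# Placeholder: substitute the actual Drive file ID for text51.dict.gz.
url = "https://drive.google.com/uc?id=<FILE_ID>"
gdown.download(url, output="text51.dict.gz", quiet=False)
```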
Here's my Colab notebook playing with it: https://colab.research.google.com/drive/1EMKnpKrDUHxq5COFM6Acm7PqAQmkdvTS?usp=sharing
Workaround: split the text dict into one for every spoken language
To get the links for all 51 .json files:
Go to the folder: https://drive.google.com/drive/folders/1r-ftcljPRm1kLasqCK_cYL9zc4mxE6_o
Select all the files (you have to scroll a bit because it only shows 50 by default)
Right-click -> Share -> Copy links
https://drive.google.com/file/d/122Fs-e5O9SPELpE83FohdXky9r1QIpDB/view?usp=drive_link, https://drive.google.com/file/d/1-KptAhyZCfnxGG4OMOVHXfgFqfjhmduu/view?usp=drive_link, https://drive.google.com/file/d/1-QQKmrW0iI9lBLxihtnxOgT_xAQSWCt7/view?usp=drive_link, https://drive.google.com/file/d/12Ahgjl3wbho9ShwlGVdtUx-uDnGacPFn/view?usp=drive_link, https://drive.google.com/file/d/1DMTgyGq9td8XpWqMeez7Hv1Cy-UyIAmd/view?usp=drive_link, https://drive.google.com/file/d/1Dze5_WZyXkAq8gca9eiPnofDTbXiFCYy/view?usp=drive_link, https://drive.google.com/file/d/1FXZF2SMv4GZ4visJrUwxY9vnHIs2Gzli/view?usp=drive_link, https://drive.google.com/file/d/1PyUgjqq6gt5kf4Pa4fRJok8b5eech2An/view?usp=drive_link, https://drive.google.com/file/d/1jOInwyi6XAAkwJ5iMLc90NI9W8MlbARF/view?usp=drive_link, https://drive.google.com/file/d/10peSKPs99feSfsyWmkSmaONM8H2pGWEX/view?usp=drive_link, https://drive.google.com/file/d/1NoCcaKy_BPrlboXP5kx4_iP3DrOq0ITq/view?usp=drive_link, https://drive.google.com/file/d/1dNnGpaoMGR4IPhWuH-TIobDyeL3tcdHx/view?usp=drive_link, https://drive.google.com/file/d/1dqO8p1tD_UUXUw3pDMCYwOIAS2niKHgm/view?usp=drive_link, https://drive.google.com/file/d/1iwF3OKvo4WmWvlqXjoDqZ1j41C4qthjA/view?usp=drive_link, https://drive.google.com/file/d/1tXHY2m4-P_jD7I4xBvkbB3lUX8FXVNmi/view?usp=drive_link, https://drive.google.com/file/d/139UErDv_QeaAmm5n7l5IqsktC5b3hO5A/view?usp=drive_link, https://drive.google.com/file/d/1Z5iTuQGTl15oh_xm9cSrtJkkmfu9s7qz/view?usp=drive_link, https://drive.google.com/file/d/1kccYftVcapjNLXxE-VYZpIOVNcRoa2mI/view?usp=drive_link, https://drive.google.com/file/d/1r8ao3bUf4xcsyTqJp0AQBiM29Y2wXBcO/view?usp=drive_link, https://drive.google.com/file/d/1rY6VjXhQXL330uNpxmekrpi_xs3T2JeK/view?usp=drive_link, https://drive.google.com/file/d/1tOMJNzNYo-Bpo6rxDZW94tpBnH9lZadv/view?usp=drive_link, https://drive.google.com/file/d/13-C5Z3YFjEE4dpstt3hDbNORiGgB4BgP/view?usp=drive_link, https://drive.google.com/file/d/19-zA-4dsfB-LZcDWXiOKNnniNuEHCZh2/view?usp=drive_link, https://drive.google.com/file/d/1GR3A6NXnsoItIwaxQvflCXufhLV-xCz2/view?usp=drive_link, https://drive.google.com/file/d/1KApiflPkVm6Jn0sGw2OyRT__VAec_bFd/view?usp=drive_link, https://drive.google.com/file/d/1NibQFTL0gGUL9NYnYFjlk_uCIZSM-RqA/view?usp=drive_link, https://drive.google.com/file/d/1SINxYL1u2T-dG79TjQmj2AZsfQB8rTaC/view?usp=drive_link, https://drive.google.com/file/d/1wE9Po5-nrr9PS-xdT_F8WK-kDyYCDZAp/view?usp=drive_link, https://drive.google.com/file/d/1Df5j9YsEMdvNx9NR7zl1gE58mnIZkS06/view?usp=drive_link, https://drive.google.com/file/d/1K8HCwsEtdKba248wPxbZJcRDyP4NlfLp/view?usp=drive_link, https://drive.google.com/file/d/1hT0dqllsIUL5G6AKP_vA1bkzir1ZYT0q/view?usp=drive_link, https://drive.google.com/file/d/1mZLUo9k8VTRyPdvrJEduUxMQv4Cxdotk/view?usp=drive_link, https://drive.google.com/file/d/1sioyZKRvTfujJ0aeJYPOYAEXZz5pIpTp/view?usp=drive_link, https://drive.google.com/file/d/1ygGnPbz4ssjwXNyZRmOGgkbcFfOzHu1b/view?usp=drive_link, https://drive.google.com/file/d/1LOsj8qvhmyRtfmaimELeBIiD4xl1vFeW/view?usp=drive_link, https://drive.google.com/file/d/1NjK0YIAowCv5uMv4yEcKdi-dU6LoVwD6/view?usp=drive_link, https://drive.google.com/file/d/1Nv8ecYBPdogebdtGT4HcAGdRxI_hf-yI/view?usp=drive_link, https://drive.google.com/file/d/1_6nC34lGBDRSZVAM5msW4Ol-BbyvoMcK/view?usp=drive_link, https://drive.google.com/file/d/1eTHQKEotMJm20BKe--CLQfBqUUvlFZpo/view?usp=drive_link, https://drive.google.com/file/d/1jizBtuPzBA8Bcy5IMs-EeF2_q41A68zr/view?usp=drive_link, 
https://drive.google.com/file/d/1sJhcz_mwCGafr9hi0aLQkr1U91_cq6Qx/view?usp=drive_link, https://drive.google.com/file/d/1EOaMjlUVy-hNGLqH3zLfGtxSVRN0X9O1/view?usp=drive_link, https://drive.google.com/file/d/1HPz-ZDjJeomlqNpxc5cWsEO4P7liqiiU/view?usp=drive_link, https://drive.google.com/file/d/1_3H2N92wLAEIqi9VF735KeSPGqQtJE2Q/view?usp=drive_link, https://drive.google.com/file/d/1_C-cqwzEJI89tNjLlSsvHQoDgnBMELGr/view?usp=drive_link, https://drive.google.com/file/d/1gSlQrYvfB1m26npbNRYXP14idcn-_2aA/view?usp=drive_link, https://drive.google.com/file/d/1rrrn73YFhC4yjUwbPKxcSzaWQ8FzKXY2/view?usp=drive_link, https://drive.google.com/file/d/1z_KYeV2u0KgNOjZ_9ARnp4PnF_bJFkf5/view?usp=drive_link, https://drive.google.com/file/d/13wKXt6R4h_trTlDVtYXZRRPOFJ9omSnz/view?usp=drive_link, https://drive.google.com/file/d/1BQaj9_RC_lnsc3kJVSFKWw2o_hWp_X4c/view?usp=drive_link, https://drive.google.com/file/d/1P5fzxscp5uoq3AtxKthu29D7BYfV4nKe/view?usp=drive_link
And of course I can split those one by one and get the link in a format that tfds can download...
...except, how do I re-associate the filename with the link?
I suppose I could just add the key back in? Then I do still have to download all 51 files, but at least the relevant info will still be inside each one.
{"spoken_language": "de",
"data": {"v1001001": "1 \u00a0Am Anfang erschuf Gott Himmel und Erde.+", "v1001002": "2\u00a0\u00a0Die Erde nun war formlos und \u00f6de*. \u00dcber dem tief
}
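A sketch of that splitting step, under the assumption that text51.dict.gz is a gzip-pickled dict of {language_code: {verseID: verse_text}} (I haven't confirmed the exact structure):

```python
import gzip
import json
import pickle

# Assumption: text51.dict.gz maps spoken-language codes to {verseID: verse_text} dicts.
with gzip.open("text51.dict.gz", "rb") as f:
    text_by_language = pickle.load(f)

# One JSON per spoken language, keeping the language code inside the file
# so the info survives whatever renaming the download manager does.
for lang, verses in text_by_language.items():
    with open(f"{lang}.json", "w", encoding="utf-8") as out:
        json.dump({"spoken_language": lang, "data": verses}, out, ensure_ascii=False)
```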
OK, going with that for now; we can compress them later. I just want to get something running.
With a bit of munging I was able to download all the files, read the language code out of each one, and then create a spoken_lang_text_file_download_urls.json dictionary of download URLs, which I saved to a .json file.
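Roughly what that munging looked like, as a sketch: share_links stands in for the 51 links pasted above, gdown does the downloading, and I'm assuming each file carries its spoken_language key as in the example above:

```python
import json
import re

import gdown

# share_links: the 51 Google Drive share links pasted above.
share_links = [
    "https://drive.google.com/file/d/122Fs-e5O9SPELpE83FohdXky9r1QIpDB/view?usp=drive_link",
    # ... remaining links ...
]

url_for_language = {}
for link in share_links:
    file_id = re.search(r"/file/d/([^/]+)", link).group(1)
    direct_url = f"https://drive.google.com/uc?id={file_id}"
    path = gdown.download(direct_url, output=f"{file_id}.json", quiet=True)
    with open(path, "r", encoding="utf-8") as f:
        lang = json.load(f)["spoken_language"]  # recover the language code from inside the file
    url_for_language[lang] = direct_url

with open("spoken_lang_text_file_download_urls.json", "w", encoding="utf-8") as out:
    json.dump(url_for_language, out, indent=2)
```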
Gonna have to call it for today, but I added some notes to jw_sign.py for next time.
TODO: code to generate the .json files containing text for each spoken language, on demand. Those need to be re-scraped each time
We should add resources from JW, like the bible.