sign-language-processing / sign-language-processing.github.io

Documentation and background of sign language processing

Document several datasets noticed during my reading: SignBD, Irish Sign Language, ArASL2018, BOBSL #27

Closed: cleong110 closed this issue 6 months ago

cleong110 commented 6 months ago

Been sifting through the literature and ran into some datasets.

SignBD, Irish Sign Language, ArASL2018, and especially BOBSL (which seems quite notable/large) could be added to the documentation.

Would just need to go to https://github.com/sign-language-processing/sign-language-processing.github.io/tree/master/src/datasets, add a JSON file for each, then submit a pull request.

I should be able to take this on quickly.

cleong110 commented 6 months ago

Hmmm, what're the possible fields here... not all the JSONs have the same ones...
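A quick way to see which fields occur and how often (a minimal sketch, assuming a local checkout so src/datasets resolves from the working directory):

import json
from collections import Counter
from pathlib import Path

field_counts = Counter()
for ds_path in Path("src/datasets").glob("*.json"):
    with open(ds_path) as f:
        # count each top-level key once per file
        field_counts.update(json.load(f).keys())
print(field_counts.most_common())

Fields whose count equals the number of files are effectively the required ones.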

cleong110 commented 6 months ago

The "pub" field describes the paper if there is one apparently, not sure where the key comes from, e.g. in https://github.com/sign-language-processing/sign-language-processing.github.io/blob/master/src/datasets/AUTSL.json

cleong110 commented 6 months ago

OK, if I wanted to do BOBSL for a start, I could take an existing one, say the AUTSL JSON, edit it, and I'd get something like:

{
  "pub": {
    "name": "BOBSL",
    "year": 2022,
    "publication": "dataset:momeniAutomaticDenseAnnotation2022",
    "url": "https://www.robots.ox.ac.uk/~vgg/data/bobsl/"
  },
  "features": [
    "video:RGB",
    "text:English"
  ],
  "loader": "",
  "language": "British Sign Language (BSL)",
  "#items": 1940,
  "#samples": "1.2M Sentences",
  "#signers": 37,
  "license": "non-commercial authorized academics",
  "licenseUrl": "https://www.bbc.co.uk/rd/projects/extol-dataset",
  "contact": "albanie[AT]robots.ox.ac.uk"
}
cleong110 commented 6 months ago

I'm not sure what to put for "loader", and the "dataset:momeniAutomaticDenseAnnotation2022" bit is a pure guess on my part.

cleong110 commented 6 months ago

Ah, looks like the key comes from references.bib. In that case, I'd need to add the following entry there:

@inproceedings{momeniAutomaticDenseAnnotation2022,
  title = {Automatic {{Dense Annotation}} of~{{Large-Vocabulary Sign Language Videos}}},
  booktitle = {Computer {{Vision}} -- {{ECCV}} 2022},
  author = {Momeni, Liliane and Bull, Hannah and Prajwal, K. R. and Albanie, Samuel and Varol, G{\"u}l and Zisserman, Andrew},
  editor = {Avidan, Shai and Brostow, Gabriel and Ciss{\'e}, Moustapha and Farinella, Giovanni Maria and Hassner, Tal},
  year = {2022},
  pages = {671--690},
  publisher = {Springer Nature Switzerland},
  address = {Cham},
  doi = {10.1007/978-3-031-19833-5_39},
  abstract = {Recently, sign language researchers have turned to sign language interpreted TV broadcasts, comprising (i) a video of continuous signing and (ii) subtitles corresponding to the audio content, as a readily available and large-scale source of training data. One key challenge in the usability of such data is the lack of sign annotations. Previous work exploiting such weakly-aligned data only found sparse correspondences between keywords in the subtitle and individual signs. In this work, we propose a simple, scalable framework to vastly increase the density of automatic annotations. Our contributions are the following: (1)~we significantly improve previous annotation methods by making use of synonyms and subtitle-signing alignment; (2)~we show the value of pseudo-labelling from a sign recognition model as a way of sign spotting; (3)~we propose a novel approach for increasing our annotations of known and unknown classes based on in-domain exemplars; (4)~on the BOBSL BSL sign language corpus, we increase the number of confident automatic annotations from 670K to 5M. We make these annotations publicly available to support the sign language research community.},
  isbn = {978-3-031-19833-5},
  langid = {english},
  keywords = {Automatic dataset construction,Novel class discovery,Sign language recognition}
}
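To sanity-check that the citation key referenced in the JSON actually exists, something like this would do. A minimal sketch; the src/references.bib path is my assumption about the repo layout:

import re
from pathlib import Path

def bib_keys(bib_path: str) -> set:
    # grab the token between each "@entrytype{" and the first comma
    text = Path(bib_path).read_text(encoding="utf-8")
    return set(re.findall(r"@\w+\{([^,\s]+),", text))

keys = bib_keys("src/references.bib")  # assumed path
print("momeniAutomaticDenseAnnotation2022" in keys)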
cleong110 commented 6 months ago

Ah, it seems "loader" is optional actually.

cleong110 commented 6 months ago

So the JSON can just look like:

{
  "pub": {
    "name": "BOBSL",
    "year": 2022,
    "publication": "dataset:momeniAutomaticDenseAnnotation2022",
    "url": "https://www.robots.ox.ac.uk/~vgg/data/bobsl/"
  },
  "features": [
    "video:RGB",
    "text:English"
  ],
  "language": "British Sign Language (BSL)",
  "#items": 1940,
  "#samples": "1.2M Sentences",
  "#signers": 37,
  "license": "non-commercial authorized academics",
  "licenseUrl": "https://www.bbc.co.uk/rd/projects/extol-dataset",
  "contact": "Samuel Albanie albanie[AT]robots.ox.ac.uk"
}
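(Running python -m json.tool BOBSL.json before committing is a cheap way to catch JSON syntax slips.)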
cleong110 commented 6 months ago

OK, I went and parsed all the JSON files to a schema with https://pypi.org/project/genson/, and it spat this out: datasets_schema.json

tl;dr not all JSONs have all fields, but the required ones (the ones common to every JSON) seem to be:

{
  "pub": {
    "name": string,
    "year": integer or null,
    "publication": string or null,
    "url": string or null
  },

  "#items": integer or null,
  "#samples": string or null,
  "#signers": integer or string or null,
  "features": array of strings,
  "language": string,
  "license": string or null,
  "licenseUrl": string or null
}
cleong110 commented 6 months ago

https://www.reddit.com/r/ProgrammerHumor/comments/t5qsj3/was_it_even_worth/

cleong110 commented 6 months ago

With annotations:

{
  "pub": {
    "name": string, # this gets used as the name of the dataset, e.g. "WLASL"
    "year": integer or null,
    "publication": string or null, # this matches a key in references.bib
    "url": string or null
  },

  "#items": integer or null, # the number of unique signs; this is what shows up in the items column
  "#samples": string or null,
  "#signers": integer or string or null, # number of unique signers
  "features": array of strings, # I've seen things like "mouthing", "video:RGB", "pose:Kinect", "pose:OpenPose", "text:Polish", "gloss:ASL", "writing:HamNoSys"
  "language": string, # the sign language or languages, e.g. "American Sign Language (ASL)"
  "license": string or null,
  "licenseUrl": string or null
}
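Since genson emits standard JSON Schema, a new dataset file could be checked against it with the jsonschema package. A quick sketch (not wired into the repo), assuming datasets_schema.json and the new BOBSL.json sit in the working directory:

import json

from jsonschema import ValidationError, validate

with open("datasets_schema.json") as schema_f:
    schema = json.load(schema_f)
with open("BOBSL.json") as ds_f:
    dataset = json.load(ds_f)

try:
    validate(instance=dataset, schema=schema)
    print("BOBSL.json matches the schema")
except ValidationError as err:
    print(f"schema mismatch: {err.message}")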
cleong110 commented 6 months ago

OK, I pushed BOBSL and SignBD.

cleong110 commented 6 months ago

And ISL-HS.

I think ArASL2018 is actually out of scope; it seems to be static images only.

cleong110 commented 6 months ago

Also, because I spent way too much time on it, here are my scripts attempting to automate/simplify a process that needs to be done manually anyway, sigh:

First, figuring out the required fields from the .json files:

import json
from pathlib import Path

from genson import SchemaBuilder

# Script by Colin Leong (cleong1@udayton.edu) that will
# (1) read all the JSONs in the datasets dir
# (2) save off a JSON of the schema, listing required values
def parse_jsons_to_schema() -> None:
    builder = SchemaBuilder()
    # https://stackoverflow.com/a/46061872
    datasets_path = Path(__file__).resolve().parent / "datasets"

    for ds_json in sorted(datasets_path.rglob("*.json")):
        print(ds_json.name)
        with open(ds_json, "r") as ds_json_f:
            # feed each dataset dict into the schema builder;
            # genson only keeps a field "required" if every object has it
            builder.add_object(json.load(ds_json_f))

    print("all JSONs parsed to schema with genson")

    schema = builder.to_schema()
    print("SCHEMA:")
    print(json.dumps(schema, indent=4))
    with open("datasets_schema.json", "w") as dataset_schema_file:
        json.dump(schema, dataset_schema_file, indent=2)

if __name__ == "__main__":
    parse_jsons_to_schema()
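The script locates datasets/ relative to its own file, so it should live in src/; the merged schema prints to stdout and datasets_schema.json lands in the current working directory.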
cleong110 commented 6 months ago

Trying to make an "easy" user interface that reads in that schema and then prompts the user turned out to be way more complicated than I thought, especially once I wanted to add validation features, examples of valid values, and accommodate types like "string or null":

import json

import click

# Script by Colin Leong (cleong1@udayton.edu) that will
# (1) ask the user to fill out some fields, based on the genson-created datasets_schema.json
# (2) save off a JSON
# (3) check the .bib file for a matching citation and warn the user

# https://www.reddit.com/r/Python/comments/uxqfia/menus_in_python/
# I _just_ want to prompt the user for the fields and check the data type.
# Options: the click library, the textual library, prompt-toolkit, pytermgui.
# click seems like a pain for what I want to do,
# https://github.com/Textualize/textual seems complicated,
# https://github.com/shade40/celx seems complicated.
_known_features = ["mouthing", "video:RGB", "pose:Kinect", "pose:OpenPose", "text:Polish", "gloss:ASL", "writing:HamNoSys"]

@click.command()
# @click.option(
#     "--pub_name",
#     prompt="Enter the name of the dataset, e.g. WLASL, AUTSL",
#     help="Dataset name as a string, e.g. WLASL, AUTSL",
# )
# @click.option(
#     "--pub_year",
#     prompt="Enter the year of publication",
#     help="Year of publication",
#     type=int,
# )
# @click.option(
#     "--num_items",
#     help="Number of items in the dataset",
#     type=int,
#     prompt="How many items in the dataset? If unknown, put a value less than 0",
# )
# @click.option(
#     "--features",
#     prompt=f"What features does it have? Separate with commas, e.g. \"{','.join(_known_features)}\". To leave it empty, enter a space",
#     type=str,
#     default="",
# )
def write_dataset_json(
    # num_items, pub_name, pub_year, features
):
    """Prompt for each dataset field, then save the result as <name>.json."""

    # TODO: load in the schema and do this all automatically!

    click.echo("Let's gather publication info")
    pub_keys = ["name", "year", "publication", "url"]
    pub_types = [str, int, str, str]
    pub_dict = {}

    # click re-prompts automatically if the input doesn't match the type
    for key, field_type in zip(pub_keys, pub_types):
        pub_dict[key] = click.prompt(f"Enter {key}", type=field_type)

    # pub_citation = click.prompt("Enter the citation key (should match a key in references.bib), e.g. dataset:vintar2012compiling", type=str, default=None)

    data_dict = {"pub": pub_dict}
    data_keys = ["#items", "#samples", "#signers", "features", "language", "license", "licenseUrl"]
    data_types = [int, str, str, str, str, str, str]

    for key, field_type in zip(data_keys, data_types):
        data_dict[key] = click.prompt(f"Enter {key}", type=field_type)

    # features are entered comma-separated; split them into a list
    if data_dict["features"].strip():
        data_dict["features"] = [x.strip() for x in data_dict["features"].split(",")]
    else:
        data_dict["features"] = []
    # if num_items <= 0:
    #     num_items = None

    print(json.dumps(data_dict, indent=2))

    # single quotes inside the f-string: nested double quotes are a syntax error before Python 3.12
    dataset_name = data_dict['pub']['name']
    if click.confirm(f"Save to JSON as {dataset_name}.json?"):
        with open(f"{dataset_name}.json", "w") as out_f:
            json.dump(data_dict, out_f, indent=2)

if __name__ == "__main__":
    write_dataset_json()
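Running it walks through each field in turn; click.prompt with a type argument re-prompts on bad input (e.g. a non-integer year), so basic type validation comes for free. Fields like "#items": integer or null are the part that would need something like a custom click ParamType, which is where the complexity crept in.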