I thought I had this figured out using the following procedure:

```bash
conda create -n sign_language_datasets_source pip python=3.10  # if I do 3.11 on Windows then there's no compatible tensorflow
conda activate sign_language_datasets_source
# navigate to the repo
git pull  # to make sure it's up to date
python -m pip install .  # python -m pip ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .
```
But when I follow the above steps, creating a new conda env and installing from source, I still get the same error.

OK, seriously, how does all this registration stuff work?
Having a look at the original tensorflow_datasets library, we've got:

- https://tensorflow.google.cn/datasets/common_gotchas?hl=en#new_datasets_should_be_self-contained_in_a_folder mentions something to do with paths
- https://www.tensorflow.org/datasets/add_dataset#add_an_import_for_registration talks about registration
- https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/registered.py is where the magic happens
- https://github.com/tensorflow/datasets/issues/552 also talks about how to unregister
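My rough mental model of that "magic", sketched below: tfds registers a dataset class as a side effect of *importing* the module that defines it. This is a simplified illustration of the pattern in `registered.py`, not the actual tfds code; the names here are illustrative:

```python
# Simplified sketch of import-time registration. The real logic lives in
# tensorflow_datasets/core/registered.py; this only illustrates the mechanism.
_DATASET_REGISTRY = {}

class RegisteredDataset:
    def __init_subclass__(cls, **kwargs):
        # Runs the moment a subclass is *defined*, i.e. when its module is imported.
        super().__init_subclass__(**kwargs)
        name = cls.__name__.lower()
        if name in _DATASET_REGISTRY:
            # The failure mode we're hitting: the same class body executes a
            # second time because the file was imported under two module paths.
            raise ValueError(f"Dataset {name} already registered.")
        _DATASET_REGISTRY[name] = cls

class AslgPc12(RegisteredDataset):  # registers "aslgpc12" on import
    pass
```

So if the same dataset file gets imported twice under two different module names, the class is defined, and therefore registered, twice.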
Does it occur in Colab?
Yes: https://colab.research.google.com/drive/160LWW9pamKpkcb0YzQrHgamRNfqNpFsx?usp=sharing
So it's not just my machine.
I mean, the issue's stack traces show

```
sign_language_datasets\datasets\__init__.py:1: in <module>
    from .aslg_pc12 import AslgPc12
sign_language_datasets\datasets\aslg_pc12\__init__.py:3: in <module>
    from .aslg_pc12 import AslgPc12
```

So maybe it shouldn't be in both places?
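For reference, the two frames in that trace correspond to nested `__init__.py` files in the repo layout, each doing its own relative import:

```
sign_language_datasets/
└── datasets/
    ├── __init__.py          # from .aslg_pc12 import AslgPc12
    └── aslg_pc12/
        ├── __init__.py      # from .aslg_pc12 import AslgPc12
        └── aslg_pc12.py     # defines class AslgPc12
```

That chaining is normal Python packaging, though; on its own, module caching in `sys.modules` means the class is only imported once.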
OK, what if I comment out all the dataset imports? My https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/__init__.py then looks like
```python
from .config import SignDatasetConfig

# from .aslg_pc12 import AslgPc12
# from .asl_lex import AslLex
# from .autsl import AUTSL
# from .chicagofswild import ChicagoFSWild
# from .dgs_types import DgsTypes
# from .dgs_corpus import DgsCorpus
# from .dicta_sign import DictaSign
# from .how2sign import How2Sign
# from .mediapi_skel import MediapiSkel
# from .rwth_phoenix2014_t import RWTHPhoenix2014T
# from .sign2mint import Sign2MINT
# from .signbank import SignBank
# from .signtyp import SignTyp
# from .signsuisse import SignSuisse
# from .sign_wordnet import SignWordnet
# from .swojs_glossario import SwojsGlossario
# from .wlasl import Wlasl
# from .ngt_corpus import NGTCorpus
# from .bsl_corpus import BslCorpus
# from .wmt_slt import WMTSLT
```
and then I run

```bash
python -m pip install . && pytest .
```
Hmmm, the problem with commenting out all the datasets in this file is that when we want to use a dataset, we use the following import, which registers all datasets:

```python
import sign_language_datasets.datasets
```

I'm OK with not doing this, and instead, whenever using a dataset (for example autsl), having to import it explicitly:

```python
import sign_language_datasets.datasets.autsl
```

Though it might cause confusion, with users not finding the right dataset.
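For concreteness, a minimal sketch of what explicit per-dataset usage would look like (assuming the catch-all imports are removed; `autsl` is just the example):

```python
import tensorflow_datasets as tfds

# The import itself is what registers the dataset with tfds:
import sign_language_datasets.datasets.autsl  # noqa: F401

# After the explicit import, tfds can resolve the dataset by name:
builder = tfds.builder("autsl")
print(builder.info.description)
```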
What do you think?
Well, I thought I'd go back and see what tfds does to handle all this, and I have noticed a few things.

1. Loading a dataset the standard way works:

```python
import tensorflow_datasets as tfds
import tensorflow as tf

ds = tfds.load('mnist', split='train', as_supervised=True, shuffle_files=True)
```

2. They structure their repo and `__init__.py` files a bit differently. Most notably they have:
- [None in the top level of the repo](https://github.com/tensorflow/datasets/tree/master)
- [One in the `tensorflow_datasets` subdirectory](https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/__init__.py), which imports the subcategories/subfolders such as `image_classification`, with a try-except built in (see the sketch after this list)
![image](https://github.com/sign-language-processing/datasets/assets/122366389/833e77e5-92f0-4287-bb61-805a79cb1fc5)
- then the [`__init__.py` in the subcategory folder](https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image_classification/__init__.py) finally imports the actual dataset, most of which don't have their own subfolder, just a .py...
![image](https://github.com/sign-language-processing/datasets/assets/122366389/2ef3b2d0-6347-468a-9130-1200abeca4ec)
- But some do, and they also have an [`__init__.py` with a full path](https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image_classification/cifar10_n/__init__.py), which also gets imported one level up in the category folder:
![image](https://github.com/sign-language-processing/datasets/assets/122366389/5756272e-76ad-4479-a2e6-85bfd651c875)
![image](https://github.com/sign-language-processing/datasets/assets/122366389/f03d1bfc-9756-44dd-9123-16e5d9a93cef)
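The try-except guard mentioned above looks roughly like this (paraphrased from the screenshot, not copied verbatim):

```python
# Guarded subcategory import: if a subcategory's optional dependencies are
# missing, skip that subcategory instead of breaking the whole package import.
try:
    from tensorflow_datasets import image_classification  # noqa: F401
except ImportError:
    pass
```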
The other thing I tested was whether I could get pytest running on their repo, and yes, I was able to, without hitting this import issue.
So translating it to `sign_language_datasets`, if we wanted the layout to be identical, we would edit the `__init__.py` in [`datasets`](https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/__init__.py) to use full paths, e.g. `sign_language_datasets.datasets.aslg_pc12`, to match the pattern of the subcategory inits like `image_classification` (sketched below).
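A hypothetical version of that file using full paths (dataset list abridged):

```python
# Hypothetical sign_language_datasets/datasets/__init__.py mirroring the
# tfds subcategory-init pattern with absolute package paths:
from sign_language_datasets.datasets.config import SignDatasetConfig

from sign_language_datasets.datasets.aslg_pc12 import AslgPc12
from sign_language_datasets.datasets.autsl import AUTSL
# ...one absolute import per dataset, rather than the relative
# "from .aslg_pc12 import AslgPc12" form.
```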
I guess I'll give it a whirl and see?
Hmmm, actually, just removing https://github.com/sign-language-processing/datasets/blob/master/__init__.py might be sufficient. I did

```bash
git clone https://github.com/sign-language-processing/datasets.git
cd datasets
conda create -n sign_language_datasets_source pip python=3.10
conda activate sign_language_datasets_source
git pull  # to make sure it's up to date
python -m pip install .  # python -m pip ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .  # gives me the import errors
rm __init__.py
pytest .  # it works
```
Made a .py file to load a dataset, and it seems to work in the expected way.

`load_phoenix.py`:

```python
import tensorflow_datasets as tfds
import sign_language_datasets.datasets
from sign_language_datasets.datasets.config import SignDatasetConfig
import itertools

if __name__ == "__main__":
    config = SignDatasetConfig(name="only-annotations", version="3.0.0", include_video=False)
    rwth_phoenix2014_t = tfds.load(name='rwth_phoenix2014_t', builder_kwargs=dict(config=config))

    for datum in itertools.islice(rwth_phoenix2014_t["train"], 0, 10):
        print(datum['gloss'].numpy().decode('utf-8'))
        print(datum['text'].numpy().decode('utf-8'))
        print()
```

```bash
pip install .  # make sure I'm using the edited version with the init gone
python load_phoenix.py  # it loads
```
So I guess maybe that `__init__.py` at the top of the repo was causing the double-registration issue? I'm still not fully confident I understand what's going on, but removing it seems to make things work without causing any issues I know of.
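If I wanted to confirm that hypothesis, one diagnostic (my own sketch, not something from the repo) would be to check whether the same dataset file ends up in `sys.modules` under more than one name, since each extra name means the class body, and hence the registration, ran again:

```python
# Diagnostic sketch (hypothetical): list every module name under which the
# aslg_pc12 module was imported. More than one entry would mean the dataset
# class was defined, and therefore registered, more than once.
import sys

dupes = [name for name in sys.modules if name.endswith("aslg_pc12.aslg_pc12")]
print(dupes)
```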
If you make a PR, I could see exactly what you did there. I am fine with following tfds's approach.
Made a pull request!
Resolved; pull request merged.
Originally posted by @cleong110 in https://github.com/sign-language-processing/datasets/issues/53#issuecomment-1957375266