I thought I had this figured out using the following procedure:

```bash
conda create -n sign_language_datasets_source pip python=3.10  # if I do 3.11 on Windows then there's no compatible tensorflow
conda activate sign_language_datasets_source
# navigate to the repo
git pull  # to make sure it's up to date
python -m pip install .  # python -m pip ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .
```
But when I follow the above steps, creating a new conda env and installing from source, I still get the same error.

OK, seriously, how does all this registration stuff work?
Having a look at the original tensorflow_datasets library, we've got:

- https://tensorflow.google.cn/datasets/common_gotchas?hl=en#new_datasets_should_be_self-contained_in_a_folder mentions something to do with paths
- https://www.tensorflow.org/datasets/add_dataset#add_an_import_for_registration talks about registration
- https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/registered.py is where the magic happens
- https://github.com/tensorflow/datasets/issues/552 also talks about how to unregister
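My rough mental model of that "magic", sketched below: tfds registers a dataset class as a side effect of *importing* the module that defines it. This is a simplified illustration of the pattern in `registered.py`, not the actual tfds code; the names here are illustrative:

```python
# Simplified sketch of import-time registration. The real logic lives in
# tensorflow_datasets/core/registered.py; this only illustrates the mechanism.
_DATASET_REGISTRY = {}

class RegisteredDataset:
    def __init_subclass__(cls, **kwargs):
        # Runs the moment a subclass is *defined*, i.e. when its module is imported.
        super().__init_subclass__(**kwargs)
        name = cls.__name__.lower()
        if name in _DATASET_REGISTRY:
            # The failure mode we're hitting: the same class body executes a
            # second time because the file was imported under two module paths.
            raise ValueError(f"Dataset {name} already registered.")
        _DATASET_REGISTRY[name] = cls

class AslgPc12(RegisteredDataset):  # registers "aslgpc12" on import
    pass
```

So if the same dataset file gets imported twice under two different module names, the class is defined, and therefore registered, twice.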
Does it occur in Colab?
Yes: https://colab.research.google.com/drive/160LWW9pamKpkcb0YzQrHgamRNfqNpFsx?usp=sharing
So it's not just my machine.
I mean, the issue's stack traces show

```
sign_language_datasets\datasets\__init__.py:1: in <module>
    from .aslg_pc12 import AslgPc12
sign_language_datasets\datasets\aslg_pc12\__init__.py:3: in <module>
    from .aslg_pc12 import AslgPc12
```

So maybe it shouldn't be in both places?
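For reference, the two frames in that trace correspond to nested `__init__.py` files in the repo layout, each doing its own relative import:

```
sign_language_datasets/
└── datasets/
    ├── __init__.py          # from .aslg_pc12 import AslgPc12
    └── aslg_pc12/
        ├── __init__.py      # from .aslg_pc12 import AslgPc12
        └── aslg_pc12.py     # defines class AslgPc12
```

That chaining is normal Python packaging, though; on its own, module caching in `sys.modules` means the class is only imported once.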
OK, what if I comment out all the dataset imports? My https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/__init__.py then looks like
```python
from .config import SignDatasetConfig

# from .aslg_pc12 import AslgPc12
# from .asl_lex import AslLex
# from .autsl import AUTSL
# from .chicagofswild import ChicagoFSWild
# from .dgs_types import DgsTypes
# from .dgs_corpus import DgsCorpus
# from .dicta_sign import DictaSign
# from .how2sign import How2Sign
# from .mediapi_skel import MediapiSkel
# from .rwth_phoenix2014_t import RWTHPhoenix2014T
# from .sign2mint import Sign2MINT
# from .signbank import SignBank
# from .signtyp import SignTyp
# from .signsuisse import SignSuisse
# from .sign_wordnet import SignWordnet
# from .swojs_glossario import SwojsGlossario
# from .wlasl import Wlasl
# from .ngt_corpus import NGTCorpus
# from .bsl_corpus import BslCorpus
# from .wmt_slt import WMTSLT
```
and then I run

```bash
python -m pip install . && pytest .
```
Hmmm, the problem with commenting out all the datasets in this file is that when we want to use a dataset, we use the following import, which registers all datasets:

```python
import sign_language_datasets.datasets
```

I'm OK with not doing this, and instead, whenever using a dataset (for example autsl), having to import it explicitly:

```python
import sign_language_datasets.datasets.autsl
```

Though it might cause confusion, with users not finding the right dataset.
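For concreteness, a minimal sketch of what explicit per-dataset usage would look like (assuming the catch-all imports are removed; `autsl` is just the example):

```python
import tensorflow_datasets as tfds

# The import itself is what registers the dataset with tfds:
import sign_language_datasets.datasets.autsl  # noqa: F401

# After the explicit import, tfds can resolve the dataset by name:
builder = tfds.builder("autsl")
print(builder.info.description)
```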
What do you think?
Well, I thought I'd go back and see what tfds does to handle all this, and I have noticed a few things.

1. Loading a dataset the standard way works:

```python
import tensorflow_datasets as tfds
import tensorflow as tf

ds = tfds.load('mnist', split='train', as_supervised=True, shuffle_files=True)
```

2. They structure their repo and `__init__.py` files a bit differently. Most notably they have:
- [None in the top level of the repo](https://github.com/tensorflow/datasets/tree/master)
- [One in the `tensorflow_datasets` subdirectory](https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/__init__.py), which imports the subcategories/subfolders such as `image_classification`, with a try-except built in (see the sketch after this list)
![image](https://github.com/sign-language-processing/datasets/assets/122366389/833e77e5-92f0-4287-bb61-805a79cb1fc5)
- then the [`__init__.py` in the subcategory folder](https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image_classification/__init__.py) finally imports the actual dataset, most of which don't have their own subfolder, just a .py...
![image](https://github.com/sign-language-processing/datasets/assets/122366389/2ef3b2d0-6347-468a-9130-1200abeca4ec)
- But some do, and they also have an [`__init__.py` with a full path](https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image_classification/cifar10_n/__init__.py), which also gets imported one level up in the category folder:
![image](https://github.com/sign-language-processing/datasets/assets/122366389/5756272e-76ad-4479-a2e6-85bfd651c875)
![image](https://github.com/sign-language-processing/datasets/assets/122366389/f03d1bfc-9756-44dd-9123-16e5d9a93cef)
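The try-except guard mentioned above looks roughly like this (paraphrased from the screenshot, not copied verbatim):

```python
# Guarded subcategory import: if a subcategory's optional dependencies are
# missing, skip that subcategory instead of breaking the whole package import.
try:
    from tensorflow_datasets import image_classification  # noqa: F401
except ImportError:
    pass
```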
The other thing I tested was whether I could get pytest running on their repo, and yes, I was able to, without hitting this import issue.
So translating it to `sign_language_datasets`, if we wanted the layout to be identical, we would edit the `__init__.py` in [`datasets`](https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/__init__.py) to use full paths, e.g. `sign_language_datasets.datasets.aslg_pc12`, to match the pattern of the subcategory inits like `image_classification` (sketched below).
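A hypothetical version of that file using full paths (dataset list abridged):

```python
# Hypothetical sign_language_datasets/datasets/__init__.py mirroring the
# tfds subcategory-init pattern with absolute package paths:
from sign_language_datasets.datasets.config import SignDatasetConfig

from sign_language_datasets.datasets.aslg_pc12 import AslgPc12
from sign_language_datasets.datasets.autsl import AUTSL
# ...one absolute import per dataset, rather than the relative
# "from .aslg_pc12 import AslgPc12" form.
```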
I guess I'll give it a whirl and see?
Hmmm, actually, just removing https://github.com/sign-language-processing/datasets/blob/master/__init__.py might be sufficient. I did

```bash
git clone https://github.com/sign-language-processing/datasets.git
cd datasets
conda create -n sign_language_datasets_source pip python=3.10
conda activate sign_language_datasets_source
git pull  # to make sure it's up to date
python -m pip install .  # python -m pip ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .  # gives me the import errors
rm __init__.py
pytest .  # it works
```
Made a .py file to load a dataset, and it seems to work in the expected way.

`load_phoenix.py`:

```python
import tensorflow_datasets as tfds
import sign_language_datasets.datasets
from sign_language_datasets.datasets.config import SignDatasetConfig
import itertools

if __name__ == "__main__":
    config = SignDatasetConfig(name="only-annotations", version="3.0.0", include_video=False)
    rwth_phoenix2014_t = tfds.load(name='rwth_phoenix2014_t', builder_kwargs=dict(config=config))

    for datum in itertools.islice(rwth_phoenix2014_t["train"], 0, 10):
        print(datum['gloss'].numpy().decode('utf-8'))
        print(datum['text'].numpy().decode('utf-8'))
        print()
```

```bash
pip install .  # make sure I'm using the edited version with the init gone
python load_phoenix.py  # it loads
```
So I guess maybe that `__init__.py` at the top of the repo was causing the double-registration issue? I'm still not fully confident I understand what's going on, but removing it seems to make things work without causing any issues I know of.
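If I wanted to confirm that hypothesis, one diagnostic (my own sketch, not something from the repo) would be to check whether the same dataset file ends up in `sys.modules` under more than one name, since each extra name means the class body, and hence the registration, ran again:

```python
# Diagnostic sketch (hypothetical): list every module name under which the
# aslg_pc12 module was imported. More than one entry would mean the dataset
# class was defined, and therefore registered, more than once.
import sys

dupes = [name for name in sys.modules if name.endswith("aslg_pc12.aslg_pc12")]
print(dupes)
```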
If you make a PR, I could see exactly what you did there. I am fine with following tfds's approach.
Made a pull request!
Resolved; pull request merged.
Originally posted by @cleong110 in https://github.com/sign-language-processing/datasets/issues/53#issuecomment-1957375266