google-research / spherical-cnn

Training not starting #34

Open vjugor1 opened 1 month ago

vjugor1 commented 1 month ago

Hello!

Thank you for sharing the implementation. Unfortunately, I ran into a problem while following the QM9 guide in README.md.

Steps taken:

git clone https://github.com/google-research/spherical-cnn.git
cd spherical-cnn
# Create a docker container, download and install dependencies.
docker build -f dockerfile-qm9 -t spherical_cnn_qm9 .
# Start training.
docker run spherical_cnn_qm9 \
    --workdir=/tmp/training_logs \
    --config=spherical_cnn/spherical_mnist/configs/default \
    --config.per_device_batch_size=2

Result obtained (error log):

2024-05-31 14:01:22.865302: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I0531 14:01:40.676771 140518412662592 xla_bridge.py:622] Unable to initialize backend 'cuda': FAILED_PRECONDITION: No visible GPU devices.
I0531 14:01:40.677033 140518412662592 xla_bridge.py:622] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter CUDA
I0531 14:01:40.677349 140518412662592 xla_bridge.py:622] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
W0531 14:01:40.677425 140518412662592 xla_bridge.py:636] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
I0531 14:01:40.677516 140518412662592 main.py:53] JAX host: 0 / 1
I0531 14:01:40.677578 140518412662592 main.py:54] JAX devices: [CpuDevice(id=0)]
I0531 14:01:40.677884 140518412662592 local.py:45] Setting task status: process_index: 0, process_count: 1
I0531 14:01:40.678051 140518412662592 local.py:50] Created artifact workdir of type ArtifactType.DIRECTORY and value /tmp/training_logs.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/ml_collections/config_dict/config_dict.py", line 903, in __getitem__
    field = self._fields[key]
KeyError: 'per_batch_padding'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/ml_collections/config_dict/config_dict.py", line 827, in __getattr__
    return self[attribute]
  File "/opt/conda/lib/python3.10/site-packages/ml_collections/config_dict/config_dict.py", line 909, in __getitem__
    raise KeyError(self._generate_did_you_mean_message(key, str(e)))
KeyError: "'per_batch_padding'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workdir/spherical_cnn/molecules/main.py", line 67, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/workdir/spherical_cnn/molecules/main.py", line 61, in main
    train.train_and_evaluate(FLAGS.config, _WORKDIR.value)
  File "/workdir/spherical_cnn/molecules/train.py", line 500, in train_and_evaluate
    if not config.per_batch_padding:
  File "/opt/conda/lib/python3.10/site-packages/ml_collections/config_dict/config_dict.py", line 829, in __getattr__
    raise AttributeError(e)
AttributeError: "'per_batch_padding'"

I would be grateful if you could fix this issue or give a hint, since it is currently not possible to reproduce your papers' results.

Thank you in advance!

Sincerely, SL

P.S. In README.md, the command

docker run spherical_cnn_qm9 \
    --workdir=/tmp/training_logs \
    --config=spherical_cnn/spherical_mnist/configs/small.py \
    --config.per_device_batch_size=2

points to a non-existent small.py config. Please consider changing it to default.py, which is provided in your repository.
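
A side note on the traceback above: the entry point is spherical_cnn/molecules/main.py, while the config passed on the command line is the spherical_mnist one, which presumably does not define per_batch_padding. Below is a minimal sketch of checking for required fields before training starts. AttrConfig and missing_fields are hypothetical names used only for illustration (they are not part of the repository or of ml_collections), and the field values are made up.

```python
class AttrConfig:
    """Hypothetical stand-in for an attribute-style config object."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self.__dict__["_fields"][name]
        except KeyError as e:
            # Mirror the behaviour seen in the traceback: a missing
            # field surfaces as an AttributeError.
            raise AttributeError(name) from e


def missing_fields(config, required):
    """Return the names in `required` that `config` does not define."""
    return [name for name in required if not hasattr(config, name)]


# Fields the molecules training loop reads, according to the traceback
# and the command line above (assumed to be a partial list).
REQUIRED = ("per_device_batch_size", "per_batch_padding")

# Illustrative configs: one without per_batch_padding, one with it.
mnist_like = AttrConfig(per_device_batch_size=2)
molecules_like = AttrConfig(per_device_batch_size=2, per_batch_padding=False)

if __name__ == "__main__":
    print("mnist-like config is missing:", missing_fields(mnist_like, REQUIRED))
    print("molecules-like config is missing:", missing_fields(molecules_like, REQUIRED))
```

A check like this would turn the nested KeyError/AttributeError chain into a single message naming the missing field, which makes a config/entry-point mismatch easier to spot.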

vjugor1 commented 1 month ago

I also assumed that there was a typo in the config path and tried spherical_cnn/molecules/configs/small.py and spherical_cnn/molecules/configs/default.py instead. Both choices also produce an error, see below.

The error is the same for both:

2024-05-31 14:07:35.379065: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I0531 14:07:53.207865 140623350708032 xla_bridge.py:622] Unable to initialize backend 'cuda': FAILED_PRECONDITION: No visible GPU devices.
I0531 14:07:53.208091 140623350708032 xla_bridge.py:622] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter CUDA
I0531 14:07:53.208389 140623350708032 xla_bridge.py:622] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
W0531 14:07:53.208456 140623350708032 xla_bridge.py:636] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
I0531 14:07:53.208536 140623350708032 main.py:53] JAX host: 0 / 1
I0531 14:07:53.208586 140623350708032 main.py:54] JAX devices: [CpuDevice(id=0)]
I0531 14:07:53.208865 140623350708032 local.py:45] Setting task status: process_index: 0, process_count: 1
I0531 14:07:53.209006 140623350708032 local.py:50] Created artifact workdir of type ArtifactType.DIRECTORY and value /tmp/training_logs.
I0531 14:07:53.338257 140623350708032 load.py:212] Failed to load dataset "qm9", builder_kwargs "{'config': 'dimenet'}" from files: Could not find dataset files for: qm9. Make sure you have the correct permissions to access the dataset and that it has been generated in: ['/workdir/tensorflow_datasets']. If the dataset has configs, you might have to specify the config name.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workdir/spherical_cnn/molecules/main.py", line 67, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/workdir/spherical_cnn/molecules/main.py", line 61, in main
    train.train_and_evaluate(FLAGS.config, _WORKDIR.value)
  File "/workdir/spherical_cnn/molecules/train.py", line 504, in train_and_evaluate
    splits = input_pipeline.create_datasets(config, data_rng)
  File "/workdir/spherical_cnn/molecules/input_pipeline.py", line 106, in create_datasets
    return _create_dataset_qm9(config, data_rng)
  File "/workdir/spherical_cnn/molecules/input_pipeline.py", line 182, in _create_dataset_qm9
    dataset_builder = tfds.builder('qm9/dimenet')
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/tensorflow_datasets/core/logging/__init__.py", line 168, in __call__
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorflow_datasets/core/load.py", line 225, in builder
    raise not_found_error
  File "/opt/conda/lib/python3.10/site-packages/tensorflow_datasets/core/load.py", line 202, in builder
    cls = builder_cls(str(name))
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/tensorflow_datasets/core/load.py", line 124, in builder_cls
    cls = registered.imported_builder_cls(str(ds_name))
  File "/opt/conda/lib/python3.10/site-packages/tensorflow_datasets/core/registered.py", line 301, in imported_builder_cls
    raise DatasetNotFoundError(f'Dataset {name} not found.')
tensorflow_datasets.core.registered.DatasetNotFoundError: Dataset qm9 not found.
Available datasets:
        - abstract_reasoning
        - accentdb
        - aeslc
        - aflw2k3d
        - ag_news_subset
        - ai2_arc
        - ai2_arc_with_ir
        - amazon_us_reviews
        - anli
        - answer_equivalence
        - arc
        - asqa
        - asset
        - assin2
        - asu_table_top_converted_externally_to_rlds
        - austin_buds_dataset_converted_externally_to_rlds
        - austin_sailor_dataset_converted_externally_to_rlds
        - austin_sirius_dataset_converted_externally_to_rlds
        - bair_robot_pushing_small
        - bc_z
        - bccd
        - beans
        - bee_dataset
        - beir
        - berkeley_autolab_ur5
        - berkeley_cable_routing
        - berkeley_fanuc_manipulation
        - berkeley_gnm_cory_hall
        - berkeley_gnm_recon
        - berkeley_gnm_sac_son
        - berkeley_mvp_converted_externally_to_rlds
        - berkeley_rpt_converted_externally_to_rlds
        - big_patent
        - bigearthnet
        - billsum
        - binarized_mnist
        - binary_alpha_digits
        - ble_wind_field
        - blimp
        - booksum
        - bool_q
        - bot_adversarial_dialogue
        - bridge
        - bucc
        - c4
        - c4_wsrs
        - caltech101
        - caltech_birds2010
        - caltech_birds2011
        - cardiotox
        - cars196
        - cassava
        - cats_vs_dogs
        - celeb_a
        - celeb_a_hq
        - cfq
        - cherry_blossoms
        - chexpert
        - cifar10
        - cifar100
        - cifar100_n
        - cifar10_1
        - cifar10_corrupted
        - cifar10_h
        - cifar10_n
        - citrus_leaves
        - cityscapes
        - civil_comments
        - clevr
        - clic
        - clinc_oos
        - cmaterdb
        - cmu_franka_exploration_dataset_converted_externally_to_rlds
        - cmu_play_fusion
        - cmu_stretch
        - cnn_dailymail
        - coco
        - coco_captions
        - coil100
        - colorectal_histology
        - colorectal_histology_large
        - columbia_cairlab_pusht_real
        - common_voice
        - conll2002
        - conll2003
        - controlled_noisy_web_labels
        - coqa
        - corr2cause
        - cos_e
        - cosmos_qa
        - covid19
        - covid19sum
        - crema_d
        - criteo
        - cs_restaurants
        - curated_breast_imaging_ddsm
        - cycle_gan
        - d4rl_adroit_door
        - d4rl_adroit_hammer
        - d4rl_adroit_pen
        - d4rl_adroit_relocate
        - d4rl_antmaze
        - d4rl_mujoco_ant
        - d4rl_mujoco_halfcheetah
        - d4rl_mujoco_hopper
        - d4rl_mujoco_walker2d
        - dart
        - databricks_dolly
        - davis
        - deep1b
        - deep_weeds
        - definite_pronoun_resolution
        - dementiabank
        - diabetic_retinopathy_detection
        - diamonds
        - div2k
        - dlr_edan_shared_control_converted_externally_to_rlds
        - dlr_sara_grid_clamp_converted_externally_to_rlds
        - dlr_sara_pour_converted_externally_to_rlds
        - dmlab
        - doc_nli
        - dolphin_number_word
        - domainnet
        - downsampled_imagenet
        - drop
        - dsprites
        - dtd
        - duke_ultrasound
        - e2e_cleaned
        - efron_morris75
        - emnist
        - eraser_multi_rc
        - esnli
        - eth_agent_affordances
        - eurosat
        - fashion_mnist
        - flic
        - flores
        - food101
        - forest_fires
        - fractal20220817_data
        - fuss
        - gap
        - geirhos_conflict_stimuli
        - gem
        - genomics_ood
        - german_credit_numeric
        - gigaword
        - glove100_angular
        - glue
        - goemotions
        - gov_report
        - gpt3
        - gref
        - groove
        - grounded_scan
        - gsm8k
        - gtzan
        - gtzan_music_speech
        - hellaswag
        - higgs
        - hillstrom
        - horses_or_humans
        - howell
        - i_naturalist2017
        - i_naturalist2018
        - i_naturalist2021
        - iamlab_cmu_pickup_insert_converted_externally_to_rlds
        - imagenet2012
        - imagenet2012_corrupted
        - imagenet2012_fewshot
        - imagenet2012_multilabel
        - imagenet2012_real
        - imagenet2012_subset
        - imagenet_a
        - imagenet_lt
        - imagenet_pi
        - imagenet_r
        - imagenet_resized
        - imagenet_sketch
        - imagenet_v2
        - imagenette
        - imagewang
        - imdb_reviews
        - imperialcollege_sawyer_wrist_cam
        - irc_disentanglement
        - iris
        - istella
        - jaco_play
        - kaist_nonprehensile_converted_externally_to_rlds
        - kddcup99
        - kitti
        - kmnist
        - kuka
        - laion400m
        - lambada
        - lfw
        - librispeech
        - librispeech_lm
        - libritts
        - ljspeech
        - lm1b
        - locomotion
        - lost_and_found
        - lsun
        - lvis
        - malaria
        - maniskill_dataset_converted_externally_to_rlds
        - math_dataset
        - math_qa
        - mctaco
        - media_sum
        - mlqa
        - mnist
        - mnist_corrupted
        - movie_lens
        - movie_rationales
        - movielens
        - moving_mnist
        - mrqa
        - mslr_web
        - mt_opt
        - mtnt
        - multi_news
        - multi_nli
        - multi_nli_mismatch
        - natural_instructions
        - natural_questions
        - natural_questions_open
        - newsroom
        - nsynth
        - nyu_depth_v2
        - nyu_door_opening_surprising_effectiveness
        - nyu_franka_play_dataset_converted_externally_to_rlds
        - nyu_rot_dataset_converted_externally_to_rlds
        - ogbg_molpcba
        - omniglot
        - open_images_challenge2019_detection
        - open_images_v4
        - openbookqa
        - opinion_abstracts
        - opinosis
        - opus
        - oxford_flowers102
        - oxford_iiit_pet
        - para_crawl
        - pass
        - patch_camelyon
        - paws_wiki
        - paws_x_wiki
        - penguins
        - pet_finder
        - pg19
        - piqa
        - places365_small
        - placesfull
        - plant_leaves
        - plant_village
        - plantae_k
        - protein_net
        - q_re_cc
        - qa4mre
        - qasc
        - quac
        - quality
        - quickdraw_bitmap
        - race
        - radon
        - real_toxicity_prompts
        - reddit
        - reddit_disentanglement
        - reddit_tifu
        - ref_coco
        - resisc45
        - rlu_atari
        - rlu_atari_checkpoints
        - rlu_atari_checkpoints_ordered
        - rlu_control_suite
        - rlu_dmlab_explore_object_rewards_few
        - rlu_dmlab_explore_object_rewards_many
        - rlu_dmlab_rooms_select_nonmatching_object
        - rlu_dmlab_rooms_watermaze
        - rlu_dmlab_seekavoid_arena01
        - rlu_locomotion
        - rlu_rwrl
        - robomimic_mg
        - robomimic_mh
        - robomimic_ph
        - robonet
        - robosuite_panda_pick_place_can
        - roboturk
        - rock_paper_scissors
        - rock_you
        - s3o4d
        - salient_span_wikipedia
        - samsum
        - savee
        - scan
        - scene_parse150
        - schema_guided_dialogue
        - sci_tail
        - scicite
        - scientific_papers
        - scrolls
        - segment_anything
        - sentiment140
        - shapes3d
        - sift1m
        - simpte
        - siscore
        - smallnorb
        - smartwatch_gestures
        - snli
        - so2sat
        - speech_commands
        - spoken_digit
        - squad
        - squad_question_generation
        - stanford_dogs
        - stanford_hydra_dataset_converted_externally_to_rlds
        - stanford_kuka_multimodal_dataset_converted_externally_to_rlds
        - stanford_mask_vit_converted_externally_to_rlds
        - stanford_online_products
        - stanford_robocook_converted_externally_to_rlds
        - star_cfq
        - starcraft_video
        - stl10
        - story_cloze
        - summscreen
        - sun397
        - super_glue
        - svhn_cropped
        - symmetric_solids
        - taco_play
        - tao
        - tatoeba
        - ted_hrlr_translate
        - ted_multi_translate
        - tedlium
        - tf_flowers
        - the300w_lp
        - tiny_shakespeare
        - titanic
        - tokyo_u_lsmo_converted_externally_to_rlds
        - toto
        - trec
        - trivia_qa
        - tydi_qa
        - uc_merced
        - ucf101
        - ucsd_kitchen_dataset_converted_externally_to_rlds
        - ucsd_pick_and_place_dataset_converted_externally_to_rlds
        - uiuc_d3field
        - unified_qa
        - universal_dependencies
        - unnatural_instructions
        - usc_cloth_sim_converted_externally_to_rlds
        - user_libri_audio
        - user_libri_text
        - utaustin_mutex
        - utokyo_pr2_opening_fridge_converted_externally_to_rlds
        - utokyo_pr2_tabletop_manipulation_converted_externally_to_rlds
        - utokyo_saytap_converted_externally_to_rlds
        - utokyo_xarm_bimanual_converted_externally_to_rlds
        - utokyo_xarm_pick_and_place_converted_externally_to_rlds
        - vctk
        - viola
        - visual_domain_decathlon
        - voc
        - voxceleb
        - voxforge
        - waymo_open_dataset
        - web_graph
        - web_nlg
        - web_questions
        - webvid
        - wider_face
        - wiki40b
        - wiki_auto
        - wiki_bio
        - wiki_dialog
        - wiki_table_questions
        - wiki_table_text
        - wikiann
        - wikihow
        - wikipedia
        - wikipedia_toxicity_subtypes
        - wine_quality
        - winogrande
        - wit
        - wit_kaggle
        - wmt13_translate
        - wmt14_translate
        - wmt15_translate
        - wmt16_translate
        - wmt17_translate
        - wmt18_translate
        - wmt19_translate
        - wmt_t2t_translate
        - wmt_translate
        - wordnet
        - wsc273
        - xnli
        - xquad
        - xsum
        - xtreme_pawsx
        - xtreme_pos
        - xtreme_s
        - xtreme_xnli
        - yahoo_ltrc
        - yelp_polarity_reviews
        - yes_no
        - youtube_vis

Check that:
    - if dataset was added recently, it may only be available
      in `tfds-nightly`
    - the dataset name is spelled correctly
    - dataset class defines all base class abstract methods
    - the module defining the dataset class is imported

The builder directory /workdir/tensorflow_datasets/qm9/dimenet doesn't contain any versions.
No builder could be found in the directory: /workdir/tensorflow_datasets for the builder: qm9.
No registered data_dirs were found in:
        - /workdir/tensorflow_datasets
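
The last lines of the log suggest that /workdir/tensorflow_datasets/qm9/dimenet exists but holds no prepared version. Here is a small sketch for checking this from inside the container. It assumes the usual TFDS on-disk layout of data_dir/name/config/version, with version directories named like 1.0.0. prepared_versions is a hypothetical helper, not a TFDS function.

```python
import re
from pathlib import Path

# Version directories are assumed to look like "1.0.0".
_VERSION_RE = re.compile(r"^\d+\.\d+\.\d+$")


def prepared_versions(data_dir, name, config):
    """List version directories found under data_dir/name/config."""
    builder_dir = Path(data_dir) / name / config
    if not builder_dir.is_dir():
        return []
    versions = [
        p.name
        for p in builder_dir.iterdir()
        if p.is_dir() and _VERSION_RE.match(p.name)
    ]
    # Sort numerically so that 1.10.0 comes after 1.2.0.
    return sorted(versions, key=lambda v: tuple(int(x) for x in v.split(".")))


if __name__ == "__main__":
    # Paths taken from the error log above.
    found = prepared_versions("/workdir/tensorflow_datasets", "qm9", "dimenet")
    print(found or "no prepared versions found")
```

An empty result here would be consistent with the dataset never having been downloaded and prepared during the docker build step.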