project-lighter / lighter

Config-based framework for organized and reproducible deep learning. MONAI Bundle + PyTorch Lightning.
https://project-lighter.github.io/lighter
MIT License

KeyError and ValueError Encountered While Running Quickstart Example in Python 3.8 and 3.10 Environments #119

Closed: salujajustin closed this issue 3 months ago

salujajustin commented 3 months ago

🐛 Bug Report

When trying to reproduce the Quickstart example after installing with pip install project-lighter and with pip install project-lighter --pre, I encountered two errors. Python 3.8 and 3.10 environments return identical errors. Without the --pre option, the error is ValueError: run ID 'run' doesn't exist in the config file. With the --pre option, it is KeyError: "id='trainer' is not found in the config resolver." See the attached screenshots for more details.

🔬 How To Reproduce

Steps to reproduce the behavior:

  1. Start with a fresh Python 3.8 or 3.10 install in a virtual environment.
  2. Install the package with pip install project-lighter or pip install project-lighter --pre.
  3. Create a cifar10.yaml config with the provided content.
  4. Run the command lighter fit --config_file cifar10.yaml.

Code sample

Step 1:

micromamba create -n ssl-thymus-lighter  
micromamba activate ssl-thymus-lighter
micromamba install python=3.8 -c conda-forge

Step 2:

pip install project-lighter

or:

pip install project-lighter --pre

Step 3:

system:
  _target_: lighter.LighterSystem
  batch_size: 512

  model:
    _target_: torchvision.models.resnet18
    num_classes: 10

  criterion:
    _target_: torch.nn.CrossEntropyLoss

  optimizer:
    _target_: torch.optim.Adam
    params: "$@system#model.parameters()"
    lr: 0.001

  datasets:
    train:
      _target_: torchvision.datasets.CIFAR10
      download: True
      root: .datasets
      train: True
      transform:
        _target_: torchvision.transforms.Compose
        transforms:
          - _target_: torchvision.transforms.ToTensor
          - _target_: torchvision.transforms.Normalize
            mean: [0.5, 0.5, 0.5]
            std: [0.5, 0.5, 0.5]

Step 4:

lighter fit --config_file cifar10.yaml

Environment

Screenshots

The following is the output for the lighter package from pip install project-lighter: [screenshot]

The following is the output for the lighter package from pip install project-lighter --pre: [screenshot]

📈 Expected behavior

The training should start without any errors when running lighter fit --config_file cifar10.yaml

📎 Additional context

None.

github-actions[bot] commented 3 months ago

Hello @salujajustin, thank you for your interest in our work!

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue.

dosubot[bot] commented 3 months ago

Hey, @salujajustin! 👋 I'm here to help you with any bugs, questions, or contributions you might have. Let's squash those bugs together!

To resolve the errors encountered while running the Quickstart example for the project-lighter package, ensure that your configuration file is correctly set up and that you are using the correct command to run the experiment. Here are the steps to follow:

  1. Ensure the Configuration File is Correct: Make sure your cifar10.yaml file includes both the trainer and system sections as shown below:

    trainer:
      _target_: pytorch_lightning.Trainer
      max_epochs: 100
    
    system:
      _target_: lighter.LighterSystem
      batch_size: 512
    
      model:
        _target_: torchvision.models.resnet18
        num_classes: 10
    
      criterion:
        _target_: torch.nn.CrossEntropyLoss
    
      optimizer:
        _target_: torch.optim.Adam
        params: "$@system#model.parameters()"
        lr: 0.001
    
      datasets:
        train:
          _target_: torchvision.datasets.CIFAR10
          download: True
          root: .datasets
          train: True
          transform:
            _target_: torchvision.transforms.Compose
            transforms:
              - _target_: torchvision.transforms.ToTensor
              - _target_: torchvision.transforms.Normalize
                mean: [0.5, 0.5, 0.5]
                std: [0.5, 0.5, 0.5]
  2. Run the Command Correctly: Use the following command to run the experiment:

    lighter fit --config_file cifar10.yaml
  3. Ensure Proper Configuration Parsing: The parse_config function in lighter/utils/runner.py ensures that the configuration file is correctly parsed and merged. Make sure that your configuration file is specified correctly and that any additional arguments are provided as needed. The function checks for the presence of the config_file key and raises a ValueError if it is not specified.

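    from monai.bundle import ConfigParser

    # import_module_from_path is a lighter helper; its import is omitted here.


    def parse_config(**kwargs) -> ConfigParser:
        # --config_file is mandatory; the remaining kwargs are applied as overrides.
        if "config_file" not in kwargs:
            raise ValueError("--config_file not specified. Exiting.")
        parser = ConfigParser()
        parser.read_config(kwargs.pop("config_file"))
        parser.update(pairs=kwargs)
        # If a "project" path is set, import that module under the name "project".
        project = parser.get("project", None)
        if project is not None:
            import_module_from_path("project", project)
        return parser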
  4. Run the Trainer Method: The run_trainer_method function ensures that the trainer and system sections are correctly parsed and used. It sets the random seed, parses the configuration, and runs the specified Trainer method.

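    from typing import Any

    from pytorch_lightning import seed_everything


    # `method` is the name of a Trainer method (e.g. "fit"), so it is a str, not a Dict.
    def run_trainer_method(method: str, **kwargs: Any):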
        seed_everything()
        parser = parse_config(**kwargs)
        trainer = parser.get_parsed_content("trainer")
        system = parser.get_parsed_content("system")
        config = parser.get()
        config.pop("_meta_")
        system.save_hyperparameters(config)
        if trainer.logger is not None:
            trainer.logger.log_hyperparams(config)
        if not hasattr(trainer, method):
            raise ValueError(f"Trainer has no method named {method}.")
        getattr(trainer, method)(system)

By ensuring that both the trainer and system sections are present in your YAML configuration file, using the correct command, and ensuring proper configuration parsing, you should be able to resolve the ValueError: run ID 'run' doesn't exist in the config file and KeyError: "id='trainer' is not found in the config resolver" errors [1][2][3].
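To see why a config with only a system section fails before training starts, here is a minimal, torch-free sketch of the dispatch step in run_trainer_method. StubTrainer and run_method are hypothetical names; plain dict entries stand in for the objects that parser.get_parsed_content(...) would return, and the KeyError message only imitates the resolver's wording.

```python
from typing import Any, Dict, List, Tuple


class StubTrainer:
    """Hypothetical stand-in for pytorch_lightning.Trainer that records calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Any]] = []

    def fit(self, system: Any) -> None:
        self.calls.append(("fit", system))


def run_method(config: Dict[str, Any], method: str) -> StubTrainer:
    """Simplified version of run_trainer_method operating on a plain dict."""
    # Both sections are looked up before any Trainer method runs, so a config
    # containing only "system" fails here, before training starts.
    for key in ("trainer", "system"):
        if key not in config:
            raise KeyError(f"id='{key}' is not found in the config.")
    trainer, system = config["trainer"], config["system"]
    # `method` is the *name* of a Trainer method (e.g. "fit"), looked up by getattr.
    if not hasattr(trainer, method):
        raise ValueError(f"Trainer has no method named {method}.")
    getattr(trainer, method)(system)
    return trainer


t = run_method({"trainer": StubTrainer(), "system": "my-system"}, "fit")
print(t.calls)  # [('fit', 'my-system')]
```

The lookup order is the point: the trainer section is resolved first, so a missing trainer surfaces as a lookup error rather than as anything related to the system itself.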

ibro45 commented 3 months ago

Did you define a trainer in the config in addition to the system? If that's the issue, I'll add better error handling to make it more obvious.

surajpaib commented 3 months ago

I see the issue here. The docs are missing the trainer in the cifar10 example. I'll fix it over the weekend. @salujajustin I think you should follow steps 1 and 2 of the dosubot instructions; hopefully that works.

@ibro45 We should use this as a chance to add better error messages for our "reserved" keys.
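One way to act on the "reserved" keys suggestion is to validate the top-level sections before anything is instantiated and report every missing one at once. This is a hypothetical sketch, not lighter's actual error handling: validate_sections and REQUIRED_SECTIONS are illustrative names, and treating trainer and system as the required set is an assumption based on run_trainer_method.

```python
from typing import Any, Iterable, Mapping

# Assumption: the sections lighter's CLI resolves, per run_trainer_method.
REQUIRED_SECTIONS = ("trainer", "system")


def validate_sections(config: Mapping[str, Any],
                      required: Iterable[str] = REQUIRED_SECTIONS) -> None:
    """Raise one readable error listing every missing top-level section."""
    missing = [key for key in required if key not in config]
    if missing:
        found = ", ".join(sorted(config)) or "<none>"
        raise ValueError(
            "Config is missing required top-level section(s): "
            f"{', '.join(missing)} (found: {found}). For example, add a "
            "'trainer:' block with _target_: pytorch_lightning.Trainer."
        )


try:
    validate_sections({"system": {"batch_size": 512}})
except ValueError as err:
    print(err)
```

Collecting all missing keys before raising means a config lacking both sections is diagnosed in one run instead of one error at a time.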

salujajustin commented 3 months ago

Yes, I was confused by the final YAML under "Running this experiment with Lighter": the text said "We just combine the Trainer and LighterSystem into a single YAML and run the command in the terminal as shown", but the YAML shown was just the LighterSystem config. Regardless, before posting I did try to combine them (by defining a trainer in addition to the system) in the same way dosubot has suggested. However, I received the same ValueError as in the screenshot above, so I thought this was more of a systemic issue. (I'm going to use just Python 3.10 from here on, if that's OK.)

This time, however, running it with the --pre version gave me a new error:

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0rc2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "/home/justin/micromamba/envs/lighter-pre-p10/bin/lighter", line 5, in <module>
    from lighter.utils.cli import interface
  File "/home/justin/micromamba/envs/lighter-pre-p10/lib/python3.10/site-packages/lighter/__init__.py", line 5, in <module>
    _setup_logging()
  File "/home/justin/micromamba/envs/lighter-pre-p10/lib/python3.10/site-packages/lighter/utils/logging.py", line 72, in _setup_logging
    rich.traceback.install(show_locals=False, width=120, suppress=[__import__(name) for name in SUPPRESSED_MODULES])
  File "/home/justin/micromamba/envs/lighter-pre-p10/lib/python3.10/site-packages/lighter/utils/logging.py", line 72, in <listcomp>
    rich.traceback.install(show_locals=False, width=120, suppress=[__import__(name) for name in SUPPRESSED_MODULES])
  File "/home/justin/micromamba/envs/lighter-pre-p10/lib/python3.10/site-packages/monai/__init__.py", line 40, in <module>
    from .utils.module import load_submodules  # noqa: E402
  File "/home/justin/micromamba/envs/lighter-pre-p10/lib/python3.10/site-packages/monai/utils/__init__.py", line 18, in <module>
    from .deprecate_utils import DeprecatedError, deprecated, deprecated_arg, deprecated_arg_default
  File "/home/justin/micromamba/envs/lighter-pre-p10/lib/python3.10/site-packages/monai/utils/deprecate_utils.py", line 22, in <module>
    from monai.utils.module import version_leq
  File "/home/justin/micromamba/envs/lighter-pre-p10/lib/python3.10/site-packages/monai/utils/module.py", line 30, in <module>
    import torch
  File "/home/justin/micromamba/envs/lighter-pre-p10/lib/python3.10/site-packages/torch/__init__.py", line 1477, in <module>
    from .functional import *  # noqa: F403
  File "/home/justin/micromamba/envs/lighter-pre-p10/lib/python3.10/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/home/justin/micromamba/envs/lighter-pre-p10/lib/python3.10/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/home/justin/micromamba/envs/lighter-pre-p10/lib/python3.10/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/home/justin/micromamba/envs/lighter-pre-p10/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/home/justin/micromamba/envs/lighter-pre-p10/lib/python3.10/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
AttributeError: `np.string_` was removed in the NumPy 2.0 release. Use `np.bytes_` instead.

I'm guessing I shouldn't need to touch the code mentioned in dosubot's instructions 3 and 4, right?

ibro45 commented 3 months ago

I don't think this is a lighter issue; I think this error has to do with the environment (NumPy 2.0 seems to be a pre-release, so downgrade it to <2).
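The suggested workaround is to pin NumPy below 2 (for example pip install "numpy<2"). A quick sketch for checking which NumPy an environment actually resolved, using only the standard library's importlib.metadata; is_numpy_1x is a hypothetical helper, and splitting on the first dot assumes a conventional MAJOR.MINOR... version string.

```python
from importlib.metadata import PackageNotFoundError, version


def is_numpy_1x(ver: str) -> bool:
    """True for versions such as '1.26.4'; False for '2.0.0rc2' and later."""
    return int(ver.split(".")[0]) < 2


if __name__ == "__main__":
    try:
        installed = version("numpy")
    except PackageNotFoundError:
        print("numpy is not installed")
    else:
        if is_numpy_1x(installed):
            print(f"numpy {installed}: OK")
        else:
            print(f"numpy {installed}: consider pip install 'numpy<2'")
```

Reading the version from package metadata avoids importing numpy itself, which is exactly the import that fails in the traceback above.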

surajpaib commented 3 months ago

@ibro45 While this isn't a lighter issue, if this is people's first experience installing it with Python 3.10, that's not good.

I will look into this more, @salujajustin, and see where the issue lies. Our CI should be adapted so that these checks are always in place.

ibro45 commented 3 months ago

Ah yes, I forgot that the dependencies aren't installed separately but come with lighter, sorry.

surajpaib commented 3 months ago

@salujajustin

The latest pre-release, installed with pip install project-lighter --pre (version v0.0.1a29), should run the quickstart example now. I've also updated the example in the lighter docs. Keep us updated on how it goes.

salujajustin commented 3 months ago

Looks like that worked! Thanks again @surajpaib @ibro45 for your help. Side note: when installing with pip install project-lighter --pre on Python 3.8, I got version 0.0.2a27, not 0.0.1a29.

surajpaib commented 3 months ago

@salujajustin Thanks for checking. We've deprecated support for 3.8 now, as it reaches end-of-life (EOL) very soon.
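Since pip can resolve a different release depending on the running interpreter (for example through a package's Requires-Python metadata), it helps to record both the Python version and the installed project-lighter version in a bug report. A small standard-library sketch; installed_version and python_supported are hypothetical helpers, and the 3.9 minimum is an assumption inferred from 3.8 being dropped, not a documented requirement.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

# Assumption: 3.9 as the lowest supported interpreter once 3.8 is dropped.
MIN_PYTHON: Tuple[int, int] = (3, 9)


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist)
    except PackageNotFoundError:
        return None


def python_supported(info: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Compare (major, minor) against the assumed minimum."""
    return tuple(info) >= MIN_PYTHON


if __name__ == "__main__":
    py = ".".join(map(str, sys.version_info[:3]))
    print(f"Python {py} (supported: {python_supported()})")
    print(f"project-lighter {installed_version('project-lighter')}")
```

Running this under both interpreters would make a version mismatch like 0.0.2a27 versus 0.0.1a29 visible in a single paste.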