Command line parsing of *.json values #159

closed 2 years ago

commented 2 years ago

🐛 Bug report

Not sure if a bug or a feature, but when I call a script (e.g. PyTorch-Lightning CLI) with an argument like --dataset_path *.json, the parser reads the json file and interprets it as a configuration file (not as a dataset file in this case), and errors out because it is not a valid config file.

I can see there's documentation on parsing file paths, but cannot find any reference on reading them as string arguments.

Is it possible to disable this parsing? What other alternatives are there?

Thanks in advance

To reproduce

Create a data.json file containing some JSON dataset, then run the following script:

import os
from typing import List, Union, Optional

import pandas as pd
from jsonargparse import ArgumentParser

PathLike = Union[str, os.PathLike]

class DataModule:
    def __init__(
        self, data_path: Optional[Union[pd.DataFrame, List[PathLike], PathLike]] = None
    ):
        self.dataset = data_path

class SubDataModule(DataModule):
    def __init__(
        self, data_path: Optional[Union[pd.DataFrame, List[PathLike], PathLike]] = None
    ):
        super().__init__(data_path)

parser = ArgumentParser()

# as initialized in pytorch_lightning.utilities.cli.add_lightning_class_args
parser.add_subclass_arguments(DataModule, "data", fail_untyped=False, required=True)

cfg = parser.parse_args(
    ["--data.class_path=SubDataModule", "--data.data_path=data.json"]
)

cls = parser.instantiate_classes(cfg)

Errors out with:

usage: [-h] [ CLASS_PATH_OR_NAME]
                  --data CONFIG | CLASS_PATH_OR_NAME | .INIT_ARG_NAME VALUE
error: Parser key "data": Problem with given class_path "__main__.SubDataModule":
  - Configuration check failed :: Parser key "data_path": Value "{'columns': ['id', 'name'], 'data': [[0, 'John'], [1, 'Mary']]}" does not validate against any of the types in typing.Union[typing.List[typing.Union[str, os.PathLike]], str, os.PathLike, NoneType]:
    - Expected a List but got "{'columns': ['id', 'name'], 'data': [[0, 'John'], [1, 'Mary']]}"
    - Expected a <class 'str'> but got "{'columns': ['id', 'name'], 'data': [[0, 'John'], [1, 'Mary']]}"
    - Type <class 'os.PathLike'> expects: a class path (str); or a dict with a class_path entry; or a dict with init_args (if class path given previously). Got "{'columns': ['id', 'name'], 'data': [[0, 'John'], [1, 'Mary']]}".
    - Expected a <class 'NoneType'> but got "{'columns': ['id', 'name'], 'data': [[0, 'John'], [1, 'Mary']]}"
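For reference, the value echoed in the error suggests that data.json held a small table with "columns" and "data" keys. A minimal sketch to recreate such a file, assuming that content (the helper name write_dataset is only illustrative and not part of any library):

```python
import json
from pathlib import Path

# Content inferred from the value echoed in the error message (an assumption).
dataset = {"columns": ["id", "name"], "data": [[0, "John"], [1, "Mary"]]}


def write_dataset(path):
    """Write the sample dataset to `path` as JSON and return it as a Path."""
    path = Path(path)
    path.write_text(json.dumps(dataset))
    return path


if __name__ == "__main__":
    # Creates data.json in the current working directory.
    write_dataset("data.json")
```

Running this next to the script above should give the parser a valid JSON file to trip over.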

Curiously, it passes if I use a data.csv instead of JSON.

Expected behavior

The parser reads the arg value as a str.


commented 2 years ago

The parsing depends on the type hints. What is the type hint for dataset_path? If it is str, after parsing the value should be a string with whatever you have in the command line. If the type is a path like the docs you linked, the parsed value will not be str but a Path object.

commented 2 years ago

If you don't have type hints, then add them. This is how the parser knows how to validate. LightningCLI is configured such that when there is no type hint, the type defaults to Any. If the type is Any, the parser cannot know whether the value should be the path to a JSON file or the file's contents.

commented 2 years ago

In my LightningDataModule, the type hint for dataset is

dataframe_or_data_path: Optional[
    Union[pd.DataFrame, List[PathLike], PathLike]
] = None

where PathLike = Union[str, os.PathLike].

Could this be an issue?

commented 2 years ago

With the following:

import os
from jsonargparse import ArgumentParser
from typing import Any, List, Union

parser = ArgumentParser()
PathLike = Union[str, os.PathLike]
parser.add_argument('--path', type=Union[List[PathLike], PathLike])
cfg = parser.parse_args(['--path=issue_159.json'])

The result is correct: Namespace(path='issue_159.json').

os.PathLike is not currently supported, but since the union includes str there is no issue in this case. Support for os.PathLike can be added. Still, this does not explain what you originally described. You will need to post a minimal reproducible script.

commented 2 years ago

@mauvilsa I updated the description with the code to reproduce. I have initialized the arguments with .add_subclass_arguments() as in pytorch-lightning v1.6.5.

commented 2 years ago

@LourencoVazPato thank you for adding the reproduction code. Unfortunately I have been unable to reproduce it. First I tried in a normal virtual environment and then I tried with poetry to be as close as possible to what you reported. On Ubuntu 20.04 I get:

$ poetry run python 
The currently activated Python version 3.8.10 is not supported by the project (^3.9).
Trying to find and use a compatible version. 
Using python3.9 (3.9.13)
Namespace(data=Namespace(class_path='__main__.SubDataModule', init_args=Namespace(data_path='data.json')))
Namespace(data=<__main__.SubDataModule object at 0x7f7c5f185c70>)

The pyproject.toml is the following:

[tool.poetry]
name = "issue-159"
version = "0.1.0"
description = ""
authors = []
readme = ""
packages = [{include = "issue_159"}]

[tool.poetry.dependencies]
python = "^3.9"
jsonargparse = "4.13.0"
pandas = "^1.4.4"
pytorch-lightning = "1.6.5"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

commented 2 years ago

@mauvilsa I've been able to reproduce with this poetry env:

[tool.poetry]
name = "issue-159"
version = "0.1.0"
description = ""
authors = []

[tool.poetry.dependencies]
python = "^3.9"
jsonargparse = {extras = ["signatures"], version = "4.13.0"}
pandas = "^1.4.4"
pytorch-lightning = "1.6.5"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

If you want to reproduce my python env here's the poetry lock.

commented 2 years ago

I made a silly mistake when trying to reproduce. I have now been able to reproduce it in a normal virtual environment with any Python version. I have pushed the fix in commit