huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Consider using "Sequence" instead of "List" #5354

Open tranhd95 opened 1 year ago

tranhd95 commented 1 year ago

Feature request

Hi, please consider using the Sequence type annotation instead of List for function arguments such as those of Dataset.from_parquet(). The current annotation leads to type-checking errors, see below.

How to reproduce

from datasets import Dataset

list_of_filenames = ["foo.parquet", "bar.parquet"]
ds = Dataset.from_parquet(list_of_filenames)

Expected mypy output:

Success: no issues found

Actual mypy output:

test.py:19: error: Argument 1 to "from_parquet" of "Dataset" has incompatible type "List[str]"; expected "Union[Union[str, bytes, PathLike[Any]], List[Union[str, bytes, PathLike[Any]]]]"  [arg-type]
test.py:19: note: "List" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
test.py:19: note: Consider using "Sequence" instead, which is covariant

Env: mypy 0.991, Python 3.10.0, datasets 2.7.1
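The mypy notes above boil down to List being invariant in its type parameter, while Sequence is covariant. A minimal, self-contained illustration (the function names here are hypothetical, not from datasets):

```python
from typing import List, Sequence


def takes_list(xs: List[object]) -> int:
    # List is invariant: mypy rejects a List[str] argument here,
    # even though str is a subtype of object.
    return len(xs)


def takes_seq(xs: Sequence[object]) -> int:
    # Sequence is covariant: a List[str] argument type-checks fine.
    return len(xs)


strs: List[str] = ["foo.parquet", "bar.parquet"]
takes_list(strs)  # mypy: error [arg-type]; runs fine at runtime
takes_seq(strs)   # mypy: Success
```

At runtime both calls succeed; the difference only matters to static type checkers.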

mariosasko commented 1 year ago

Hi! Linking a comment to provide more info on the issue: https://stackoverflow.com/a/39458225. This means we should replace most (if not all) occurrences of List with Sequence in function signatures.

@tranhd95 Would you be interested in submitting a PR?
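A sketch of the proposed change, using a simplified stand-in for the real from_parquet signature (the function names here are hypothetical, and the bodies only exist to make the sketch runnable):

```python
from os import PathLike
from typing import List, Sequence, Union

PathLikeT = Union[str, bytes, PathLike]


# Current style: invariant List inside the union.
def from_parquet_before(path_or_paths: Union[PathLikeT, List[PathLikeT]]) -> int:
    if isinstance(path_or_paths, (str, bytes, PathLike)):
        return 1
    return len(path_or_paths)


# Proposed style: covariant Sequence, which accepts List[str], tuples, etc.
def from_parquet_after(path_or_paths: Union[PathLikeT, Sequence[PathLikeT]]) -> int:
    if isinstance(path_or_paths, (str, bytes, PathLike)):
        return 1
    return len(path_or_paths)


files: List[str] = ["foo.parquet", "bar.parquet"]
# mypy flags from_parquet_before(files) with [arg-type],
# but accepts from_parquet_after(files).
```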

dantema commented 1 year ago

Hi all! I tried to reproduce this issue, but it didn't fail for me. Also, I noticed that the variables in your example have different names (list_of_filenames and list_of_files); could this be related?

# I found random data in parquet format:
!wget "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata1.parquet"
!wget "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata2.parquet"

# Then I tried to reproduce:
from datasets import Dataset

list_of_files = ["userdata1.parquet", "userdata2.parquet"]
ds = Dataset.from_parquet(list_of_files)

My output:

WARNING:datasets.builder:Using custom data configuration default-e287d097dc54e046
Downloading and preparing dataset parquet/default to /root/.cache/huggingface/datasets/parquet/default-e287d097dc54e046/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data files: 100%
1/1 [00:00<00:00, 40.38it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 23.43it/s]
Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/default-e287d097dc54e046/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.

P.S. This is my first experience with open source, so please bear with me if I've misunderstood something.

tranhd95 commented 1 year ago

@dantema There is indeed a typo in the variable names. Apologies if I wasn't clear, though: the output I posted is from the mypy type checker, not from running the code. The snippet runs without issues; the problem is only with static type checking.

tranhd95 commented 1 year ago

However, I found out that the type annotation is actually misleading. The from_parquet method should also accept a list of PathLike objects (i.e., anything implementing os.PathLike, such as pathlib.Path). But if I run the code snippet below, an exception is thrown.

Code

from pathlib import Path

from datasets import Dataset

list_of_filenames = [Path("foo.parquet"), Path("bar.parquet")]
ds = Dataset.from_parquet(list_of_filenames)

Output

/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py in from_parquet(path_or_paths, split, features, cache_dir, keep_in_memory, columns, **kwargs)
   1071         from .io.parquet import ParquetDatasetReader
   1072 
-> 1073         return ParquetDatasetReader(
   1074             path_or_paths,
   1075             split=split,

/usr/local/lib/python3.8/dist-packages/datasets/io/parquet.py in __init__(self, path_or_paths, split, features, cache_dir, keep_in_memory, streaming, **kwargs)
     35         path_or_paths = path_or_paths if isinstance(path_or_paths, dict) else {self.split: path_or_paths}
     36         hash = _PACKAGED_DATASETS_MODULES["parquet"][1]
---> 37         self.builder = Parquet(
     38             cache_dir=cache_dir,
     39             data_files=path_or_paths,

/usr/local/lib/python3.8/dist-packages/datasets/builder.py in __init__(self, cache_dir, config_name, hash, base_path, info, features, use_auth_token, repo_id, data_files, data_dir, name, **config_kwargs)
    298 
    299         if data_files is not None and not isinstance(data_files, DataFilesDict):
--> 300             data_files = DataFilesDict.from_local_or_remote(
    301                 sanitize_patterns(data_files), base_path=base_path, use_auth_token=use_auth_token
    302             )

/usr/local/lib/python3.8/dist-packages/datasets/data_files.py in from_local_or_remote(cls, patterns, base_path, allowed_extensions, use_auth_token)
    794         for key, patterns_for_key in patterns.items():
    795             out[key] = (
--> 796                 DataFilesList.from_local_or_remote(
    797                     patterns_for_key,
    798                     base_path=base_path,

/usr/local/lib/python3.8/dist-packages/datasets/data_files.py in from_local_or_remote(cls, patterns, base_path, allowed_extensions, use_auth_token)
    762     ) -> "DataFilesList":
    763         base_path = base_path if base_path is not None else str(Path().resolve())
--> 764         data_files = resolve_patterns_locally_or_by_urls(base_path, patterns, allowed_extensions)
    765         origin_metadata = _get_origin_metadata_locally_or_by_urls(data_files, use_auth_token=use_auth_token)
    766         return cls(data_files, origin_metadata)

/usr/local/lib/python3.8/dist-packages/datasets/data_files.py in resolve_patterns_locally_or_by_urls(base_path, patterns, allowed_extensions)
    357     data_files = []
    358     for pattern in patterns:
--> 359         if is_remote_url(pattern):
    360             data_files.append(Url(pattern))
    361         else:

/usr/local/lib/python3.8/dist-packages/datasets/utils/file_utils.py in is_remote_url(url_or_filename)
     62 
     63 def is_remote_url(url_or_filename: str) -> bool:
---> 64     parsed = urlparse(url_or_filename)
     65     return parsed.scheme in ("http", "https", "s3", "gs", "hdfs", "ftp")
     66 

/usr/lib/python3.8/urllib/parse.py in urlparse(url, scheme, allow_fragments)
    373     Note that we don't break the components up in smaller bits
    374     (e.g. netloc is a single string) and we don't expand % escapes."""
--> 375     url, scheme, _coerce_result = _coerce_args(url, scheme)
    376     splitresult = urlsplit(url, scheme, allow_fragments)
    377     scheme, netloc, url, query, fragment = splitresult

/usr/lib/python3.8/urllib/parse.py in _coerce_args(*args)
    125     if str_input:
    126         return args + (_noop,)
--> 127     return _decode_args(args) + (_encode_result,)
    128 
    129 # Result objects are more helpful than simple tuples

/usr/lib/python3.8/urllib/parse.py in _decode_args(args, encoding, errors)
    109 def _decode_args(args, encoding=_implicit_encoding,
    110                        errors=_implicit_errors):
--> 111     return tuple(x.decode(encoding, errors) if x else '' for x in args)
    112 
    113 def _coerce_args(*args):

/usr/lib/python3.8/urllib/parse.py in <genexpr>(.0)
    109 def _decode_args(args, encoding=_implicit_encoding,
    110                        errors=_implicit_errors):
--> 111     return tuple(x.decode(encoding, errors) if x else '' for x in args)
    112 
    113 def _coerce_args(*args):

AttributeError: 'PosixPath' object has no attribute 'decode'
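Until PathLike inputs are handled internally, one workaround is to stringify the paths before passing them to from_parquet. A minimal sketch of the conversion (it doesn't depend on datasets itself):

```python
import os
from pathlib import Path

list_of_filenames = [Path("foo.parquet"), Path("bar.parquet")]
# os.fspath converts any os.PathLike object to its plain string form,
# which is the input shape from_parquet handles today.
as_strings = [os.fspath(p) for p in list_of_filenames]
# as_strings == ["foo.parquet", "bar.parquet"]
```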

@mariosasko Should I create a new issue?

avinashsai commented 1 year ago

@mariosasko I would like to take this issue up.

mariosasko commented 1 year ago

@avinashsai Hi, I've assigned you the issue.

@tranhd95 Yes, feel free to report this in a new issue.

Unknown3141592 commented 1 year ago

@avinashsai Are you still working on this? If not I would like to give it a try.

UNCANNY69 commented 11 months ago

@mariosasko I would like to take this issue up!

ojuschugh1 commented 9 months ago

Hi @tranhd95 @mariosasko, I hope you are all doing well.

I am interested in this issue; is it still open and unresolved?

Thanks and Regards

hritikranjan1 commented 2 weeks ago

@mariosasko I would like to take this issue up.