scipp / sciline

Build scientific pipelines for your data
https://scipp.github.io/sciline/
BSD 3-Clause "New" or "Revised" License
9 stars 2 forks source link

Supporting different file handle types is too difficult #140

Open jl-wynen opened 4 months ago

jl-wynen commented 4 months ago

In the context of https://github.com/scipp/essreduce/issues/8 we considered supporting opening files once and passing the file handle to the various loaders. Supporting this is tricky because there are many different file handle types:

open('..', 'rb', buffering=True) -> io.BufferedReader
open('..', 'rb', buffering=False) -> io.FileIO
io.BytesIO(...) -> io.BytesIO

So supporting these with the current system requires

File = NewType('File', io.FileIO)
BufferedFile = NewType('BufferedFile', io.BufferedReader)
Buffer = NewType('Buffer', io.BytesIO)

def load(file: Union[File, BufferedFile, Buffer]) -> X: ...

with open('..', 'rb') as f:
   sciline.Pipeline([load], params={BufferedFile: BufferedFile(f)

where the param key needs to be adjusted for the concrete type of file handle. This is bad UX and Python has a workaround. For normal functions, we would annotate them as

def ideal_load(file: typing.BinaryIO) -> X: ...

but that is not possible because Pipeline requires an exact type match for parameters, i.e.

File = NewType('File', typing.BinaryIO)
with open('..', 'rb') as f:
   sciline.Pipeline([load], params={File: File(f)

fails.

This gets exasperated when load needs to also support a file path and maybe an open h5py.File or scippnexus.File.

I don't know a good solution at this point. But automatic type conversion in parameters might help (as proposed as part of #139).

SimonHeybrock commented 4 months ago

Maybe the type check in __setitem__ is not a good idea after all? Or would it be possible to relax the check to a base class or protocol?

I have also seen trouble with Python 3.12 type aliases in relation to that, but I need to try that again as I do not remember the details.

YooSunYoung commented 4 months ago

Maybe the type check in setitem is not a good idea after all? Or would it be possible to relax the check to a base class or protocol?

I think checking if the type if a subclass of the desired type should be enough. It's also compatible with static type checkers. Like...

NewInt = NewType("NewInt", int)

a: int = NewInt(1)  # pass the static type check
b: NewInt = 1  # does not pass the static type check

Hmm... but then the Union still won't work...

jl-wynen commented 4 months ago

The IO objects are a bit strange. typing.BinaryIO is an abstract class. But the types in io don't inherit from it. I don't know how type checkers handle them. But a simple check for subclasses or implementations of protocols is not enough. Unless we define our own protocol.

YooSunYoung commented 4 months ago

The IO objects are a bit strange. typing.BinaryIO is an abstract class. But the types in io don't inherit from it. I don't know how type checkers handle them. But a simple check for subclasses or implementations of protocols is not enough. Unless we define our own protocol.

Hmm.... Maybe we can define our own protocol/meta class in this particular case....

MridulS commented 3 months ago

Some reading material: https://github.com/python/typing/discussions/829

SimonHeybrock commented 1 month ago

The isinstance checks have been removed. Should this be closed, or is there more relevant discussion here (e.g., relating to passing mypy?).

jl-wynen commented 1 month ago

I would appreciate some advice on how to implement the affected functions. Should we support str | os.PathLike[str] | BytesIO | StringIO | ... but define the domain type as an alias of str? If so, then the loader provider will not pass mypy. And mypy won't inform us if we use the 'path' incorrectly, e.g., call open(filename) which does not work with file-like objects.

SimonHeybrock commented 1 month ago

I would appreciate some advice on how to implement the affected functions. Should we support str | os.PathLike[str] | BytesIO | StringIO | ...

Can't we define that as the domain type?

but define the domain type as an alias of str? If so, then the loader provider will not pass mypy. And mypy won't inform us if we use the 'path' incorrectly, e.g., call open(filename) which does not work with file-like objects.

jl-wynen commented 1 month ago

I would appreciate some advice on how to implement the affected functions. Should we support str | os.PathLike[str] | BytesIO | StringIO | ...

Can't we define that as the domain type?

This is not allowed:

from typing import NewType
from pathlib import Path
Filename = NewType("Filename", str | Path)

gives this error with mypy:

Argument 2 to NewType(...) must be subclassable (got "str | Path")  [valid-newtype]

See https://mypy.readthedocs.io/en/stable/error_code_list.html#check-the-target-of-newtype-valid-newtype

Using a type alias does not work either:

import sciline as sl
from typing import TypeAlias
from pathlib import Path
Filename: TypeAlias = str | Path
pl = sl.Pipeline([], params={Filename: "test.txt"})

mypy complains with (for the params of Pipeline)

Dict entry 0 has incompatible type "<typing special form>": "str"; expected "type[Any]": "Any"  [dict-item]
SimonHeybrock commented 1 month ago

Does it work with Python 3.12 type aliases?

jl-wynen commented 1 month ago

Not yet:

error: PEP 695 type aliases are not yet supported  [valid-type]
jl-wynen commented 1 month ago

But in any case, waiting for that would mean that all our projects fail type checks for the next 2 years until 3.12 becomes our minimum Python version.