pydantic / pydantic-extra-types

Extra Pydantic types.

Support numpy types #31

Open Kludex opened 1 year ago

Kludex commented 1 year ago

The idea here is to support the numpy types mentioned on https://numpy.org/doc/stable/reference/arrays.scalars.html.

yezz123 commented 1 year ago

+1

Kludex commented 1 year ago

Actually, numpy has a pretty big artifact. Should we create pydantic-numpy-types? 😅

yezz123 commented 1 year ago

> Actually, numpy has a pretty big artifact. Should we create pydantic-numpy-types? 😅

Can't deny the same applies to https://github.com/pydantic/pydantic-extra-types/issues/24, but I guess it would be good to have all of the types in one package!

Kludex commented 1 year ago

Maybe an "all" extra can be created so people can do pip install pydantic-extra-types[all], or a "numpy" extra. Or... just don't install numpy by default, and show a message like "You need to install numpy".

yezz123 commented 1 year ago

> Maybe an "all" extra can be created so people can do pip install pydantic-extra-types[all], or a "numpy" extra. Or... just don't install numpy by default, and show a message like "You need to install numpy".

I agree with both options: either display a message that prompts the user to install numpy, or go with the extra requirements, since that works properly in our current setup. This would also make it possible to add any big package as an extra.
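For illustration, here is a minimal sketch of the "show a message" option, assuming a hypothetical pydantic_extra_types.numpy module (the module name and error text are made up for this example):

    # pydantic_extra_types/numpy.py (hypothetical module for this sketch)
    try:
        import numpy as np
    except ModuleNotFoundError as exc:  # pragma: no cover
        raise ModuleNotFoundError(
            'The numpy types require "numpy" to be installed. '
            'You can install it with "pip install numpy" or "pip install pydantic-extra-types[numpy]".'
        ) from exc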

kroncatti commented 1 year ago

Hey folks,

Just so I understand our objective with the numpy type here a little better. I imagine you intend to have something like this (correct me if I am talking nonsense):

    Numpy(value=np.float64(12))

Do we want to have some conversions as well? My first idea was to validate using isinstance(value, np.generic). However, running isinstance(12, np.generic) returns False (the same happens with a Python float). Would we want to convert those elements to numpy when someone invokes the type?

The reason why I am asking this is the following: if we want to do those conversions we would probably need to make some decisions (e.g., we have int32, int64, and so on).

If my reasoning is not fair, just let me know 😄

lig commented 1 year ago

@kroncatti Pydantic has two modes of parsing/validation: strict and lax. In lax mode it already coerces types in many cases such as this:

    from pydantic import BaseModel

    class Model(BaseModel):
        foo: int

    m = Model(foo='12')
    print(m)
    # foo=12
    print(type(m.foo))
    # <class 'int'>

So, it seems natural for Pydantic to coerce values to numpy types on validation.
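For contrast, a short sketch of strict mode with the same model (pydantic v2's model_validate accepts a strict flag); in strict mode the string '12' is rejected instead of coerced:

    from pydantic import BaseModel, ValidationError

    class Model(BaseModel):
        foo: int

    try:
        Model.model_validate({'foo': '12'}, strict=True)
    except ValidationError as err:
        # In strict mode the string is not coerced to int, so validation fails.
        print(err)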

kroncatti commented 1 year ago

Thanks @lig,

Just to check that I understood properly. If we define:

    class Model(BaseModel):
        foo: Numpy

    m = Model(foo='12')
    print(m)
    # foo=12
    print(type(m.foo))
    # <class 'numpy.int64'>

Is this the intended outcome?

Shouldn't the user have to specify which numpy type we are going to coerce to, such as int64, float64, etc.?

Kludex commented 1 year ago

I think the idea here is to support the following types: https://numpy.org/doc/stable/user/basics.types.html#array-types-and-conversions-between-types

We should create types for each of those, and all the analogous ones.

I guess we also want np.array, np.datetime64, and others.

kroncatti commented 1 year ago

That makes sense. So we are basically creating one extra type for each of those types instead of having a generic type for all of them. Cool!
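As a rough illustration of the "one type per numpy scalar" idea (this is only a sketch; the alias names and the Annotated-based approach are not an agreed design), such types could be built from Annotated metadata:

    from typing import Annotated

    import numpy as np
    from pydantic import AfterValidator, BaseModel, Field

    # Range check via Field, then coercion to the numpy scalar in an after-validator.
    NumpyInt8 = Annotated[int, Field(ge=-128, le=127), AfterValidator(lambda v: np.int8(v))]
    NumpyUInt8 = Annotated[int, Field(ge=0, le=255), AfterValidator(lambda v: np.uint8(v))]
    NumpyFloat32 = Annotated[float, AfterValidator(lambda v: np.float32(v))]

    class Model(BaseModel):
        foo: NumpyInt8

    print(type(Model(foo='12').foo))  # <class 'numpy.int8'>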

frenki123 commented 1 year ago

Hey guys, I want to help with this topic, but first I want to check that I understand how I should create these new types. My idea is something like the code below. The only problem is that in strict mode validation passes with int and not with numpy.int8 (probably because I am using int_schema). Do you maybe have some other ideas?

    from typing import Any

    import numpy
    from pydantic import GetCoreSchemaHandler
    from pydantic_core import core_schema

    class NumPyInt8(numpy.int8):
        """
        A numpy.int8 type. The range is between -128 and 127.
        """
        min_value: int = -128
        max_value: int = 127

        @classmethod
        def __get_pydantic_core_schema__(cls, source: type[Any], handler: GetCoreSchemaHandler) -> core_schema.CoreSchema:
            return core_schema.general_after_validator_function(
                cls._transform,
                core_schema.int_schema(le=cls.max_value, ge=cls.min_value)
            )

        @classmethod
        def _transform(cls, scalar: int, _: core_schema.ValidationInfo) -> numpy.int8:
            return numpy.int8(scalar)

frenki123 commented 1 year ago

Hello again, @yezz123 and @Kludex, can you take a look at my fork for numpy integers? https://github.com/frenki123/pydantic-extra-types/tree/numpy-int-types

Maybe it can be added to the main branch as a start for support of numpy types.

GuillaumeQuenneville commented 1 year ago

One of the issues to deal with is that JSON cannot natively represent all the dtypes of numpy arrays without extra context. So we would have to choose a reasonable default schema for the serializing/deserializing to be isomorphic. Namely, we need an opinionated default for representing arrays of complex numbers. There are of course many valid options, so having it be easily overridable would also be useful.

That being said, the way we have been solving this (as the ASE project does as well) is with

    # flatobj is the flattened array, e.g. obj.ravel()
    {'__ndarray__': (
        obj.shape,
        str(obj.dtype),
        flatobj.tolist()
    )}

The serializing / deserializing logic is in the file I linked.
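As a self-contained sketch of that scheme (the function names here are made up; the linked implementation differs in its details), round-tripping an array through JSON could look like this:

    import json

    import numpy as np

    def encode_ndarray(obj: np.ndarray) -> dict:
        # Record shape, dtype name, and the flattened data so the array can be rebuilt.
        # (Complex dtypes would still need an agreed representation, as noted above.)
        return {'__ndarray__': (obj.shape, str(obj.dtype), obj.ravel().tolist())}

    def decode_ndarray(data: dict) -> np.ndarray:
        shape, dtype, flat = data['__ndarray__']
        return np.array(flat, dtype=dtype).reshape(shape)

    arr = np.arange(6, dtype=np.float32).reshape(2, 3)
    restored = decode_ndarray(json.loads(json.dumps(encode_ndarray(arr))))
    assert np.array_equal(arr, restored) and arr.dtype == restored.dtype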

caniko commented 1 year ago

I suggest you check out my project, pydantic-numpy.

rbavery commented 11 months ago

from https://github.com/pydantic/pydantic/issues/7980

I wonder if it makes more sense to integrate the logic you've designed in pydantic-numpy into pydantic-extra-types.

I'd personally prefer that pydantic/pydantic-extra-types natively support numpy types.

pydantic-numpy requires saving arrays to files rather than serializing and deserializing the numpy types themselves. I think this discussion raises some good points: https://github.com/pydantic/pydantic/discussions/4964

My use case is that I want to define a metadata standard for ML models that take large and complex arrays. The metadata just needs to record the numpy type, the order of labeled dimensions, and the shape. Saving out the whole array to load into the Numpy Model isn't preferable; it would be unnecessary storage and susceptible to path errors.
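A minimal sketch of that kind of metadata-only model (the model and field names here are illustrative, not a proposed standard):

    from pydantic import BaseModel

    class ArrayMetadata(BaseModel):
        # Record the dtype, labeled dimension order, and shape without storing the array itself.
        dtype: str              # e.g. "float32"
        dim_order: list[str]    # e.g. ["batch", "channel", "height", "width"]
        shape: tuple[int, ...]  # e.g. (1, 3, 224, 224)

    meta = ArrayMetadata(dtype='float32', dim_order=['batch', 'channel', 'height', 'width'], shape=(1, 3, 224, 224))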

caniko commented 11 months ago

The co-author of pydantic-numpy here.

> My use case is that I want to define a metadata standard for ML models that take large and complex arrays.

If you only want validation and dimension enforcement, just use pydantic_numpy.types; it is compatible with pydantic.BaseModel, even pydantic.dataclass.

> pydantic-numpy requires saving arrays to files rather than serializing and deserializing the numpy types themselves.

@rbavery your statement is false: interaction with numpy files (saving/loading) is an optional quality-of-life feature, and is only offered with NumpyModel; you can ignore this feature in your described case.

Please be careful when you make these claims; I'd rather have you ask the question than make a claim if you are uncertain.

Update: I wanted to share some updates and thoughts regarding the pydantic-numpy project. Here's what we're looking into:

Refactoring of pydantic-numpy.typing: We're giving the submodule a minor overhaul. The key change is the transition to automatically generated code. This shift is essential since dynamically generated typing hints aren't compatible with static type checkers like MyPy and PyRight. The refactor is compatible with static type checkers, and the code generator's script is available in the repository for reference.

Introducing NumpyModel for File IO: We've added a NumpyModel that supports integrated file IO operations. This should streamline processes that involve NumPy data handling.

Comprehensive Testing for Coverage: We've also put a significant effort into extensive testing to ensure robust coverage and reliability.

Regarding the integration of pydantic-numpy with this repository, I propose keeping them separate for the following reasons:

Complexity Management: Merging pydantic-numpy into this repository would significantly increase the complexity of both codebases. Our goal is to maintain simplicity and clarity in our projects.

Community Feedback: I'm aware of the requests to incorporate NumPy types directly into this repository. While I understand the perspective, I believe maintaining separation is in our best interest for streamlined development and maintenance.

Documentation Update: To make pydantic-numpy more discoverable for those who need it, we're considering adding a small dedicated section about it in the Pydantic documentation.

I hope this update aligns with your goals, and I look forward to your thoughts and feedback.

rbavery commented 11 months ago

Whoops, sorry @caniko, my bad. I read the README incorrectly. Thanks for the correction.

honglei commented 4 months ago

I want something like this:

    from numpy import uint16, uint32
    from pydantic import BaseModel, Field

    class FeederArc(BaseModel):
        satID: uint16
        satelliteAntenna: uint16 = Field(ge=1, le=256)
        baseStationFeeder: uint32

caniko commented 4 months ago

That looks very strange. Why would you use these types outside of arrays? Python int would be much better here.

honglei commented 4 months ago

@caniko we need some types like uint16 and uint32, but pydantic does not define them.

caniko commented 4 months ago

Those types are not Python primitives, and precision up- or down-casting is done for you when you use int. You are not gaining any benefit from using these types, not even saving memory, because of the Python object overhead around numpy scalar types.

If you want non-negative values within a certain range, use pydantic.AfterValidator with a function that raises an error when the value is outside the range.
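A minimal sketch of that suggestion, assuming a uint16-style range (the alias and helper names are made up for the example):

    from typing import Annotated

    from pydantic import AfterValidator, BaseModel

    def _check_uint16_range(v: int) -> int:
        # Reject values outside the uint16 range [0, 65535].
        if not 0 <= v <= 65535:
            raise ValueError('value must be in the range [0, 65535]')
        return v

    Uint16Like = Annotated[int, AfterValidator(_check_uint16_range)]

    class FeederArc(BaseModel):
        satID: Uint16Like

    FeederArc(satID=70000)  # raises ValidationError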

juliuslipp commented 4 months ago

@caniko, if you combine using numpy types in pydantic with ORJSON, which supports numpy, you can significantly reduce JSON payload sizes. For instance, in our embeddings API, using numpy with float32 instead of Python's default float64 reduces the payload size for 254 embeddings from 1.73MB to 1.17MB.

caniko commented 4 months ago

I would suggest that you do the filtering inside the arrays before moving the data into models. Numpy is perfect for this, and you should adhere to this idiom.

I don't see the value of adding Pydantic support for these kinds of operations; just do orjson serialization from the numpy types, with arbitrary_types_allowed set to True in your models. Edit: you must also follow the necessary steps to activate numpy serialization.
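A sketch of that approach (the model here is made up for illustration), relying on pydantic's arbitrary_types_allowed and orjson's OPT_SERIALIZE_NUMPY option:

    import numpy as np
    import orjson
    from pydantic import BaseModel, ConfigDict

    class Embeddings(BaseModel):
        # arbitrary_types_allowed lets the model hold np.ndarray without a custom pydantic type.
        model_config = ConfigDict(arbitrary_types_allowed=True)
        vectors: np.ndarray

    m = Embeddings(vectors=np.random.rand(254, 768).astype(np.float32))

    # Python-mode model_dump() leaves the ndarray untouched; orjson with
    # OPT_SERIALIZE_NUMPY then serializes numpy arrays and scalars natively.
    # float32 data yields a noticeably smaller JSON payload than float64.
    payload = orjson.dumps(m.model_dump(), option=orjson.OPT_SERIALIZE_NUMPY)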

If you care that much about payload size, why use Python? Why not use Go for the API, and have pub/sub to your Python AI/ML service? Genuinely curious, because you are trying to optimize for things Python was never designed for.