Kludex opened this issue 1 year ago
+1
Actually, `numpy` has a pretty big artifact. Should we create `pydantic-numpy-types`? 😅
Can't deny the same regarding https://github.com/pydantic/pydantic-extra-types/issues/24, but I guess it can be good if we have all of the types in one package!
Maybe the extra `all` can be created so people can do `pip install pydantic-extra-types[all]`, or the extra `numpy`. Or... just don't install `numpy`, and show a message like "You need to install numpy".
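For illustration, a minimal sketch of the "show a message" option (a hypothetical module layout, not actual pydantic-extra-types code):

```python
try:
    import numpy
except ImportError:  # numpy is an optional dependency
    numpy = None


def _require_numpy() -> None:
    # Raise a helpful error only when a numpy-backed type is actually used.
    if numpy is None:
        raise ImportError(
            'numpy is required to use the numpy types; '
            'install it with `pip install pydantic-extra-types[numpy]`'
        )
```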
I agree with both options: either displaying a message that prompts the user to install NumPy, or proceeding with the extra requirements since that works well in our current scenario. This would also allow any big package to be added as an extra.
Hey folks,
Just so I understand our objective with the `numpy` type here a little bit better: I imagine you intend to have something like this (correct me if I am talking nonsense):

```python
Numpy(value=np.float64(12))
```

Do we want to have some conversions as well? My first idea was to validate using `isinstance(value, np.generic)`. However, when running `isinstance(12, np.generic)` the result is `False` (the same happens with a Python float). Would we want to convert those elements to numpy when someone invokes the type?
The reason I am asking is the following: if we want to do those conversions, we would probably need to make some decisions (e.g., we have `int32`, `int64`, and so on).
If my reasoning is off, just let me know 😄
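For reference, a quick snippet illustrating the `isinstance` behaviour described above:

```python
import numpy as np

print(isinstance(np.float64(12), np.generic))  # True: a numpy scalar
print(isinstance(12, np.generic))              # False: plain Python int
print(isinstance(12.0, np.generic))            # False: plain Python float
```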
@kroncatti Pydantic has two modes of parsing/validation: strict and lax. In lax mode it already coerces types in many cases such as this:
```python
from pydantic import BaseModel

class Model(BaseModel):
    foo: int

m = Model(foo='12')
print(m)
# foo=12
print(type(m.foo))
# <class 'int'>
```
So, it seems natural for Pydantic to coerce values to numpy types on validation.
Thanks @lig,
Just to check whether I understood properly. If we set:
```python
class Model(BaseModel):
    foo: Numpy

m = Model(foo='12')
print(m)
# foo=12
print(type(m.foo))
# <class 'numpy.int64'>
```
is this the expected outcome?
Shouldn't the user have to specify which numpy type we are going to coerce to, such as `int64`, `float64`, etc.?
I think the idea here is to support the following types: https://numpy.org/doc/stable/user/basics.types.html#array-types-and-conversions-between-types
We should create the following, and all the analogous types:

- `pydantic_extra_types.NumPyFloatHalf` / `pydantic_extra_types.NumPyFloat16`
- `pydantic_extra_types.NumPySingle`
- `pydantic_extra_types.NumPyDouble`

I guess we also want `np.array`, `np.datetime64`, and others.
That makes sense, so we are basically creating one extra type for each of those types instead of having a generic type for all of them. Cool!
Hey guys, I want to help with this topic, but first I want to check whether I understand how I should create these new types. My idea is something like the code below. The only problem is that in strict mode, validation passes with `int` and not with `numpy.int8` (probably because I am using `int_schema`). Do you maybe have some other ideas?
```python
from typing import Any

import numpy
from pydantic import GetCoreSchemaHandler
from pydantic_core import core_schema


class NumPyInt8(numpy.int8):
    """A numpy.int8 type. The range is between -128 and 127."""

    min_value: int = -128
    max_value: int = 127

    @classmethod
    def __get_pydantic_core_schema__(cls, source: type[Any], handler: GetCoreSchemaHandler) -> core_schema.CoreSchema:
        return core_schema.general_after_validator_function(
            cls._transform,
            core_schema.int_schema(le=cls.max_value, ge=cls.min_value),
        )

    @classmethod
    def _transform(cls, scalar: int, _: core_schema.ValidationInfo) -> numpy.int8:
        return numpy.int8(scalar)
```
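For illustration, one possible direction for the strict-mode issue, reusing the names from the snippet above (an untested sketch, not a confirmed fix): build the Python-mode schema as a union that also accepts an existing `numpy.int8` instance, while JSON input keeps the plain integer schema.

```python
# Sketch: swap this in for the plain int_schema above so strict mode can also
# accept an actual numpy.int8 instance (JSON input is still validated as int).
int_like = core_schema.int_schema(le=cls.max_value, ge=cls.min_value)
schema = core_schema.json_or_python_schema(
    json_schema=int_like,
    python_schema=core_schema.union_schema(
        [core_schema.is_instance_schema(numpy.int8), int_like]
    ),
)
return core_schema.general_after_validator_function(cls._transform, schema)
```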
Hello again, @yezz123 and @Kludex, can you take a look at my fork for numpy integers? https://github.com/frenki123/pydantic-extra-types/tree/numpy-int-types
Maybe it can be added to the main branch as a starting point for numpy type support.
One of the issues to deal with is that JSON cannot natively represent all of the dtypes for numpy arrays without extra context. So we would have to choose a reasonable default schema for the serializing/deserializing to be isomorphic. Namely, we need an opinionated default for representing arrays of complex numbers. There are of course many valid options, so having it be easily overridable would also be useful.
That being said, the way we have been solving this, as the ASE project does as well, is with
```python
{'__ndarray__': (
    obj.shape,
    str(obj.dtype),
    flatobj.tolist()
)}
```
The serializing / deserializing logic is in the file I linked.
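For illustration, a minimal sketch of encode/decode helpers for that representation (hypothetical names; complex dtypes would still need the extra handling mentioned above):

```python
import numpy as np


def encode_ndarray(obj: np.ndarray) -> dict:
    # Keep enough context (shape + dtype string) alongside the flattened data
    # to rebuild the array exactly.
    return {'__ndarray__': (obj.shape, str(obj.dtype), obj.ravel().tolist())}


def decode_ndarray(dct: dict) -> np.ndarray:
    shape, dtype, data = dct['__ndarray__']
    return np.array(data, dtype=dtype).reshape(shape)
```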
I suggest you check out my project pydantic-numpy, from https://github.com/pydantic/pydantic/issues/7980.
I wonder if it makes more sense to integrate the logic you've designed in pydantic-numpy into pydantic-extra-types.
I'd personally prefer that pydantic/pydantic-extra-types natively support numpy types.
pydantic-numpy requires saving arrays to files rather than serializing and deserializing the numpy types themselves. I think this discussion raises some good points: https://github.com/pydantic/pydantic/discussions/4964
My use case is that I want to define a metadata standard for ML models that take large and complex arrays. The metadata just needs to record the numpy type, the order of labeled dimensions, and the shape. Saving out the whole array to load into the NumpyModel isn't preferable; it would be unnecessary storage and susceptible to path errors.
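For illustration, a minimal sketch of that kind of metadata-only model (the model and field names are hypothetical, not from the thread):

```python
import numpy as np
from pydantic import BaseModel, field_validator


class ArraySpec(BaseModel):
    # Record only the dtype, named dimension order, and shape -- not the data.
    dtype: str
    dims: list[str]
    shape: tuple[int, ...]

    @field_validator('dtype')
    @classmethod
    def _check_dtype(cls, value: str) -> str:
        np.dtype(value)  # raises TypeError for an unknown dtype string
        return value
```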
The co-author of pydantic-numpy here.
> My use case is that I want to define a metadata standard for ML models that take large and complex arrays.
If you only want validation and dimension enforcement, just use `pydantic_numpy.types`; it is compatible with `pydantic.BaseModel`, and even `pydantic.dataclass`.
> pydantic-numpy requires saving arrays to files rather than serializing and deserializing the numpy types themselves.
@rbavery your statement is false: interaction with numpy files and saving/loading is an optional quality-of-life feature, and it is only offered with `NumpyModel`; you can ignore this feature in your described case.
Please be careful when you make these claims; I'd rather have you ask a question than make a claim when you are uncertain.
Update
I wanted to share some updates and thoughts regarding the pydantic-numpy project. Here's what we're looking into:
- Refactoring of pydantic-numpy.typing: We're giving the submodule a minor overhaul. The key change is the transition to automatically generated code. This shift is essential since dynamically generated typing hints aren't compatible with static type checkers like MyPy and Pyright. The refactor is compatible with static type checkers, and the code generator's script is available in the repository for reference.
- Introducing NumpyModel for file IO: We've added a NumpyModel that supports integrated file IO operations. This should streamline processes that involve NumPy data handling.
- Comprehensive testing for coverage: We've also put significant effort into extensive testing to ensure robust coverage and reliability.
Regarding the integration of pydantic-numpy with this repository, I propose keeping them separate for the following reasons:

- Complexity management: Merging pydantic-numpy into this repository would significantly increase the complexity of both codebases. Our goal is to maintain simplicity and clarity in our projects.
- Community feedback: I'm aware of the requests to incorporate NumPy types directly into this repository. While I understand the perspective, I believe maintaining separation is in our best interest for streamlined development and maintenance.
- Documentation update: To make pydantic-numpy more discoverable for those who need it, we're considering adding a small dedicated section about it in the Pydantic documentation.

I hope this update aligns with your goals, and I look forward to your thoughts and feedback.
Whoops, sorry @caniko, my bad. I read the README incorrectly. Thanks for correcting me.
I want something like this:
```python
from numpy import uint16, uint32
from pydantic import BaseModel, Field


class FeederArc(BaseModel):
    satID: uint16
    satelliteAntenna: uint16 = Field(ge=1, le=256)
    baseStationFeeder: uint32
```
That looks very strange. Why would you use these types outside of arrays? Python `int` would be much better here.
@caniko we need some types like `uint16` and `uint32`, but pydantic does not define them.
Those types are not Python primitives, and precision up- or down-casting is done for you when you use `int`. You are not gaining any benefit from using these types, not even saving memory, because of the Python overhead around numpy types.
If you want non-negative values within a certain range, use a `pydantic.AfterValidator` that raises an error when the value is outside the range.
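For illustration, a minimal sketch of that suggestion (the alias and helper names are hypothetical):

```python
from typing import Annotated

from pydantic import AfterValidator, BaseModel


def _check_uint16(value: int) -> int:
    # Reject anything that would not fit in an unsigned 16-bit integer.
    if not 0 <= value <= 65535:
        raise ValueError('value must be in the range [0, 65535]')
    return value


UInt16Like = Annotated[int, AfterValidator(_check_uint16)]


class FeederArc(BaseModel):
    satID: UInt16Like
```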
@caniko, if you combine numpy types in pydantic with orjson, which supports numpy, you can significantly reduce JSON payload sizes. For instance, in our embeddings API, using numpy `float32` instead of Python's default `float64` reduces the payload size for 254 embeddings from 1.73 MB to 1.17 MB.
I would suggest that you do the filtering inside the arrays before moving the data into models; numpy is perfect for this, and you should adhere to this idiom.
I don't see the value of adding Pydantic support for these kinds of operations; just do orjson serialization from the numpy types, with arbitrary types allowed set to true in your models. Edit: you must also follow the necessary steps to activate numpy serialization.
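For illustration, a minimal sketch of that setup (the model and field names are hypothetical):

```python
import numpy as np
import orjson
from pydantic import BaseModel, ConfigDict


class Embeddings(BaseModel):
    # arbitrary_types_allowed lets the raw ndarray live on the model.
    model_config = ConfigDict(arbitrary_types_allowed=True)
    vectors: np.ndarray


m = Embeddings(vectors=np.zeros((2, 3), dtype=np.float32))

# OPT_SERIALIZE_NUMPY activates orjson's native numpy serialization.
payload = orjson.dumps({'vectors': m.vectors}, option=orjson.OPT_SERIALIZE_NUMPY)
```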
If you care that much about payload size, why use Python? Why not use Go for the API, and have pub/sub to your Python AI/ML? Genuinely curious, because you are trying to optimize for things that Python was never designed to perform well at.
The idea here is to support the numpy types mentioned on https://numpy.org/doc/stable/reference/arrays.scalars.html.