p2p-ld / numpydantic

Type and shape validation and serialization for numpy arrays in pydantic models
https://numpydantic.readthedocs.io/
MIT License

Discussion: broad support for array validation #9

Open liquidcarbon opened 3 weeks ago

liquidcarbon commented 3 weeks ago

Hi, I ran into your repo and wanted to share that I've been thinking about something similar.

I'd like to be able to declare typed vector dataclasses as concisely as possible, and validate/serialize them using python or SQL: https://github.com/duckdb/duckdb/discussions/13405

I've played with pydantic and don't see it supporting my use case.

I'm curious why building for pydantic was important to you, and what other alternatives you considered?

sneakers-the-rat commented 3 weeks ago

Fair question :)

So the type does work on its own as a standard python type, but I have yet to get satisfying behavior out of the python type system or the various static type checkers. Pydantic helps with that part a bit: by patching into its schema system, some of that can be salvaged.
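
Roughly, the same annotation works as a plain type hint and as a pydantic field. A minimal sketch along the lines of the README example (the shape-string syntax comes from nptyping):

import numpy as np
from pydantic import BaseModel
from numpydantic import NDArray, Shape

class Image(BaseModel):
    # any width and height, 3 color channels, 8-bit unsigned; dtype and shape get validated
    array: NDArray[Shape["* x, * y, 3 rgb"], np.uint8]

# accepts a numpy array here; other backends (hdf5, zarr, dask, ...) plug in via interfaces
img = Image(array=np.zeros((640, 480, 3), dtype=np.uint8))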

As a historical note, it is also related to my translation of a neuroscientific data standard into linkml, and for that we wanted very simple dataclass-like models with strict validation.

The other part is that it's a good middle ground between web tech and more traditional python programming - I wanted something where data models could be used in data analysis tools as well as in APIs via e.g. FastAPI, and arrays are not usually well supported across both.

For my own purposes I want to extend the idea and use pydantic models as a quasi-ORM to schematized graph databases, and they seemed like a better framework to develop off of than e.g. SQLAlchemy or vanilla dataclasses.

The recent perf problems, particularly with import speed, are giving me second thoughts, but they are still serving my needs well.

See also sqlmodel for pydantic & SQL - it makes sense to me to eventually bridge those, though that's more of a classic ORM that doesn't necessarily treat fields as vectors, but rather rows as class instances.

liquidcarbon commented 3 weeks ago

Nice, thanks!

doesn't treat fields as vectors

That's one of the reasons I'm unsatisfied with what exists. Pydantic and all the ORM tools serve traditional OLTP use cases, where each row means something and schemas change rarely. I'm building for OLAP querying, where you're "building an airplane (schema) in flight", writing rarely in blocks of a million rows at a time, and running complex aggregations. Think a plate reader or mass-spec run or microscope image (I also come from the life sciences).

sneakers-the-rat commented 3 weeks ago

Aha, well I think we're thinking along the same lines :)

This work grew out of linkml arrays: https://linkml.io/linkml/schemas/arrays.html

It's something that a few biomed formats are trying to get to a state where it can be a target for interoperability: abstract, format-neutral array specs with generic bridges to implementations using tools like this, exactly for the kind of "write rarely, large array, complex query" work you describe. And I'm particularly interested in the "building an airplane in flight" part, because that's exactly my goal.

And you can see how I'm in the process of using this package over here with Neurodata Without Borders, though we're still not quite at a place where I can show you what that's intended to look like yet: https://github.com/p2p-ld/nwb-linkml/tree/main/nwb_linkml/src/nwb_linkml/models/pydantic/core/v2_7_0

You may also be interested in https://github.com/orgs/linkml/discussions/2020

Curious what standard/format you're thinking of?

liquidcarbon commented 3 weeks ago

I'll need to read up on LinkML!

I'm building towards being able to declare, as concisely as possible, typed and annotated dataclasses:

class MyData(BaseData):
  i: Scalar("foreign key or partition")
  x: VectorInt16("16 bit int timestamps")
  y: VectorFloat64("something we measured")

# usage:
d = MyData()
d.ddl  # generates CREATE TABLE statements
d.df  # empty Pandas dataframe with proper types (or Polars, or Arrow)

d = MyData.from_pandas(input_df, dtype_conversion="raise/coerce/skip")  # instantiate from other data
d.df  # show as pandas
d.to_parquet()  # write Parquet with metadata

Where BaseData handles all the magic. The underlying vector arrays are numpy or nullable pandas types.

I've written a prototype that does almost what I need, but it's pretty awkward.

liquidcarbon commented 3 weeks ago

Side note: I LOL'd about the pydantic-LOL example!

When dealing with imaginary microscopy data, where say you've got F fields over C channels at T timepoints of (X,Y) resolution, instead of a 5-dimensional array I'd rather break it down into F parquet files with C*T columns and X*Y rows.

That way any specific FOV is easily accessible in its own file, and images can be recreated from a 1D array with a simple .reshape(x, y).
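
As a toy illustration of that round trip (sizes and values here are made up):

import numpy as np

rng = np.random.default_rng(0)
# one hypothetical 512x512 field of view, e.g. 12-bit camera data in a uint16 array
frame = rng.integers(0, 4096, size=(512, 512), dtype=np.uint16)
column = frame.ravel()               # the flat X*Y-long column that lands in the parquet file
restored = column.reshape(512, 512)  # the image, recreated from the flat column
assert np.array_equal(frame, restored)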

There's quite a lot of processing you can do for pennies using AWS Athena that would demand pricey compute otherwise. I'm not a fan of zarr and other high-D arrays - in part because I've never used them. But I'm weird like that, I've written feature calling in SQL :)

sneakers-the-rat commented 3 weeks ago

When dealing with imaginary microscopy data, where say you've got F fields over C channels at T timepoints of (X,Y) resolution, instead of a 5-dimensional array I'd rather break it down into F parquet files with C*T columns and X*Y rows.

that's exactly the kind of abstraction that numpydantic is for - being able to specify an abstract shape/dtype/whatever else you want and then satisfy it with whatever kind of backend you want. we're dealing with conflicting constraints across a number of currently mutually incompatible biomed formats, each with its own strong opinions, whether that's whatever fancy feature-calling spec you prefer or high-d chunked array storage: the point is it shouldn't matter from an API pov.
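
for instance, the microscopy case above could be written as a single abstract spec (a sketch - the dimension labels and dtype are made up, and what backs it is up to the interface):

import numpy as np
from pydantic import BaseModel
from numpydantic import NDArray, Shape

class Acquisition(BaseModel):
    # 5 named dimensions of any size, 16-bit unsigned pixels.
    # the same annotation could be satisfied by one in-memory array, an hdf5/zarr store,
    # or (with a custom interface) a pile of parquet files reshaped on access.
    images: NDArray[Shape["* fields, * channels, * timepoints, * x, * y"], np.uint16]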

liquidcarbon commented 3 weeks ago

Estimated Reading Time: 279 minutes (75143 words)

Was that a casual afternoon of writing or your dissertation?

sneakers-the-rat commented 3 weeks ago

about a third of it

sneakers-the-rat commented 3 weeks ago

PRs welcome re: format interconversion tho - it's a harder problem than it seems, but you could probably balance that out by specifying mixin classes that clarify when a to_{format} applies to a from_{format}.
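
something in this shape, maybe (a rough sketch - none of these names exist in the package, and the to_df/from_df hooks are hypothetical):

import pandas as pd

class ParquetIO:
    """mixin sketch: a model only advertises to_parquet together with from_parquet"""

    # hypothetical hooks a concrete model would override
    def to_df(self) -> pd.DataFrame:
        raise NotImplementedError

    @classmethod
    def from_df(cls, df: pd.DataFrame):
        raise NotImplementedError

    def to_parquet(self, path: str) -> None:
        self.to_df().to_parquet(path)

    @classmethod
    def from_parquet(cls, path: str):
        return cls.from_df(pd.read_parquet(path))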

liquidcarbon commented 3 weeks ago

I don't have a lot of strong opinions, but I feel like the high-d arrays camp is missing out by not being on board with the de facto data standard for everything, which is Parquet on an S3-compatible API. Lance is also built on pyarrow and powers extremely demanding ML use cases.

sneakers-the-rat commented 3 weeks ago

I don't have a lot of strong opinions, but I feel like the high-d arrays camp is missing out by not being on board with the de facto data standard, which is Parquet on an S3-compatible API. Lance is also built on pyarrow and powers extremely demanding ML use cases.

great! if this is an interface you would like to write, then that's what numpydantic is for - being able to express an API that can be abstract across array implementations. people also store data as images and video files, and as csv files and .mat files and whatever other hideous format you can imagine - this is an abstraction layer between the abstract notion of "an array of this kind" and the concrete way that it is implemented. my goal is not to make everyone arrive at the data standard and representation that is most friendly to Amazon's servers at a particular scale, but to make it possible to express data in whatever form it comes in. It doesn't often actually trade off with perf, but the design priority here is expressiveness.

re: the prior post, the problem with the strategy of making a new base class with its own model semantics (there are a number of these, particularly for dataframes) is that it requires everyone to use a different base class with different model semantics. often people already have their whole modeling system in place and don't need a different framework. that's why linkml is a little bit orthogonal: the goal is to be able to express a schema in a very abstract format and translate it to many others. this is the part that sits in between the many data modeling tools and the comparatively few but still significant number of schema languages. so it would be possible to write an interface to say NDArray[Shape, DType, WhateverConstraint] and satisfy that with a parquet and object storage backend.
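
eg. an eager version of that should already work by materializing a parquet column before validation (file and column names here are made up; a real integration would implement a numpydantic interface rather than loading everything up front):

import numpy as np
import pyarrow.parquet as pq
from pydantic import BaseModel
from numpydantic import NDArray, Shape

class Run(BaseModel):
    # abstract spec: a 1-D float64 vector of any length
    y: NDArray[Shape["* rows"], np.float64]

table = pq.read_table("run.parquet")         # hypothetical file
run = Run(y=table.column("y").to_numpy())    # hypothetical column name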

so if your goal is to express a very specific data model in a very specific implementation, then this project may be relevant as a way to connect it to a more general schema language - or, who knows, you may have some interesting ideas about format abstraction and I'd love to see them! trying to figure out common ground here