pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add DataFrame level metadata #5117

Open freeformstu opened 2 years ago

freeformstu commented 2 years ago

Problem description

I would like to be able to track dataframe-specific metadata through processing, serialization, and deserialization.

A common use case for dataframe metadata is to store data about how the dataframe was generated or metadata about the data contained within its columns.

Below are some examples of existing libraries and formats which have dataframe level metadata. I am definitely open to putting this metadata elsewhere if there's a better place for it.

Arrow

With PyArrow, you can add metadata to the Schema with with_metadata. https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.with_metadata
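For reference, a minimal PyArrow sketch (the key and value are just illustrative; PyArrow stores them as bytes):

```python
import pyarrow as pa

schema = pa.schema([("a", pa.int64())])
# with_metadata returns a new schema carrying the key/value pairs.
schema_with_meta = schema.with_metadata({"source": "sensor-42"})
print(schema_with_meta.metadata)  # {b'source': b'sensor-42'}
```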

IPC

Arrow's IPC format can store File level metadata. https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
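Through PyArrow, schema-level key/value metadata survives an IPC round trip, which is one way to persist file-wide information; a rough sketch with an illustrative key:

```python
import pyarrow as pa

table = pa.table({"a": [1, 2, 3]}).replace_schema_metadata({"pipeline": "v1.2.0"})

# Write and re-read an Arrow IPC file; the schema metadata travels with it.
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

with pa.OSFile("data.arrow", "rb") as source:
    loaded = pa.ipc.open_file(source).read_all()
print(loaded.schema.metadata)  # {b'pipeline': b'v1.2.0'}
```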

Parquet

Parquet has File and Column level metadata. Metadata per column may be useful for some use cases, but for the purposes of this issue I'd like to scope the metadata to the file level. https://parquet.apache.org/docs/file-format/metadata/
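Via PyArrow, file-level key/value metadata lands in the Parquet footer through the schema metadata; a hedged sketch with an illustrative key:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2, 3]}).replace_schema_metadata({"generator": "my-pipeline"})
pq.write_table(table, "data.parquet")

# The footer's key/value metadata includes the custom entry (plus b'ARROW:schema').
print(pq.read_metadata("data.parquet").metadata)
```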

Avro

Avro supports file level metadata. https://avro.apache.org/docs/1.11.1/specification/_print/#object-container-files

Pandas

Pandas has an attrs attribute for the same purpose. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.attrs.html
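For comparison, attrs is just a plain dict hanging off the frame (the key here is illustrative):

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3]})
pdf.attrs["source"] = "sensor-42"  # arbitrary DataFrame-level metadata
print(pdf.attrs)                   # {'source': 'sensor-42'}
```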

kylebarron commented 2 years ago

This is something I've thought about a little for my geospatial use case, where it's helpful to be able to store extra (non-columnar) information along with the data, like its coordinate reference system information. In my case, I think I'd opt for arrow extension data type support (which allows for column-level metadata) instead of dataframe-level metadata, but I can see how that wouldn't fit every use case

ritchie46 commented 2 years ago

I think you should use a newtype pattern for this if you want the DataFrame to carry some extra data.
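A minimal sketch of that newtype/wrapper approach (the `AnnotatedFrame` name is purely illustrative, not an existing Polars type):

```python
from dataclasses import dataclass, field
from typing import Any

import polars as pl


@dataclass
class AnnotatedFrame:
    """Wrap a DataFrame together with free-form metadata instead of subclassing."""

    df: pl.DataFrame
    meta: dict[str, Any] = field(default_factory=dict)


frame = AnnotatedFrame(pl.DataFrame({"a": [1, 2, 3]}), meta={"crs": "EPSG:4326"})
print(frame.meta["crs"])  # EPSG:4326
```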

With regard to extension types: this is something we want to support. We first need FixedSizeList, and then we can work on extension types.

universalmind303 commented 2 years ago

I think it could make a lot of sense to store some metadata on the schema itself; arrow2 is already doing this. I can think of many scenarios where it could be used. For example, with the new binary dtype, it would be helpful to store some metadata about the encoding of the binary data.

markdoerr commented 1 year ago

I am also looking for a persistent way to store additional metadata on Polars columns, like @freeformstu suggested. This would be very useful if one wants to attach semantic information to a column, e.g. for machine learning purposes. An IRI pointing to an ontology would enable the data to be interpreted autonomously in the desired way: AI/ML algorithms would "understand" the meaning of a certain data column, e.g. column "time" -> "EMMO:time" (once a connection to an ontology is made, a lot of information, like units, relations, etc., can be extracted). I hope this example makes it clear why metadata is needed (it would also be sad to lose information coming from Parquet or PyArrow on the way into Polars).
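As a concrete illustration (the IRI and column name are only examples), PyArrow already supports column-level metadata on fields, which could carry such a link:

```python
import pyarrow as pa

# Attach an ontology IRI to a single column via field-level metadata.
time_field = pa.field("time", pa.float64(), metadata={"iri": "EMMO:time"})
schema = pa.schema([time_field])
print(schema.field("time").metadata)  # {b'iri': b'EMMO:time'}
```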

Insighttful commented 9 months ago

@freeformstu I think you could accomplish this by subclassing polars.DataFrame as a MetaDataFrame and using a SimpleNamespace for the meta attribute, so you retain dot-accessor syntax:

```python
import meta_dataframe as mdf

df = mdf.MetaDataFrame({"a": [1, 2, 3]})
df.meta.name = "checkins"  # used as the file stem below
df.meta.crs = "EPSG:4326"
df.meta.crs = "EPSG:3857"  # overwrites the previous value
print(df.meta.crs)
# EPSG:3857
```

By leveraging the Arrow IPC specification you can provide additional functions to write and read the DataFrame while also managing the metadata.

```python
filepath = f"{df.meta.name}.ipc"  # use a fully qualified path if desired
df.write_ipc_with_meta(filepath)

loaded_df = mdf.read_ipc_with_meta(filepath)
print(loaded_df.meta.crs)
# EPSG:3857
```

As I answered on Stack:

MetaDataFrame module example (`meta_dataframe.py`):

```python
# meta_dataframe.py
"""Provides functionality for handling Polars DataFrames with custom metadata.

This module enables the serialization and deserialization of Polars DataFrames
along with associated metadata, utilizing the IPC format for data interchange
and `orjson` for fast JSON processing. Metadata management is facilitated
through the use of the `DfMeta` class, a flexible container for arbitrary
metadata fields. Key functions include `write_ipc_with_meta` and
`read_ipc_with_meta`, which allow for the persistence of metadata across
storage cycles, enhancing data context retention and utility in analytical
workflows.

Note:
    This module was not written for efficiency or performance, but to solve the
    use case of persisting metadata with Polars DataFrames. It is not
    recommended for production use, but rather as a starting point for more
    robust metadata management.

Classes:
    DfMeta: A simple namespace for metadata management.
    MetaDataFrame: An extension of Polars DataFrame to include metadata.

Functions:
    write_ipc_with_meta(df, filepath, meta): Serialize DataFrame and metadata to IPC.
    read_ipc_with_meta(filepath): Deserialize DataFrame and metadata from IPC.
"""

# Standard Library
from typing import Any, Dict
from types import SimpleNamespace

# Third Party
import orjson
import polars as pl
import pyarrow as pa


class DfMeta(SimpleNamespace):
    """A simple namespace for storing MetaDataFrame metadata.

    Usage:
        meta = DfMeta(
            name="checkins",
            db_name="my_db",
            tz_name="America/New_York",
            crs="EPSG:4326",
        )
    """

    # Generate a string representation of metadata keys
    def __repr__(self) -> str:
        keys = ", ".join(self.__dict__.keys())
        return f"DfMeta({keys})"

    # Alias __str__ to __repr__ for consistent string representation
    def __str__(self) -> str:
        return self.__repr__()


class MetaDataFrame(pl.DataFrame):
    """A Polars DataFrame extended to include custom metadata.

    Attributes:
        meta (DfMeta): A simple namespace for storing metadata.

    Usage:
        # Create MetaDataFrame with metadata
        meta = DfMeta(
            name="my_df",
            db_name="my_db",
            tz_name="America/New_York",
            crs="EPSG:4326",
        )
        df = MetaDataFrame({"a": [1, 2, 3]}, meta=meta)

        # Create MetaDataFrame then add metadata
        df = MetaDataFrame({"a": [1, 2, 3]})
        df.meta.name = "my_df"
        df.meta.db_name = "my_db"
        df.meta.tz_name = "America/New_York"
        df.meta.crs = "EPSG:4326"

        # Overwrite metadata
        df.meta.crs = "EPSG:3857"

        # Write MetaDataFrame to IPC with metadata
        df.write_ipc_with_meta("my_df.ipc")

        # Read MetaDataFrame from IPC with metadata
        loaded_df = read_ipc_with_meta("my_df.ipc")

        # Access metadata
        print(loaded_df.meta.name)
        print(loaded_df.meta_as_dict())
    """

    # Initialize DataFrame with `meta` attr SimpleNamespace
    def __init__(self, data: Any = None, *args, meta: DfMeta = None, **kwargs):
        super().__init__(data, *args, **kwargs)
        self.meta = meta if meta else DfMeta()

    def meta_as_dict(self) -> dict[str, Any]:
        """Returns the metadata as a dictionary.

        Returns:
            dict[str, Any]: A dictionary representation of the metadata.
        """
        return vars(self.meta)

    def write_ipc_with_meta(self, filepath: str) -> None:
        """Serialize MetaDataFrame and metadata stored in `meta` attr to an IPC file.

        Args:
            filepath (str): The path to the IPC file.

        Returns:
            None
        """
        # Convert Polars DataFrame to Arrow Table
        arrow_table = self.to_arrow()

        # Serialize metadata to JSON
        meta: DfMeta = self.meta
        meta_dict = {k: v for k, v in meta.__dict__.items()}
        meta_json = orjson.dumps(meta_dict)

        # Embed metadata into Arrow schema
        new_schema = arrow_table.schema.with_metadata({"meta": meta_json})
        arrow_table_with_meta = arrow_table.replace_schema_metadata(new_schema.metadata)

        # Write Arrow table with metadata to IPC file
        with pa.OSFile(filepath, "wb") as sink:
            with pa.RecordBatchStreamWriter(
                sink, arrow_table_with_meta.schema
            ) as writer:
                writer.write_table(arrow_table_with_meta)


def read_ipc_with_meta(filepath: str) -> MetaDataFrame:
    """Deserialize DataFrame and metadata from an IPC file.

    Args:
        filepath (str): The path to the IPC file.

    Returns:
        MetaDataFrame: The deserialized DataFrame with metadata stored in `meta` attr.
    """
    # Read Arrow table from IPC file
    with pa.OSFile(filepath, "rb") as source:
        reader = pa.ipc.open_stream(source)
        table = reader.read_all()

    # Extract and deserialize metadata from Arrow schema
    meta_json = table.schema.metadata.get(b"meta")
    if meta_json:
        meta_dict = orjson.loads(meta_json)
        meta = DfMeta(**meta_dict)
    else:
        meta = DfMeta()

    # Convert Arrow table to Polars DataFrame and attach metadata
    df = pl.from_arrow(table)
    extended_df = MetaDataFrame(df, meta=meta)
    return extended_df
```
AlexanderNenninger commented 7 months ago

I recently ran into the same issue with sensor data. I'd really like to preserve units, orientations, sampling frequency, etc. through processing, as it helps with catching bad data. Maybe this could be a Polars extension, though, or people could just roll their own implementation on a per-project basis.

During my experiments, I found that adding support for custom data in the (de)serialization methods where it makes sense (it could be as simple as being able to pass in an additional Dict[bytes, bytes]) would simplify the implementation dramatically and could make it more robust.
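Until something like that exists, a user-side shim through PyArrow can approximate it; the helper below is purely hypothetical (name and signature made up for illustration):

```python
from typing import Dict

import polars as pl
import pyarrow.parquet as pq


def write_parquet_with_meta(df: pl.DataFrame, path: str, meta: Dict[bytes, bytes]) -> None:
    """Hypothetical helper: push a Dict[bytes, bytes] into the Parquet footer via PyArrow."""
    table = df.to_arrow()
    merged = {**(table.schema.metadata or {}), **meta}
    pq.write_table(table.replace_schema_metadata(merged), path)


df = pl.DataFrame({"a": [1, 2, 3]})
write_parquet_with_meta(df, "data.parquet", {b"unit": b"m/s"})
print(pq.read_metadata("data.parquet").metadata)  # includes b'unit'
```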

E.g. currently there's really no good way of storing a Categorical(ordering="lexical") column in Parquet through PyArrow. Hive partitioning also has a few pitfalls w.r.t. data types.

Is it bad style to link my own repo? In any case, if someone needs a temporary solution with a lot of the pitfalls already ironed out: https://github.com/AlexanderNenninger/parquet_data_classes/tree/main