freeformstu opened 2 years ago
This is something I've thought about a little for my geospatial use case, where it's helpful to be able to store extra (non-columnar) information along with the data, like its coordinate reference system. In my case, I think I'd opt for Arrow extension data type support (which allows for column-level metadata) instead of dataframe-level metadata, but I can see how that wouldn't fit every use case.
I think you should use a newtype pattern for this if you want the `DataFrame` to have some extra data.
With regard to extension types: this is something we want to support. We first need `FixedSizeList`, and then we can work on extension types.
I think it could make a lot of sense to store some metadata on the schema itself; arrow2 is already doing this. I can think of many scenarios where it could be used. For example, with the new binary dtype, it would be helpful to store some metadata about the encoding of the binary data.
I am also looking for a persistent way to attach additional metadata to Polars columns, like @freeformstu suggested. This would be very useful if one wants to attach semantic information to a column, e.g. for machine learning purposes. An IRI pointing to an ontology would enable interpreting the data autonomously in the desired way: AI/ML algorithms would "understand" the meaning of a certain data column, e.g. column "time" -> "EMMO:time" (once one makes a connection to an ontology, a lot of information, like units, relations, etc., can be extracted). I hope this example makes it clear why metadata is needed (it would also be sad to lose metadata when coming from Parquet or PyArrow to Polars).
@freeformstu I think you could just accomplish this by subclassing the `polars.DataFrame` type as a `MetaDataFrame` and using a `SimpleNamespace` for the meta, so you retain dot-accessor access:
```python
import meta_dataframe as mdf  # hypothetical helper module

df = mdf.MetaDataFrame({"a": [1, 2, 3]})
df.meta.name = "my_dataframe"  # needed below when building the file path
df.meta.crs = "EPSG:4326"
df.meta.crs = "EPSG:3857"  # attributes can be overwritten like any namespace
print(df.meta.crs)
# EPSG:3857
```
By leveraging the Arrow IPC specification, you can provide additional functions to write and read the DataFrame while also managing the metadata:
```python
filepath = f"{df.meta.name}.ipc"  # use a fully qualified path if desired
df.write_ipc_with_meta(filepath)

loaded_df = mdf.read_ipc_with_meta(filepath)
print(loaded_df.meta.crs)
# EPSG:3857
```
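For reference, here is a minimal sketch of what such a `MetaDataFrame` could look like, round-tripping the metadata through the Arrow schema's custom key/value metadata via PyArrow. The names `MetaDataFrame`, `write_ipc_with_meta`, and `read_ipc_with_meta` come from the snippet above and are not part of the polars API; note also that polars operations return a plain `DataFrame`, so the `meta` namespace does not survive processing.

```python
# Minimal sketch, assuming the metadata is JSON-serializable.
import json
from types import SimpleNamespace

import polars as pl
import pyarrow as pa


class MetaDataFrame(pl.DataFrame):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.meta = SimpleNamespace()

    def write_ipc_with_meta(self, path: str) -> None:
        table = self.to_arrow()
        # Stash the namespace as JSON under a custom key in the Arrow
        # schema metadata (a bytes -> bytes mapping), then write IPC.
        payload = json.dumps(vars(self.meta)).encode()
        table = table.replace_schema_metadata({b"user_meta": payload})
        with pa.OSFile(path, "wb") as sink:
            with pa.ipc.new_file(sink, table.schema) as writer:
                writer.write_table(table)


def read_ipc_with_meta(path: str) -> MetaDataFrame:
    with pa.OSFile(path, "rb") as source:
        table = pa.ipc.open_file(source).read_all()
    # Rebuild the subclass from the plain DataFrame's columns.
    df = MetaDataFrame(pl.from_arrow(table).get_columns())
    raw = (table.schema.metadata or {}).get(b"user_meta", b"{}")
    df.meta = SimpleNamespace(**json.loads(raw))
    return df
```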
I recently ran into the same issue with sensor data. I'd really like to preserve units, orientations, sampling frequency, etc. through processing, as it helps with catching bad data. Maybe this could be a polars extension, though, or people should just roll their own implementation on a per-project basis.
During my experiments, I found that adding support for custom data (it could be as simple as being able to pass an additional `Dict[bytes, bytes]`) in the (de-)serialization methods where it makes sense would simplify the implementation dramatically and could make it more robust; a sketch of what that could look like is below.
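To make the proposal concrete, here is the rough shape such an API could take. These are illustrative signatures only; treat the `metadata` parameter and the `read_parquet_metadata` function as hypothetical names, not existing polars API.

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

# Hypothetical: pass custom key/value metadata straight to the writer.
df.write_parquet("data.parquet", metadata={b"crs": b"EPSG:4326"})

# Hypothetical: read it back as a Dict[bytes, bytes] without loading data.
meta = pl.read_parquet_metadata("data.parquet")
```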
For example, there's currently no good way of storing a `Categorical(ordering="lexical")` column in Parquet through PyArrow. Hive partitioning also has a few pitfalls w.r.t. data types.
Is it bad style to link my own repo? In any case, if someone needs a temporary solution with a lot of pitfalls already ironed out: https://github.com/AlexanderNenninger/parquet_data_classes/tree/main
Problem description
I would like to be able to track dataframe specific metadata through processing, serialization, and deserialization.
A common use case for dataframe metadata is to store data about how the dataframe was generated or metadata about the data contained within its columns.
Below are some examples of existing libraries and formats which have dataframe level metadata. I am definitely open to putting this metadata elsewhere if there's a better place for it.
Arrow
With PyArrow, you can add metadata to the Schema with `with_metadata`: https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.with_metadata
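For example (using only documented PyArrow API; keys and values are arbitrary bytes):

```python
import pyarrow as pa

schema = pa.schema([pa.field("a", pa.int64())])
# with_metadata returns a new schema; the original is left unchanged.
schema = schema.with_metadata({b"crs": b"EPSG:4326"})
print(schema.metadata)  # {b'crs': b'EPSG:4326'}
```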
IPC
Arrow's IPC format can store file-level metadata: https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
Parquet
Parquet has file- and column-level metadata. Metadata per column may be useful for some use cases, but for the purposes of this issue I'd like to scope the metadata to the file level. https://parquet.apache.org/docs/file-format/metadata/
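As a sketch of the status quo: schema-level metadata attached through PyArrow lands in the Parquet footer's key/value metadata and can be read back without loading the data.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2, 3]})
table = table.replace_schema_metadata({b"source": b"sensor-42"})
pq.write_table(table, "data.parquet")

# Reads only the footer; the returned mapping includes b"source".
print(pq.read_metadata("data.parquet").metadata)
```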
Avro
Avro supports file-level metadata. https://avro.apache.org/docs/1.11.1/specification/_print/#object-container-files
Pandas
Pandas has an `attrs` attribute for the same purpose. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.attrs.html
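For comparison, `attrs` is just a plain dictionary hanging off the DataFrame (pandas documents it as experimental, and propagation through operations is not guaranteed):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.attrs["crs"] = "EPSG:4326"  # arbitrary DataFrame-level metadata
print(df.attrs)  # {'crs': 'EPSG:4326'}
```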