seedcase-project / seedcase-sprout

Upload your research data to formally structure it for better, more reliable, and easier research.
https://sprout.seedcase-project.org/
MIT License

Use frictionless data to extract the schema, rather than Polars #493

Closed by lwjohnst86 1 month ago

lwjohnst86 commented 3 months ago

The frictionless Python package can extract metadata from data files; see https://v4.framework.frictionlessdata.io/docs/guides/describing-data

So it might look like:

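from frictionless import Schema

# Infer the schema (field names and types) from the CSV file.
schema = Schema.describe("country-1.csv")
# Write the inferred schema to a YAML file.
schema.to_yaml("country.schema.yaml")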

Following the frictionless approach would also make it easier to integrate with their tooling.

signekb commented 2 months ago

frictionless can definitely be used to extract an initial schema (for CSV files*). We just need to find a way to save it to a database instead of a file.

In #544, I added frictionless with SQL support to our dependencies. According to the documentation, frictionless can both read from and write to SQL databases (using SQLAlchemy).

*Frictionless doesn't support extraction from, e.g., .txt files containing data (at least as far as I understand), but it does support other file types such as JSON and Parquet.
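
To make the "save it to a database instead of a file" idea concrete, here is a minimal sketch that uses only the Python standard library. It assumes the extracted schema is available as a plain dictionary in Table Schema form. The `descriptor` contents, the `table_schemas` table, and the `save_schema` / `load_schema` helpers are all hypothetical names for illustration; this is neither frictionless's own SQL support nor necessarily how Sprout does it.

```python
import json
import sqlite3

# Hypothetical descriptor in Table Schema form. In practice this would come
# from the extracted schema rather than being written by hand.
descriptor = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
    ]
}


def save_schema(conn, table_name, descriptor):
    """Store a schema descriptor as JSON text, keyed by the data table name."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS table_schemas ("
        "table_name TEXT PRIMARY KEY, descriptor TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO table_schemas (table_name, descriptor) "
        "VALUES (?, ?)",
        (table_name, json.dumps(descriptor)),
    )
    conn.commit()


def load_schema(conn, table_name):
    """Return the stored descriptor as a dict, or None if there is none."""
    row = conn.execute(
        "SELECT descriptor FROM table_schemas WHERE table_name = ?",
        (table_name,),
    ).fetchone()
    return json.loads(row[0]) if row else None


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
save_schema(conn, "country", descriptor)
loaded = load_schema(conn, "country")
```

Storing the descriptor as JSON text keeps the database layout independent of the schema's shape. The trade-off is that individual fields can't be queried with plain SQL.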

lwjohnst86 commented 1 month ago

We've already started doing this!