cc @Beetelbrox
For parquet it seems we can read the schema & metadata from the parquet file footer quite easily with the tooling provided by pyarrow:
from pyarrow.parquet import ParquetFile
from pyarrow import fs

# Open the object through the S3 filesystem; only the footer is fetched
# when reading the schema. S3_KEY is the bucket/key of the parquet object.
s3 = fs.S3FileSystem(region='eu-west-1')
pf = ParquetFile(s3.open_input_file(S3_KEY))
print(pf.schema)
This returns the file's schema without reading the file:
required group field_id=-1 schema {
  optional double field_id=-1 X;
  optional double field_id=-1 Y;
  optional double field_id=-1 Z;
  optional double field_id=-1 X_noise;
  optional double field_id=-1 Y_noise;
  optional double field_id=-1 Z_noise;
  optional double field_id=-1 R;
  optional double field_id=-1 G;
  optional double field_id=-1 B;
  optional double field_id=-1 time;
  optional double field_id=-1 eol;
  optional double field_id=-1 label;
}
I tried it on a 2 GB parquet file and it read the schema instantly, with no noticeable memory usage. The file metadata is also available via ParquetFile.metadata:
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 12
num_rows: 46234315
num_row_groups: 1
format_version: 1.0
serialized_size: 2778
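These fields can also be read programmatically; a minimal sketch, reusing the pf object from the snippet above (all of this comes from the footer, so no data pages should be touched):

# Top-level counts straight from the footer metadata
meta = pf.metadata
print(meta.num_rows, meta.num_columns, meta.num_row_groups)

# Per-row-group / per-column details (sizes, statistics) are also in the footer
rg = meta.row_group(0)
print(rg.column(0).statistics)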
We only need to convert it to our desired format. I'm not sure whether I'd rather write a transformer from pyarrow to oMeta or hack it via Pandas to recycle the existing code.
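For illustration only, a rough sketch of what a pyarrow-to-oMeta transformer could look like; the type mapping and the column dict shape below are assumptions, not the actual OpenMetadata model:

import pyarrow as pa

# Hypothetical mapping from Arrow type names to oMeta-style data types;
# a real transformer would need to cover the full type system.
PYARROW_TO_OMETA = {
    "double": "DOUBLE",
    "int64": "BIGINT",
    "string": "STRING",
}

def arrow_schema_to_columns(schema: pa.Schema) -> list:
    """Turn an Arrow schema into a list of column dicts (illustrative shape only)."""
    return [
        {"name": field.name,
         "dataType": PYARROW_TO_OMETA.get(str(field.type), "UNKNOWN")}
        for field in schema
    ]

# ParquetFile.schema is a parquet schema object, so use its Arrow view:
# columns = arrow_schema_to_columns(pf.schema_arrow)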
I have some questions about this ingestion: what exactly is its intended usage? Ingesting metadata for any file under a given prefix in an object store, or just ingesting specific static files in specific tabular-like formats?
From what I can read in the code, we treat as "tables" all objects under s3://<bucket>/prefix/ that have one of the accepted extensions; we then read them, extract/infer their schema, and store them as regular table entities.
If my interpretation is correct, we might face several challenges here:
A dedicated "file" entity might address some of the issues above. It may be worth considering if we want to enable ingesting more kinds of files.
Regardless of the points above, if we want to allow users to ingest arbitrary files' metadata from the data lake, we'll need to address at least the following:
This is an interesting and IMO challenging topic, so if you would like to discuss any of the points above please let me know! Thanks!
Hi @Beetelbrox,
Thanks for your thoughts:
Also, thanks for noticing that we are missing pagination; that is a good point we'll need to address. And thanks for analyzing the parquet read. I wonder if something similar can be done for JSON/CSV/TSV files so that we don't have to rely on the whole file. Maybe we could just read a percentage of the file and infer the schema from that.
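A minimal sketch of that sampling idea for CSV, assuming pandas with s3fs installed; the bucket, key and sample size are placeholders:

import pandas as pd

SAMPLE_ROWS = 100  # arbitrary sample size, just for illustration

# With s3fs installed pandas can read s3:// URLs directly; nrows limits how
# many rows are parsed, so dtypes are inferred from a small sample only.
sample = pd.read_csv("s3://my-bucket/path/to/file.csv", nrows=SAMPLE_ROWS)
inferred_schema = {col: str(dtype) for col, dtype in sample.dtypes.items()}
print(inferred_schema)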
Closing this one in favour of https://github.com/open-metadata/OpenMetadata/issues/6479
We need to analyze whether we need to load whole files when parsing schemas, or whether we can pick up just a few lines.
In the case of parquet, we might be able to read only the first row group, or go directly to the metadata in the footer (see the sketch below).
We need to be able to properly ingest the metadata of multi-GB files without loading everything into RAM.
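A hedged sketch of the parquet side, reusing the ParquetFile pf from the earlier snippet:

# Read only the first row group instead of the whole file; this returns a
# pyarrow.Table holding just that row group's rows.
first_rows = pf.read_row_group(0)
print(first_rows.num_rows, first_rows.schema)

# Or skip the data pages entirely and rely on the footer metadata alone.
print(pf.metadata.row_group(0).num_rows)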