open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0

Datalake connector performance #6295

Closed pmbrull closed 2 years ago

pmbrull commented 2 years ago

We need to analyze whether we have to load whole files when parsing their schemas, or whether we can pick up just a few lines.

In the case of parquet, we might be able to read just the first row group, or go directly for the metadata headers.

We need to be able to properly ingest the metadata of multi-GB files without needing to load everything into RAM.
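As a rough sketch of the parquet case with pyarrow (the bucket and key below are placeholder examples), the schema and row counts come from the file footer, and at most the first row group would need to be materialized:

from pyarrow.parquet import ParquetFile
from pyarrow import fs

# Placeholder bucket/key, for illustration only
s3 = fs.S3FileSystem(region="eu-west-1")
pf = ParquetFile(s3.open_input_file("my-bucket/path/to/data.parquet"))

print(pf.schema_arrow)       # schema, read from the footer only
print(pf.metadata.num_rows)  # row count, also footer-only

first_group = pf.read_row_group(0)  # materializes just the first row group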

pmbrull commented 2 years ago

cc @Beetelbrox

Beetelbrox commented 2 years ago

For parquet it seems we can read the schema & metadata from the parquet file footer quite easily with the tooling provided by pyarrow:

from pyarrow.parquet import ParquetFile
from pyarrow import fs

# S3_KEY is the "bucket/path/to/file.parquet" key of the object to inspect
s3 = fs.S3FileSystem(region='eu-west-1')
pf = ParquetFile(s3.open_input_file(S3_KEY))

# Only the footer is fetched, so this works even for very large files
print(pf.schema)

This returns the file's schema without reading the file's contents:

required group field_id=-1 schema {
  optional double field_id=-1 X;
  optional double field_id=-1 Y;
  optional double field_id=-1 Z;
  optional double field_id=-1 X_noise;
  optional double field_id=-1 Y_noise;
  optional double field_id=-1 Z_noise;
  optional double field_id=-1 R;
  optional double field_id=-1 G;
  optional double field_id=-1 B;
  optional double field_id=-1 time;
  optional double field_id=-1 eol;
  optional double field_id=-1 label;
}

Tried it on a 2 GB parquet file and it returned instantly, with negligible memory usage. The file-level metadata is also exposed in ParquetFile.metadata:

  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 12
  num_rows: 46234315
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 2778
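For reference, the same footer also exposes per-row-group stats, which could help decide how much data to read (reusing the pf object from the snippet above):

print(pf.metadata)  # the file-level metadata shown above

rg = pf.metadata.row_group(0)           # per-row-group metadata, still footer-only
print(rg.num_rows, rg.total_byte_size)  # rows and byte size of the first row group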

We only need to convert it to our desired format. I'm not sure whether I'd rather write a transformer from pyarrow to oMeta, or hack it via Pandas to recycle the existing code.
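If we went the transformer route, a minimal sketch could look like the following (the type mapping and the plain-dict output are placeholders, not the actual OpenMetadata column model):

import pyarrow as pa

# Hypothetical mapping from pyarrow types to OpenMetadata data type names
PYARROW_TO_OM_TYPE = {
    "double": "DOUBLE",
    "float": "FLOAT",
    "int32": "INT",
    "int64": "BIGINT",
    "bool": "BOOLEAN",
    "string": "STRING",
}

def arrow_schema_to_columns(schema: pa.Schema) -> list:
    """Turn a pyarrow schema into OpenMetadata-style column dicts (sketch only)."""
    return [
        {"name": field.name, "dataType": PYARROW_TO_OM_TYPE.get(str(field.type), "UNKNOWN")}
        for field in schema
    ]

# pf.schema is a ParquetSchema; to_arrow_schema() gives the pyarrow Schema
# columns = arrow_schema_to_columns(pf.schema.to_arrow_schema())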

Beetelbrox commented 2 years ago

I have questions on this ingestion: what exactly is its intended usage? Ingesting metadata for any file under a given prefix in an object store, or just ingesting specific static files in specific tabular-like formats?

From what I can read in the code, we treat as "tables" all objects under s3://<bucket>/prefix/ that have one of the accepted extensions, then read them, extract/infer their schema, and store them as regular table entities. If my interpretation is correct, we might face several challenges here:

A dedicated "file" entity might address some of the issues above. It may be worth considering if we want to enable ingesting more kinds of files.

Regardless of the points above, if we want to allow users to ingest arbitrary files' metadata from the datalake, we'll need to at least address the following:

This is an interesting and IMO challenging topic, so if you would like to discuss any of the points above please let me know! Thanks!

pmbrull commented 2 years ago

Hi @Beetelbrox,

Thanks for your thoughts:

Also, thanks for noticing that we are missing pagination; that is a good point we'll need to address. And thanks for analyzing the parquet read. I wonder if something similar can be done for JSON/CSV/TSV files so that we don't depend on the whole file. Maybe we could read just a percentage of the file and infer the schema from that.
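A rough sketch of what that could look like for CSV/TSV, assuming we only fetch the first chunk of the object and let pandas infer the types (bucket, key, and sample size are placeholders):

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Fetch only the first ~1 MB of the object instead of downloading the whole file
resp = s3.get_object(Bucket="my-bucket", Key="path/to/data.csv", Range="bytes=0-1048575")
sample = resp["Body"].read().decode("utf-8", errors="ignore")

# Drop the possibly-truncated last line, then infer the schema from the sample
sample = sample.rsplit("\n", 1)[0]
df = pd.read_csv(io.StringIO(sample))
print(df.dtypes)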

pmbrull commented 2 years ago

Closing this one in favour of https://github.com/open-metadata/OpenMetadata/issues/6479