cc @Beetelbrox
For parquet it seems we can read the schema & metadata from the parquet file footer quite easily with the tooling provided by pyarrow:
from pyarrow.parquet import ParquetFile
from pyarrow import fs

# Open the object through the S3 filesystem; only the footer is fetched
# when reading the schema. S3_KEY is the bucket/key of the parquet object.
s3 = fs.S3FileSystem(region='eu-west-1')
pf = ParquetFile(s3.open_input_file(S3_KEY))
print(pf.schema)
This returns the file's schema without reading the file:
required group field_id=-1 schema {
  optional double field_id=-1 X;
  optional double field_id=-1 Y;
  optional double field_id=-1 Z;
  optional double field_id=-1 X_noise;
  optional double field_id=-1 Y_noise;
  optional double field_id=-1 Z_noise;
  optional double field_id=-1 R;
  optional double field_id=-1 G;
  optional double field_id=-1 B;
  optional double field_id=-1 time;
  optional double field_id=-1 eol;
  optional double field_id=-1 label;
}
I tried it on a 2 GB parquet file and it read the schema instantly, with no noticeable memory usage. The file metadata is also available via ParquetFile.metadata:
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 12
num_rows: 46234315
num_row_groups: 1
format_version: 1.0
serialized_size: 2778
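These fields can also be read programmatically; a minimal sketch, reusing the pf object from the snippet above (all of this comes from the footer, so no data pages should be touched):

# Top-level counts straight from the footer metadata
meta = pf.metadata
print(meta.num_rows, meta.num_columns, meta.num_row_groups)

# Per-row-group / per-column details (sizes, statistics) are also in the footer
rg = meta.row_group(0)
print(rg.column(0).statistics)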
We only need to convert it to our desired format. I'm not sure whether I'd rather write a transformer from pyarrow to oMeta or hack it via Pandas to recycle the existing code.
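For illustration only, a rough sketch of what a pyarrow-to-oMeta transformer could look like; the type mapping and the column dict shape below are assumptions, not the actual OpenMetadata model:

import pyarrow as pa

# Hypothetical mapping from Arrow type names to oMeta-style data types;
# a real transformer would need to cover the full type system.
PYARROW_TO_OMETA = {
    "double": "DOUBLE",
    "int64": "BIGINT",
    "string": "STRING",
}

def arrow_schema_to_columns(schema: pa.Schema) -> list:
    """Turn an Arrow schema into a list of column dicts (illustrative shape only)."""
    return [
        {"name": field.name,
         "dataType": PYARROW_TO_OMETA.get(str(field.type), "UNKNOWN")}
        for field in schema
    ]

# ParquetFile.schema is a parquet schema object, so use its Arrow view:
# columns = arrow_schema_to_columns(pf.schema_arrow)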
I have some questions about this ingestion: what exactly is its intended usage? Ingesting metadata for any file under a given prefix in an object store, or just ingesting specific static files in specific tabular-like formats?
From what I can read in the code, we treat as "tables" all objects under s3://<bucket>/prefix/ that have one of the accepted extensions; we then read them, extract/infer their schema, and store them as regular table entities.
If my interpretation is correct, we might face several challenges here:
A dedicated "file" entity might address some of the issues above. It may be worth considering if we want to enable ingesting more kinds of files.
Regardless of the points above, if we want to allow users to ingest arbitrary files' metadata from the data lake, we'll need to address at least the following:
This is an interesting and IMO challenging topic, so if you would like to discuss any of the points above please let me know! Thanks!
Hi @Beetelbrox,
Thanks for your thoughts:
Also, thanks for noticing that we are missing pagination; that is a good point we'll need to address. And thanks for analyzing the parquet read. I wonder if something similar can be done for JSON/CSV/TSV files so that we don't have to rely on the whole file. Maybe we could just read a percentage of the file and infer the schema from that.
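A minimal sketch of that sampling idea for CSV, assuming pandas with s3fs installed; the bucket, key and sample size are placeholders:

import pandas as pd

SAMPLE_ROWS = 100  # arbitrary sample size, just for illustration

# With s3fs installed pandas can read s3:// URLs directly; nrows limits how
# many rows are parsed, so dtypes are inferred from a small sample only.
sample = pd.read_csv("s3://my-bucket/path/to/file.csv", nrows=SAMPLE_ROWS)
inferred_schema = {col: str(dtype) for col, dtype in sample.dtypes.items()}
print(inferred_schema)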
Closing this one in favour of https://github.com/open-metadata/OpenMetadata/issues/6479
We need to analyze whether we need to load whole files when parsing schemas, or whether we can pick up just a few lines.
In the case of parquet, we might be able to read only the first row group, or go directly to the metadata in the footer (see the sketch below).
We need to be able to properly ingest the metadata of multi-GB files without loading everything into RAM.
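A hedged sketch of the parquet side, reusing the ParquetFile pf from the earlier snippet:

# Read only the first row group instead of the whole file; this returns a
# pyarrow.Table holding just that row group's rows.
first_rows = pf.read_row_group(0)
print(first_rows.num_rows, first_rows.schema)

# Or skip the data pages entirely and rely on the footer metadata alone.
print(pf.metadata.row_group(0).num_rows)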