splitgraph / seafowl

Analytical database for data-driven Web applications 🪶
https://seafowl.io
Apache License 2.0
435 stars 12 forks source link

Feature Request: Comprehensive Export/Import Functionality for Table Management #623

Open mrchypark opened 2 months ago

mrchypark commented 2 months ago

Feature Request: Comprehensive Export/Import Functionality for Datalake Table Management

Overview

First and foremost, thank you for your incredible work on the seafowl project. Your efforts in managing table information using SQLite are truly appreciated. To further enhance the project's capabilities and support continuous table creation and read-only service operations, I'd like to propose a new feature: comprehensive export and import functionality for table information.

Proposed Feature

The feature would consist of three main components:

  1. Full Non-System Table Export/Import:

    • Export all table information from SQLite, excluding system tables.
    • Import this information, including updates to existing tables.
  2. Individual Table Export/Import:

    • Allow export and import of information for specific tables.
  3. Schema-based Table Export/Import:

    • Enable export and import of all table information within a given schema.

Use Case

This feature would greatly benefit users who need to:

Potential Implementation

While I understand that the specifics of implementation would be up to the project maintainers, some initial thoughts include:

Benefits

  1. Enhanced data portability between different seafowl instances.
  2. Simplified management of development/production environments.
  3. Improved support for backup and restore operations.
  4. Greater flexibility in table management across different use cases.

Conclusion

I believe this feature would significantly enhance seafowl's utility and appeal to a broader range of users and use cases. I'm excited to hear your thoughts on this proposal and would be happy to provide any additional information or clarification if needed.

Thank you for considering this feature request, and for your continued dedication to the seafowl project.

gruuya commented 2 months ago

Thanks for the kind words!

What you propose makes sense, and I think for the most part the functionality is already there, mainly via the clade crate (stands for Catalog of Lakehouse Definitions). It enables Seafowl to be run in "headless" mode, where the catalog is abstracted away to some remote system, that can then itself perform all table management functionality according to its needs (which would probably substantially vary from case to case).

The idea is that the clade::schema module provides common definitions needed for Seafowl to read (from) tables, facilitated by implementing and running clade::schema::schema_store_service_server::SchemaStoreServiceServer; you can see an example in the tests here and here.

In fact this is precisely what how we use Seafowl in production ourselves, the metastore is a separate component and Seafowl(s) talk to to it to learn about schemas, tables and object store locations when executing a query. One major drawback however is that it does not support writes, meaning the initial DML would need to be performed out-of-band for now.

mrchypark commented 2 months ago

Thank you for letting me know. I was thinking of a scenario where writing tasks, which use more resources, are performed on multiple Seafowl instances, and then the table information is combined in a read-only Seafowl. I will try using clade, referring to the example you provided.

What does out-of-band mean? Can you tell me what methods are available? I'm currently using a method where I create an external table and then perform aggregation queries to create the necessary tables.

gruuya commented 2 months ago

By out-of-band mean that you use something else to write the tables, or even Seafowl, but then you separately persist the metadata to you metastore and expose it to Seafowl instances via clade.

mrchypark commented 2 months ago

Thank you for explaining clade to me. It seems very useful for working with multiple read-only Seafowl instances. However, it appears that to use this, each Seafowl instance would need to perform its own 'create table' operations, and then a server running clade would need to extract and maintain information about these tables. Is this correct? If so, could you suggest a method for extracting the table information from Seafowl instances so that the clade server can provide this information? I appreciate your help with this. Thank you in advance for any insights you can offer.

gruuya commented 2 months ago

However, it appears that to use this, each Seafowl instance would need to perform its own 'create table' operations, and then a server running clade would need to extract and maintain information about these tables. Is this correct?

Yes that is correct. Until the clade interface is extended to perform writes as well you're going to have to do it out-of-band.

If so, could you suggest a method for extracting the table information from Seafowl instances so that the clade server can provide this information?

Perhaps some custom service can get access to the PG/SQLite connection string and perform replication from them. Or something more out-of-the-box, e.g. https://github.com/superfly/litefs or https://www.splitgraph.com/blog/deploying-serverless-seafowl