pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-Compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at : https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0
37.05k stars 5.83k forks source link

importinto: support batch import multiple tables from files named in `dumpling` style #52142

Open D3Hunter opened 6 months ago

D3Hunter commented 6 months ago

Feature Request

Is your feature request related to a problem? Please describe:

Describe the feature you'd like:

With import into, we can now import into a single table using physical mode, but in many cases user will dump data using dumpling, then import, and there're many tables. Those files are named in this format. It would helpful if we import them together using import into statement too.

the syntax of importing multiple tables from files named in dumpling style is very similar with import into a single table:

IMPORT INTO * FROM <file-path> WITH table_filter = 'test.A*,foo.X', thread = 8 ...

The * in the statement is just a indicator to mean we are importing multiple tables.

This statement works as a syntax sugar for normal import into that only imports into single table, it works like this:

With the batch id, we can get the progress of this import job batch like:

mysql> SHOW IMPORT JOB BATCH 'xxx'
Progress: 10/100 jobs finished, 16 jobs running, 3 jobs failed, 1 jobs cancelled, 123G/1.2T data imported.

you can also query jobs contained in this batch by

SHOW IMPORT JOBS where batch = 'xxx'

to cancel the batch, use

CANCEL IMPORT JOB BATCH xxx

it will cancel all jobs that hasn't done yet, i.e. pending or running.

we only support S3/GCS as <file-path> temporarily, and tidb_enable_dist_task must be enabled to run this sql.

user cannot specified FORMAT xx clause explicitly, we will walk all files in <file-path> and determine it's format from the suffix, and only tables matches with filter is imported.

The thread param only indicate the max cpu usage when importing each table, and as all those jobs runs on TiDB Distributed eXecution Framework (DXF), the actual cpu usage depends how many resource managed by DXF, how many tables to import and the rules we used to assign thread to jobs.

Tasks

Describe alternatives you've considered:

lightning already support this, but import into works as a SQL, more user friendly, and integrated with global sort, so we would to have to feature in import into too.

Teachability, Documentation, Adoption, Migration Strategy:

lance6716 commented 6 months ago

a)

create a separate import into job for each target table with a thread <= thread param specified in the statement

the meaning of thread = 8 is a bit of ambiguous, maybe total_thread = 8 or thread_per_table = 8?

b)

consider the query syntax is SHOW IMPORT JOB GROUP, can we also add GROUP in IMPORT INTO? like IMPORT INTO * GROUP FROM

c)

The * in the statement is just a indicator to mean we are importing multiple tables.

Can we move the filter to here? when it's a string liternal parser can know it's ImportIntoGroupStmt, and when it's a identifier it's ImportIntoStmt

d)

create tables if not exists.

This statement will be complex as a parent statement of CREATE TABLE 😂 I don't like this idea

D3Hunter commented 6 months ago

a)

create a separate import into job for each target table with a thread <= thread param specified in the statement

the meaning of thread = 8 is a bit of ambiguous, maybe total_thread = 8 or thread_per_table = 8?

b)

consider the query syntax is SHOW IMPORT JOB GROUP, can we also add GROUP in IMPORT INTO? like IMPORT INTO * GROUP FROM

c)

The * in the statement is just a indicator to mean we are importing multiple tables.

Can we move the filter to here? when it's a string liternal parser can know it's ImportIntoGroupStmt, and when it's a identifier it's ImportIntoStmt

the description in the pr might change after spec is done and discussed by/with PM, just create this issue to link some prepare pr.

Frank945946 commented 6 months ago

@lance6716 b)

consider the query syntax is SHOW IMPORT JOB GROUP, can we also add GROUP in IMPORT INTO? like IMPORT INTO * GROUP FROM Q:What do you mean?

c)

The * in the statement is just a indicator to mean we are importing multiple tables.

Can we move the filter to here? when it's a string liternal parser can know it's ImportIntoGroupStmt, and when it's a identifier it's ImportIntoStmt Q: Do you mean "import into table_filter from " ?