pingcap-inc / tidb2dw

Replicate data changes from TiDB to Data Warehouses in real-time
MIT License

Snapshot with generated column type doesn't work #93

Closed eugen-korentsov closed 7 months ago

eugen-korentsov commented 9 months ago

How to reproduce:

CREATE TABLE `ref_test` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `text` VARCHAR(64) COLLATE utf8_unicode_ci DEFAULT NULL,
  `ref_normalized` VARCHAR(64) AS (REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(LOWER(TRIM(`text`)), '-', ''), '/', ''), ')', ''), '!', ''), '+', ''), '"', ''), '.', ''), '''', ''), '(', ''), '|', ''), '#', ''), ',', ''), '_', ''), '*', '')) VIRTUAL,
  PRIMARY KEY (`id`),
  INDEX `idx_ref_normalized` (`ref_normalized`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

INSERT INTO `ref_test` (`text`) VALUES ("Example-Text/1!");
INSERT INTO `ref_test` (`text`) VALUES ("Another+Example\"Text.2");
INSERT INTO `ref_test` (`text`) VALUES ("More#Text|With*Characters");

Run tidb2dw for this table; it fails with:

[2024/01/19 15:48:15.462 +00:00] [ERROR] [snapshot.go:125] ["Failed to load snapshot data into data warehouse"] [table=database.ref_test] [error="Bigquery load snapshot job completed with error: {Location: \"\"; Message: \"Error while reading data, error message: CSV processing encountered too many errors, giving up. Rows: 0; errors: 3; max bad: 0; error percent: 0\"; Reason: \"invalid\"}"] [errorVerbose="Bigquery load snapshot job completed with error: {Location: \"\"; Message: \"Error while reading data, error message: CSV processing encountered too many errors, giving up. Rows: 0; errors: 3; max bad: 0; error percent: 0\"; Reason: \"invalid\"}\ngithub.com/pingcap-inc/tidb2dw/pkg/bigquerysql.loadGCSFileToBigQuery\n\t/home/eugene/work/database/tidb2dw/pkg/bigquerysql/operation.go:52\ngithub.com/pingcap-inc/tidb2dw/pkg/bigquerysql.(*BigQueryConnector).LoadSnapshot\n\t/home/eugene/work/database/tidb2dw/pkg/bigquerysql/connector.go:112\ngithub.com/pingcap-inc/tidb2dw/replicate.(*SnapshotReplicateSession).Run.func2.1\n\t/home/eugene/work/database/tidb2dw/replicate/snapshot.go:124\nruntime.goexit\n\t/usr/lib/go-1.21/src/runtime/asm_amd64.s:1650"] [path=database.ref_test.0000000010000.csv]
[2024/01/19 15:48:15.463 +00:00] [ERROR] [snapshot.go:176] ["Failed to load snapshot"] [table=database.ref_test] [error="Failed to load snapshot data into data warehouse, error files: [database.ref_test.0000000010000.csv]"] [errorVerbose="Failed to load snapshot data into data warehouse, error files: [database.ref_test.0000000010000.csv]\ngithub.com/pingcap-inc/tidb2dw/replicate.(*SnapshotReplicateSession).Run\n\t/home/eugene/work/database/tidb2dw/replicate/snapshot.go:145\ngithub.com/pingcap-inc/tidb2dw/replicate.StartReplicateSnapshot\n\t/home/eugene/work/database/tidb2dw/replicate/snapshot.go:175\ngithub.com/pingcap-inc/tidb2dw/cmd.Replicate.func1\n\t/home/eugene/work/database/tidb2dw/cmd/core.go:271\nruntime.goexit\n\t/usr/lib/go-1.21/src/runtime/asm_amd64.s:1650"]

csv:

1,"Example-Text/1!"
2,"Another+Example""Text.2"
3,"More#Text|With*Characters"

So the generated column is not present in the CSV at all. I tried both the STORED and VIRTUAL options, but unfortunately neither of them worked.

breezewish commented 9 months ago

Maybe we should exclude the generated column, as CDC will not output generated columns at all. cc @Lloyd-Pottiger
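
For reference, a minimal sketch (not the actual tidb2dw code) of how generated columns could be detected through information_schema so they can be skipped when building the target schema. The connection string and the schema/table names are placeholders:

// Sketch only: find generated columns for a table via information_schema,
// so they can be excluded from the exported column set.
package main

import (
	"database/sql"
	"fmt"
	"strings"

	_ "github.com/go-sql-driver/mysql"
)

func generatedColumns(db *sql.DB, schema, table string) ([]string, error) {
	rows, err := db.Query(
		`SELECT COLUMN_NAME, EXTRA FROM information_schema.COLUMNS
		 WHERE TABLE_SCHEMA = ? AND TABLE_NAME = ?`, schema, table)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var cols []string
	for rows.Next() {
		var name, extra string
		if err := rows.Scan(&name, &extra); err != nil {
			return nil, err
		}
		// EXTRA is "VIRTUAL GENERATED" or "STORED GENERATED" for generated columns.
		if strings.Contains(extra, "GENERATED") {
			cols = append(cols, name)
		}
	}
	return cols, rows.Err()
}

func main() {
	db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	cols, err := generatedColumns(db, "database", "ref_test")
	if err != nil {
		panic(err)
	}
	fmt.Println("generated columns:", cols) // e.g. [ref_normalized]
}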

eugen-korentsov commented 9 months ago

If CDC does not provide an option to exclude specified columns, then the only way is to exclude the generated columns; I agree with @breezewish. If it is possible to ignore columns in CDC, it might also make sense to add a parameter to tidb2dw for excluding columns, for example tidb2dw --ignore-columns ref_normalized,other_column.
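
If such a flag were added, the filtering itself would be simple. A hypothetical sketch (the --ignore-columns flag and the dropIgnored helper are assumptions from this thread, not an existing tidb2dw option):

package main

import "fmt"

// dropIgnored returns the columns that are not in the ignore list.
func dropIgnored(columns, ignored []string) []string {
	skip := make(map[string]struct{}, len(ignored))
	for _, c := range ignored {
		skip[c] = struct{}{}
	}
	var kept []string
	for _, c := range columns {
		if _, ok := skip[c]; !ok {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	cols := []string{"id", "text", "ref_normalized"}
	fmt.Println(dropIgnored(cols, []string{"ref_normalized"})) // [id text]
}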

KarynaZubko commented 9 months ago

Hello guys :) I have run into the same problem with generated column types that @eugen-korentsov described above. I would be very grateful if you could fix this limitation as soon as possible, because it is the biggest blocker for my project right now.

eugen-korentsov commented 9 months ago

For now, the BigQuery flag --allow_jagged_rows (or setting gcsRef.AllowJaggedRows = true in pkg/bigquerysql/operation.go) helped; it just puts a NULL value into the generated column.
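
For anyone hitting this before a fix lands, a minimal sketch of that workaround using the standard cloud.google.com/go/bigquery client; the project, dataset, table, and GCS URI are placeholders, and this is not the actual tidb2dw loader code:

package main

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

func loadSnapshotCSV(ctx context.Context, projectID, datasetID, tableID, gcsURI string) error {
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return err
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference(gcsURI)
	gcsRef.SourceFormat = bigquery.CSV
	// Workaround from this thread: accept rows that are shorter than the
	// table schema; the missing trailing (generated) column is loaded as NULL.
	gcsRef.AllowJaggedRows = true

	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}
	if status.Err() != nil {
		return fmt.Errorf("load job completed with error: %v", status.Err())
	}
	return nil
}

func main() {
	ctx := context.Background()
	err := loadSnapshotCSV(ctx, "my-project", "my_dataset", "ref_test",
		"gs://my-bucket/database.ref_test.0000000010000.csv")
	if err != nil {
		panic(err)
	}
}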

Lloyd-Pottiger commented 7 months ago

fixed by ce3ed47e0eb495ff01ecc6fb7dda5a7cb3bd12ba