osalvador / ReplicaDB

ReplicaDB is open source tool for database replication, designed for efficiently transferring bulk data between relational and non-relational databases
https://osalvador.github.io/ReplicaDB/
Apache License 2.0
415 stars 97 forks source link

postgres sink - extra data after last expected column #148

Closed ScottChapman closed 7 months ago

ScottChapman commented 1 year ago

Describe the bug Running replication between MySQL -> Postgres

./bin/replicadb --options-file replicadb.conf
2023-07-03 08:49:46,554 INFO  ReplicaDB:63 Running ReplicaDB version: 0.15.0
2023-07-03 08:49:46,559 INFO  ReplicaDB:66 Setting verbose mode DEBUG
2023-07-03 08:49:46,559 DEBUG ReplicaDB:67 ToolOptions{
    sourceConnect='jdbc:mysql://myserver:3306/bugs?useSSL=true&trustServerCertificate=true',
    sourceUser='whatever',
    sourcePassword='****',
    sourceTable='bugs.attach_data',
    sourceColumns='null',
    sourceWhere='null',
    sourceQuery='null',
    sinkConnect='jdbc:postgresql://myserver:5432/bugs',
    sinkUser='whatever',
    sinkPassword='****',
    sinkTable='attach_data',
    sinkStagingTable='null',
    sinkStagingSchema='null',
    sinkStagingTableAlias='null',
    sinkColumns='null',
    sinkDisableEscape=false,
    sinkDisableIndex=false,
    sinkDisableTruncate=false,
    sinkAnalyze=false,
    jobs=5,
    BandwidthThrottling=0,
    quotedIdentifiers=false,
    fetchSize=500,
    help=false,
    version=false,
    verbose=DEBUG,
    optionsFile='replicadb.conf',
    mode='complete',
    sentryDsn='https://400e0daa08c24a399b842048c409416f@o1002451.ingest.sentry.io/5962725',
    sourceConnectionParams={},
    sinkConnectionParams={},
    sourceFileFormat='null',
    sinkFileformat='null'}
2023-07-03 08:49:46,653 INFO  Sentry:27 Sentry enabled
2023-07-03 08:49:47,608 DEBUG MySQLManager:348 Calculating the chunks size with this sql: SELECT  CEIL(count(*) / 5) chunk_size, count(*) total_rows FROM bugs.attach_data
2023-07-03 08:49:47,673 DEBUG MySQLManager:353 chunkSize: 30311 totalNumberRows: 151555
2023-07-03 08:49:47,687 INFO  SqlManager:373 Truncating sink table with this command: TRUNCATE TABLE attach_data
2023-07-03 08:49:48,400 INFO  ReplicaTask:35 Starting  TaskId-4
2023-07-03 08:49:48,400 INFO  ReplicaTask:35 Starting  TaskId-1
2023-07-03 08:49:48,400 INFO  ReplicaTask:35 Starting  TaskId-3
2023-07-03 08:49:48,400 INFO  ReplicaTask:35 Starting  TaskId-2
2023-07-03 08:49:48,400 INFO  ReplicaTask:35 Starting  TaskId-0
2023-07-03 08:49:48,943 INFO  SqlManager:128 TaskId-2: Executing SQL statement: SELECT * FROM bugs.attach_data LIMIT ? OFFSET ?
2023-07-03 08:49:48,950 INFO  SqlManager:128 TaskId-0: Executing SQL statement: SELECT * FROM bugs.attach_data LIMIT ? OFFSET ?
2023-07-03 08:49:48,950 INFO  SqlManager:128 TaskId-1: Executing SQL statement: SELECT * FROM bugs.attach_data LIMIT ? OFFSET ?
2023-07-03 08:49:48,950 INFO  SqlManager:128 TaskId-4: Executing SQL statement: SELECT * FROM bugs.attach_data LIMIT ? OFFSET ?
2023-07-03 08:49:48,957 INFO  SqlManager:128 TaskId-3: Executing SQL statement: SELECT * FROM bugs.attach_data LIMIT ? OFFSET ?
2023-07-03 08:49:48,968 DEBUG SqlManager:133 TaskId-2: Using fetchSize for next query: 500
2023-07-03 08:49:48,968 DEBUG SqlManager:133 TaskId-3: Using fetchSize for next query: 500
2023-07-03 08:49:48,968 DEBUG SqlManager:133 TaskId-1: Using fetchSize for next query: 500
2023-07-03 08:49:48,968 DEBUG SqlManager:133 TaskId-4: Using fetchSize for next query: 500
2023-07-03 08:49:48,968 DEBUG SqlManager:133 TaskId-0: Using fetchSize for next query: 500
2023-07-03 08:49:48,971 INFO  SqlManager:148 TaskId-0: With args: 30311, 0,
2023-07-03 08:49:48,971 INFO  SqlManager:148 TaskId-1: With args: 30311, 30311,
2023-07-03 08:49:48,971 INFO  SqlManager:148 TaskId-3: With args: 30311, 90933,
2023-07-03 08:49:48,971 INFO  SqlManager:148 TaskId-4: With args: 30311, 121244,
2023-07-03 08:49:48,971 INFO  SqlManager:148 TaskId-2: With args: 30311, 60622,
2023-07-03 08:49:50,925 WARN  ConnManager:188 Options source-columns and sink-columns are null, getting from Source ResultSetMetaData: id,thedata
2023-07-03 08:49:50,932 INFO  PostgresqlManager:168 Copying data with this command: COPY attach_data (id,thedata) FROM STDIN WITH DELIMITER e'\x1f' ENCODING 'UTF-8'
2023-07-03 08:49:56,373 INFO  PostgresqlManager:168 Copying data with this command: COPY attach_data (id,thedata) FROM STDIN WITH DELIMITER e'\x1f' ENCODING 'UTF-8'
2023-07-03 08:50:40,643 INFO  PostgresqlManager:168 Copying data with this command: COPY attach_data (id,thedata) FROM STDIN WITH DELIMITER e'\x1f' ENCODING 'UTF-8'
2023-07-03 08:51:43,430 INFO  PostgresqlManager:168 Copying data with this command: COPY attach_data (id,thedata) FROM STDIN WITH DELIMITER e'\x1f' ENCODING 'UTF-8'
2023-07-03 08:52:32,807 INFO  PostgresqlManager:168 Copying data with this command: COPY attach_data (id,thedata) FROM STDIN WITH DELIMITER e'\x1f' ENCODING 'UTF-8'
2023-07-03 08:56:06,350 ERROR ReplicaTask:69 ERROR in TaskId-0 inserting data to sink table: [ERROR: invalid input syntax for type bytea
  Where: COPY attach_data, line 2, column thedata: "RCS file: /repository/ghost/src/client_side.c,v
retrieving revision 1.44.2.4
diff -u -r1.44.2.4 clie..."]
2023-07-03 09:00:08,807 ERROR ReplicaTask:69 ERROR in TaskId-1 inserting data to sink table: [ERROR: extra data after last expected column
  Where: COPY attach_data, line 2: "73111��ࡱ�\N\N\N\N\N\N\N\N\N\N\N\N\N\N\N\N>\N\N��    \N\N\N\N\N\N\N\N\N\N\N\N\N\N\N�..."]
2023-07-03 09:01:22,591 ERROR ReplicaTask:69 ERROR in TaskId-2 inserting data to sink table: [ERROR: extra data after last expected column
  Where: COPY attach_data, line 4: "348304�PNG\r\n\n\N\N\N\rIHDR\N\NL\N\N\N\N\N�K\nX\N\N\NJtEXtSoftware\NRRDtool, Tobias Oe..."]
2023-07-03 09:02:54,320 ERROR ReplicaTask:69 ERROR in TaskId-4 inserting data to sink table: [ERROR: extra data after last expected column
  Where: COPY attach_data, line 2: "907571PK\N\\N\N\N!\Nv2�\N\NZ
                                                                \N\N\[Content_Types].xml �(�\N\N\N\N\N\N\N\N\N\N..."]
2023-07-03 09:07:23,519 ERROR ReplicaTask:69 ERROR in TaskId-3 inserting data to sink table: [ERROR: extra data after last expected column
  Where: COPY attach_data, line 3: "621129����\NJFIF\N\N`\N`\N\N��\NC\N
                                                                              \\n\..."]
2023-07-03 09:07:23,523 ERROR ReplicaDB:134 Got exception running ReplicaDB:
java.util.concurrent.ExecutionException: org.postgresql.util.PSQLException: ERROR: invalid input syntax for type bytea
  Where: COPY attach_data, line 2, column thedata: "RCS file: /repository/ghost/src/client_side.c,v
retrieving revision 1.44.2.4
diff -u -r1.44.2.4 clie..."
    at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
    at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
    at org.replicadb.ReplicaDB.processReplica(ReplicaDB.java:115) [ReplicaDB-0.15.0.jar:0.15.0]
    at org.replicadb.ReplicaDB.main(ReplicaDB.java:47) [ReplicaDB-0.15.0.jar:0.15.0]
Caused by: org.postgresql.util.PSQLException: ERROR: invalid input syntax for type bytea
  Where: COPY attach_data, line 2, column thedata: "RCS file: /repository/ghost/src/client_side.c,v
retrieving revision 1.44.2.4
diff -u -r1.44.2.4 clie..."
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675) ~[postgresql-42.3.7.jar:42.3.7]
    at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1263) ~[postgresql-42.3.7.jar:42.3.7]
    at org.postgresql.core.v3.QueryExecutorImpl.endCopy(QueryExecutorImpl.java:1068) ~[postgresql-42.3.7.jar:42.3.7]
    at org.postgresql.core.v3.CopyInImpl.endCopy(CopyInImpl.java:49) ~[postgresql-42.3.7.jar:42.3.7]
    at org.replicadb.manager.PostgresqlManager.insertDataToTable(PostgresqlManager.java:134) ~[ReplicaDB-0.15.0.jar:0.15.0]
    at org.replicadb.ReplicaTask.call(ReplicaTask.java:65) ~[ReplicaDB-0.15.0.jar:0.15.0]
    at org.replicadb.ReplicaTask.call(ReplicaTask.java:15) ~[ReplicaDB-0.15.0.jar:0.15.0]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
2023-07-03 09:07:23,631 INFO  ReplicaDB:54 Total process time: 1057100ms

To Reproduce Steps to reproduce the behaviour: config above

Source table DDL id = mediumint(9) PK thedata = longblob

Sink table DDL in = integer thedata = bytea

ReplicaDB configuration options-file. Expected behavior A clear and concise description of what you expected to happen.

Additional context

osalvador commented 1 year ago

Hi @ScottChapman

To open a new issue, please, follow the issue template:

Describe the bug A clear and concise description of what the bug is, with verbose log.

To Reproduce Steps to reproduce the behaviour:

  1. Source table DDL
  2. Sink table DDL
  3. ReplicaDB configuration options-file.

Expected behavior A clear and concise description of what you expected to happen.

Additional context Add any other context about the problem here. Running environment (cloud, on premise, java version..), source and sink technologies (Oracle, MySQL, Postgres...)

ScottChapman commented 1 year ago

Hi @ScottChapman

To open a new issue, please, follow the issue template:

Describe the bug A clear and concise description of what the bug is, with verbose log.

To Reproduce Steps to reproduce the behaviour:

  1. Source table DDL
  2. Sink table DDL
  3. ReplicaDB configuration options-file.

Expected behavior A clear and concise description of what you expected to happen.

Additional context Add any other context about the problem here. Running environment (cloud, on premise, java version..), source and sink technologies (Oracle, MySQL, Postgres...)

Sorry, updated above. I'll close the other one since both errors are in same run. Thanks!

osalvador commented 1 year ago

Hi @ScottChapman

I have made a fix that may solve your issue. Can you try the latest release? https://github.com/osalvador/ReplicaDB/releases/tag/v0.15.1

Regards!

ScottChapman commented 1 year ago

Yes, better. I do get the error but much less frequently. Is it possible to dump any information that might be helpful in identifying the data that is causing the problem for me?

ScottChapman commented 1 year ago

@osalvador - are the copy commands kept some place? I assume there is a tmp file for the SQL commands that are executed? Is it possible to preserve that for debugging purposes?

ScottChapman commented 1 year ago

OK. I think I narrowed it down to the data that is problematic. There is a bunch of unicode data in it. I can probably clean it up a bit so I can share it if you are interested.

osalvador commented 1 year ago

Hi @ScottChapman

How was it?