Open echohack opened 3 years ago
Update: the beta.2 release does some work to propagate this error up the stack. Now I'm seeing:
~/code/pachyderm-customer/split-transaction
$ sh test.sh
editing the currently active context "local-2021-06-03-13-43-55"
data/housing-simplified-0.csv 222.00 b / 222.00 b [============================================================================================================] 0s 0.00 b/s
repo test-data cannot be in the provenance of its own branch
repo test-data cannot be in the provenance of its own branch
data/housing-simplified-0.csv 222.00 b / 222.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-1.csv 224.00 b / 224.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-2.csv 226.00 b / 226.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-3.csv 228.00 b / 228.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-4.csv 230.00 b / 230.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-5.csv 232.00 b / 232.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-6.csv 234.00 b / 234.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-7.csv 236.00 b / 236.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-8.csv 238.00 b / 238.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-9.csv 240.00 b / 240.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-10.csv 242.00 b / 242.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-11.csv 244.00 b / 244.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-12.csv 246.00 b / 246.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-13.csv 248.00 b / 248.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-14.csv 250.00 b / 250.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-15.csv 252.00 b / 252.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-16.csv 254.00 b / 254.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-17.csv 256.00 b / 256.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-18.csv 258.00 b / 258.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-19.csv 260.00 b / 260.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-20.csv 262.00 b / 262.00 b [===========================================================================================================] 0s 0.00 b/s
bad chunk. HAVE: 30653537353163303236653534336232653861623265623036303939646161316431653564663437373738663737383766616162343563646631326665336138 WANT: 66636236303062306461623439643634393332343732303634656535343536336261353930633361656435653537353633613533383238373635393835346233
data/housing-simplified-21.csv 264.00 b / 264.00 b [===========================================================================================================] 0s 0.00 b/s
bad chunk. HAVE: 30653537353163303236653534336232653861623265623036303939646161316431653564663437373738663737383766616162343563646631326665336138 WANT: 66636236303062306461623439643634393332343732303634656535343536336261353930633361656435653537353633613533383238373635393835346233
data/housing-simplified-22.csv 266.00 b / 266.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-23.csv 268.00 b / 268.00 b [===========================================================================================================] 0s 0.00 b/s
bad chunk. HAVE: 30653537353163303236653534336232653861623265623036303939646161316431653564663437373738663737383766616162343563646631326665336138 WANT: 66636236303062306461623439643634393332343732303634656535343536336261353930633361656435653537353633613533383238373635393835346233
On a completely fresh install.
$ pachctl version
COMPONENT VERSION
pachctl 2.0.0-beta.2
pachd 2.0.0-beta.2
$ helm list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
pachd default 1 2021-07-16 12:13:16.633307 -0700 PDT deployed pachyderm-2.0.0-beta.2 2.0.0-beta.2
$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/etcd-0 1/1 Running 0 27m
pod/pachd-69dd54bcf9-tbhwc 1/1 Running 0 27m
pod/pipeline-regression-v1-rfzgp 1/2 CrashLoopBackOff 9 25m
pod/postgres-0 1/1 Running 0 27m
NAME DESIRED CURRENT READY AGE
replicationcontroller/pipeline-regression-v1 1 1 0 25m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/etcd NodePort 10.98.234.249 <none> 2379:32379/TCP 27m
service/etcd-headless ClusterIP None <none> 2380/TCP 27m
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 22d
service/pachd NodePort 10.97.199.164 <none> 1650:30650/TCP,1657:30657/TCP,1658:30658/TCP,1600:30600/TCP,1656:30656/TCP 27m
service/pachd-peer ClusterIP 10.110.154.107 <none> 30653/TCP 27m
service/pipeline-regression-v1 ClusterIP 10.108.14.158 <none> 1080/TCP,9090/TCP 25m
service/postgres ClusterIP 10.111.115.233 <none> 5432/TCP 27m
service/postgres-headless ClusterIP None <none> 5432/TCP 27m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/pachd 1/1 1 1 27m
NAME DESIRED CURRENT READY AGE
replicaset.apps/pachd-69dd54bcf9 1 1 1 27m
NAME READY AGE
statefulset.apps/etcd 1/1 27m
statefulset.apps/postgres 1/1 27m
Hello,
We've found the root cause: this issue occurs when the Pachyderm deployment does not correctly deploy the PostgreSQL backend.
As a result, when running pipeline operations, pachd (in the worker sidecar) will return an empty object.
Below I've included the applicable log from the worker sidecar container:
2021-07-19T21:12:20Z INFO pfs.API.GetFileTAR {"request":{"file":{"commit":{"branch":{"repo":{"name":"regression","type":"spec"},"name":"master"},"id":"72532635d41748768000dc87b7b16681"},"path":"spec"}}}
2021-07-19T21:12:20Z ERROR pfs.API.GetFileTAR {"duration":0.012058697,"error":"bad chunk. HAVE: 30653537353163303236653534336232653861623265623036303939646161316431653564663437373738663737383766616162343563646631326665336138 WANT: 39323733323135346664623731346339343133343861333838303063313232393334663962393731303331333062363037306365363431353838633134396535","request":{"file":{"commit":{"branch":{"repo":{"name":"regression","type":"spec"},"name":"master"},"id":"72532635d41748768000dc87b7b16681"},"path":"spec"}},"stack":["github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.verifyData\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/transform.go:193","github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.Get.func1\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/transform.go:58","github.com/pachyderm/pachyderm/v2/src/internal/storage/kv.(*objectAdapter).Get.func1\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/kv/obj_adapter.go:42","github.com/pachyderm/pachyderm/v2/src/internal/storage/kv.(*objectAdapter).withBuffer\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/kv/obj_adapter.go:68","github.com/pachyderm/pachyderm/v2/src/internal/storage/kv.(*objectAdapter).Get\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/kv/obj_adapter.go:35","github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.(*trackedClient).Get\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/client.go:129","github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.Get\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/transform.go:57","github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.(*DataReader).Get\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/reader.go:74","github.com/pachyderm/pachyderm/v2/src/internal/storag
e/chunk.(*Reader).Get.func1\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/reader.go:46","github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.(*Reader).Iterate\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/reader.go:33","github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.(*Reader).Get\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/reader.go:45","github.com/pachyderm/pachyderm/v2/src/internal/storage/fileset/index.(*levelReader).setup\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/fileset/index/reader.go:164","github.com/pachyderm/pachyderm/v2/src/internal/storage/fileset/index.(*levelReader).Read\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/fileset/index/reader.go:143","io.ReadAtLeast\n\t/usr/local/go/src/io/io.go:328","io.ReadFull\n\t/usr/local/go/src/io/io.go:347","encoding/binary.Read\n\t/usr/local/go/src/encoding/binary/binary.go:166","github.com/pachyderm/pachyderm/v2/src/internal/pbutil.(*readWriter).ReadBytes\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/pbutil/pbutil.go:39","github.com/pachyderm/pachyderm/v2/src/internal/pbutil.(*readWriter).Read\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/pbutil/pbutil.go:57","github.com/pachyderm/pachyderm/v2/src/internal/storage/fileset/index.(*Reader).Iterate\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/fileset/index/reader.go:51","github.com/pachyderm/pachyderm/v2/src/internal/storage/fileset.(*Reader).Iterate\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/fileset/reader.go:48","github.com/pachyderm/pachyderm/v2/src/internal/storage/fileset.(*Storage).Size\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/fileset/storage.go:273","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*driver).comm
itSize\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/driver_file.go:517","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*driver).inspectCommit\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/driver.go:993","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*driver).openCommit\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/driver_file.go:128","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*driver).getFile\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/driver_file.go:176","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*apiServer).GetFileTAR.func3\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/api_server.go:514","github.com/pachyderm/pachyderm/v2/src/internal/storage/metrics.ReportRequestWithThroughput.func1\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/metrics/metrics.go:68","github.com/pachyderm/pachyderm/v2/src/internal/storage/metrics.ReportRequest\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/metrics/metrics.go:54","github.com/pachyderm/pachyderm/v2/src/internal/storage/metrics.ReportRequestWithThroughput\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/metrics/metrics.go:67","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*apiServer).GetFileTAR\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/api_server.go:512","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*validatedAPIServer).GetFileTAR\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/val_server.go:154","github.com/pachyderm/pachyderm/v2/src/pfs._API_GetFileTAR_Handler\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/pfs/pfs.pb.go:4951"]}
2021-07-19T21:12:20Z ERROR error starting sidecar s3 gateway: sidecar s3 gateway: could not get pipeline details: could not retrieve pipeline spec file from PFS for pipeline 'regression': rpc error: code = Unknown desc = bad chunk. HAVE: 30653537353163303236653534336232653861623265623036303939646161316431653564663437373738663737383766616162343563646631326665336138 WANT: 39323733323135346664623731346339343133343861333838303063313232393334663962393731303331333062363037306365363431353838633134396535; retrying in 9459766755
As you can see, the worker here is referencing 30653537353163303236653534336232653861623265623036303939646161316431653564663437373738663737383766616162343563646631326665336138
which is a chunk of an empty object.
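For anyone who wants to check this against their own logs, here is a minimal sketch. It assumes chunk IDs are BLAKE2b-256 digests of the chunk bytes, which is not stated in this thread, and that the logged HAVE/WANT values are the hex encoding of the ASCII hex digest (the logged value is 128 characters, twice the length of a 32-byte digest in hex). The function names are made up for illustration.

```python
import hashlib

# HAVE value copied verbatim from the sidecar log above.
HAVE = (
    "3065353735316330323665353433623265386162326562303630393964616131"
    "6431653564663437373738663737383766616162343563646631326665336138"
)


def decode_chunk_id(logged: str) -> str:
    """Undo one layer of hex encoding to recover the ASCII hex digest."""
    return bytes.fromhex(logged).decode("ascii")


def is_empty_object_id(logged: str) -> bool:
    """Return True if the logged ID matches the digest of zero bytes.

    Assumption: chunk IDs are BLAKE2b-256 digests of the chunk contents.
    """
    empty_digest = hashlib.blake2b(b"", digest_size=32).hexdigest()
    return decode_chunk_id(logged) == empty_digest


if __name__ == "__main__":
    print("decoded HAVE:", decode_chunk_id(HAVE))
    print("matches empty-object digest:", is_empty_object_id(HAVE))
```

If the check reports a match for a HAVE value in your logs, the read very likely returned zero bytes rather than corrupted data.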
In this scenario, the process should not return an empty object. Instead, the error from obj.Client.Get should be checked
and a user-friendly error should be returned.
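To illustrate the behaviour we are after, here is a rough sketch in Python. It is not the pachd code: `store_get` is a hypothetical stand-in for obj.Client.Get, and `ChunkReadError` and `get_chunk` are names invented for this example. The idea is only that a failed or empty read should be reported at the point of the read, before the data reaches hash verification and shows up as "bad chunk".

```python
from typing import Callable


class ChunkReadError(Exception):
    """User-facing error for a chunk that could not be read from storage."""


def get_chunk(store_get: Callable[[str], bytes], chunk_id: str) -> bytes:
    """Fetch a chunk, surfacing backend failures instead of empty data.

    store_get is a hypothetical callable standing in for obj.Client.Get.
    """
    try:
        data = store_get(chunk_id)
    except Exception as err:
        # Report the backend error directly rather than swallowing it.
        raise ChunkReadError(
            f"could not read chunk {chunk_id} from object storage: {err}"
        ) from err
    if not data:
        # An empty payload would otherwise fail hash verification later
        # with an opaque HAVE/WANT mismatch.
        raise ChunkReadError(
            f"object storage returned no data for chunk {chunk_id}; "
            "check that the storage backend is deployed and reachable"
        )
    return data


if __name__ == "__main__":
    try:
        get_chunk(lambda _id: b"", "example-chunk-id")
    except ChunkReadError as err:
        print("error:", err)
```

With a check like this the user would see a message pointing at the storage backend instead of a hash mismatch.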
Additionally, our Helm deployment process should be updated so that this scenario doesn't occur for common deployments. PR #6580 should fix the Helm part of the deployment.
There may be other, still undiscovered scenarios in which an empty object chunk is returned and causes this issue.
@echohack Can you please retest with beta.5? We just fixed a race condition in #6651.
If the issue no longer reproduces, can you mark this bug resolved?
Testing on
I ran a set of tests designed to exercise PFS and gRPC's max transaction size on 2.0 and hit an error before the test finished.
You can find the test in our customer-success repo: https://github.com/pachyderm/customer-success/tree/master/testing/split-transaction -- A full readme is available there on how to run the test.
Logs are below: