pachyderm / pachyderm

Data-Centric Pipelines and Data Versioning
https://www.pachyderm.com/
Apache License 2.0

Using incorrect backend configuration causes the worker to return an empty object #6496

Open echohack opened 3 years ago

echohack commented 3 years ago

Testing on

pachctl             2.0.0-beta.1
pachd               2.0.0-beta.1

I ran a set of tests designed to exercise PFS and gRPC's max transaction size on 2.0 and ran into an error before the test finished.

You can find the test in our customer-success repo: https://github.com/pachyderm/customer-success/tree/master/testing/split-transaction -- A full readme is available there on how to run the test.

Logs are below:

~/code/pachyderm-customer/split-transaction
$ sh test.sh
editing the currently active context "local-2021-06-03-13-43-55"
could not retrieve pipeline spec file from PFS for pipeline 'regression': rpc error: code = Unknown desc = commit 3dafb57365364aed9121706f68dddc2b not found in repo regression.spec
data/housing-simplified-0.csv 222.00 b / 222.00 b [==================================================] 0s 0.00 b/s
repo test-data cannot be in the provenance of its own branch
repo test-data cannot be in the provenance of its own branch
data/housing-simplified-0.csv 222.00 b / 222.00 b [==================================================] 0s 0.00 b/s
data/housing-simplified-1.csv 224.00 b / 224.00 b [==================================================] 0s 0.00 b/s
no result set for compaction work.TaskInfo
data/housing-simplified-2.csv 226.00 b / 226.00 b [==================================================] 0s 0.00 b/s
no result set for compaction work.TaskInfo
data/housing-simplified-3.csv 228.00 b / 228.00 b [==================================================] 0s 0.00 b/s
no result set for compaction work.TaskInfo
data/housing-simplified-4.csv 230.00 b / 230.00 b [==================================================] 0s 0.00 b/s
data/housing-simplified-5.csv 232.00 b / 232.00 b [==================================================] 0s 0.00 b/s
data/housing-simplified-6.csv 234.00 b / 234.00 b [==================================================] 0s 0.00 b/s
no result set for compaction work.TaskInfo
data/housing-simplified-7.csv 236.00 b / 236.00 b [==================================================] 0s 0.00 b/s
^C
~/code/pachyderm-customer/split-transaction
$ sh test.sh
editing the currently active context "local-2021-06-03-13-43-55"
repo repo already exists
data/housing-simplified-0.csv 222.00 b / 222.00 b [==================================================] 0s 0.00 b/s
repos chain-test-data.user not found
repo test-data cannot be in the provenance of its own branch
repo test-data cannot be in the provenance of its own branch
data/housing-simplified-0.csv 222.00 b / 222.00 b [==================================================] 0s 0.00 b/s
data/housing-simplified-1.csv 224.00 b / 224.00 b [==================================================] 0s 0.00 b/s
data/housing-simplified-2.csv 226.00 b / 226.00 b [==================================================] 0s 0.00 b/s
^C

~/code/pachyderm-customer/split-transaction
$ pachctl list repo
rpc error: code = Unknown desc = no result set for compaction work.TaskInfo

echohack commented 3 years ago

Update: the beta.2 release does some work to propagate this error up the stack. Now I'm seeing:

~/code/pachyderm-customer/split-transaction
$ sh test.sh
editing the currently active context "local-2021-06-03-13-43-55"
data/housing-simplified-0.csv 222.00 b / 222.00 b [============================================================================================================] 0s 0.00 b/s
repo test-data cannot be in the provenance of its own branch
repo test-data cannot be in the provenance of its own branch
data/housing-simplified-0.csv 222.00 b / 222.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-1.csv 224.00 b / 224.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-2.csv 226.00 b / 226.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-3.csv 228.00 b / 228.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-4.csv 230.00 b / 230.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-5.csv 232.00 b / 232.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-6.csv 234.00 b / 234.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-7.csv 236.00 b / 236.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-8.csv 238.00 b / 238.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-9.csv 240.00 b / 240.00 b [============================================================================================================] 0s 0.00 b/s
data/housing-simplified-10.csv 242.00 b / 242.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-11.csv 244.00 b / 244.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-12.csv 246.00 b / 246.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-13.csv 248.00 b / 248.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-14.csv 250.00 b / 250.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-15.csv 252.00 b / 252.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-16.csv 254.00 b / 254.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-17.csv 256.00 b / 256.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-18.csv 258.00 b / 258.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-19.csv 260.00 b / 260.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-20.csv 262.00 b / 262.00 b [===========================================================================================================] 0s 0.00 b/s
bad chunk. HAVE: 30653537353163303236653534336232653861623265623036303939646161316431653564663437373738663737383766616162343563646631326665336138 WANT: 66636236303062306461623439643634393332343732303634656535343536336261353930633361656435653537353633613533383238373635393835346233
data/housing-simplified-21.csv 264.00 b / 264.00 b [===========================================================================================================] 0s 0.00 b/s
bad chunk. HAVE: 30653537353163303236653534336232653861623265623036303939646161316431653564663437373738663737383766616162343563646631326665336138 WANT: 66636236303062306461623439643634393332343732303634656535343536336261353930633361656435653537353633613533383238373635393835346233
data/housing-simplified-22.csv 266.00 b / 266.00 b [===========================================================================================================] 0s 0.00 b/s
data/housing-simplified-23.csv 268.00 b / 268.00 b [===========================================================================================================] 0s 0.00 b/s
bad chunk. HAVE: 30653537353163303236653534336232653861623265623036303939646161316431653564663437373738663737383766616162343563646631326665336138 WANT: 66636236303062306461623439643634393332343732303634656535343536336261353930633361656435653537353633613533383238373635393835346233

On a completely fresh install.

$ pachctl version
COMPONENT           VERSION
pachctl             2.0.0-beta.2
pachd               2.0.0-beta.2
$ helm list
NAME    NAMESPACE   REVISION    UPDATED                                 STATUS      CHART                   APP VERSION
pachd   default     1           2021-07-16 12:13:16.633307 -0700 PDT    deployed    pachyderm-2.0.0-beta.2  2.0.0-beta.2
$ kubectl get all
NAME                               READY   STATUS             RESTARTS   AGE
pod/etcd-0                         1/1     Running            0          27m
pod/pachd-69dd54bcf9-tbhwc         1/1     Running            0          27m
pod/pipeline-regression-v1-rfzgp   1/2     CrashLoopBackOff   9          25m
pod/postgres-0                     1/1     Running            0          27m

NAME                                           DESIRED   CURRENT   READY   AGE
replicationcontroller/pipeline-regression-v1   1         1         0       25m

NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                                      AGE
service/etcd                     NodePort    10.98.234.249    <none>        2379:32379/TCP                                                               27m
service/etcd-headless            ClusterIP   None             <none>        2380/TCP                                                                     27m
service/kubernetes               ClusterIP   10.96.0.1        <none>        443/TCP                                                                      22d
service/pachd                    NodePort    10.97.199.164    <none>        1650:30650/TCP,1657:30657/TCP,1658:30658/TCP,1600:30600/TCP,1656:30656/TCP   27m
service/pachd-peer               ClusterIP   10.110.154.107   <none>        30653/TCP                                                                    27m
service/pipeline-regression-v1   ClusterIP   10.108.14.158    <none>        1080/TCP,9090/TCP                                                            25m
service/postgres                 ClusterIP   10.111.115.233   <none>        5432/TCP                                                                     27m
service/postgres-headless        ClusterIP   None             <none>        5432/TCP                                                                     27m

NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/pachd   1/1     1            1           27m

NAME                               DESIRED   CURRENT   READY   AGE
replicaset.apps/pachd-69dd54bcf9   1         1         1       27m

NAME                        READY   AGE
statefulset.apps/etcd       1/1     27m
statefulset.apps/postgres   1/1     27m

echohack commented 3 years ago

crash.txt

Crash logs for the pipeline container attached.

echohack commented 3 years ago

Hello,

We've found the root cause: this issue occurs when the Pachyderm deployment does not correctly deploy the PostgreSQL backend.

As a result, when running pipeline operations, pachd (in the worker sidecar) will return an empty object.

Below I've included the applicable log from the worker sidecar container:

2021-07-19T21:12:20Z INFO pfs.API.GetFileTAR {"request":{"file":{"commit":{"branch":{"repo":{"name":"regression","type":"spec"},"name":"master"},"id":"72532635d41748768000dc87b7b16681"},"path":"spec"}}}
2021-07-19T21:12:20Z ERROR pfs.API.GetFileTAR {"duration":0.012058697,"error":"bad chunk. HAVE: 30653537353163303236653534336232653861623265623036303939646161316431653564663437373738663737383766616162343563646631326665336138 WANT: 39323733323135346664623731346339343133343861333838303063313232393334663962393731303331333062363037306365363431353838633134396535","request":{"file":{"commit":{"branch":{"repo":{"name":"regression","type":"spec"},"name":"master"},"id":"72532635d41748768000dc87b7b16681"},"path":"spec"}},"stack":["github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.verifyData\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/transform.go:193","github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.Get.func1\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/transform.go:58","github.com/pachyderm/pachyderm/v2/src/internal/storage/kv.(*objectAdapter).Get.func1\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/kv/obj_adapter.go:42","github.com/pachyderm/pachyderm/v2/src/internal/storage/kv.(*objectAdapter).withBuffer\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/kv/obj_adapter.go:68","github.com/pachyderm/pachyderm/v2/src/internal/storage/kv.(*objectAdapter).Get\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/kv/obj_adapter.go:35","github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.(*trackedClient).Get\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/client.go:129","github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.Get\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/transform.go:57","github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.(*DataReader).Get\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/reader.go:74","github.com/pachyderm/pachyderm/v2/src/internal/storag
e/chunk.(*Reader).Get.func1\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/reader.go:46","github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.(*Reader).Iterate\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/reader.go:33","github.com/pachyderm/pachyderm/v2/src/internal/storage/chunk.(*Reader).Get\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/chunk/reader.go:45","github.com/pachyderm/pachyderm/v2/src/internal/storage/fileset/index.(*levelReader).setup\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/fileset/index/reader.go:164","github.com/pachyderm/pachyderm/v2/src/internal/storage/fileset/index.(*levelReader).Read\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/fileset/index/reader.go:143","io.ReadAtLeast\n\t/usr/local/go/src/io/io.go:328","io.ReadFull\n\t/usr/local/go/src/io/io.go:347","encoding/binary.Read\n\t/usr/local/go/src/encoding/binary/binary.go:166","github.com/pachyderm/pachyderm/v2/src/internal/pbutil.(*readWriter).ReadBytes\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/pbutil/pbutil.go:39","github.com/pachyderm/pachyderm/v2/src/internal/pbutil.(*readWriter).Read\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/pbutil/pbutil.go:57","github.com/pachyderm/pachyderm/v2/src/internal/storage/fileset/index.(*Reader).Iterate\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/fileset/index/reader.go:51","github.com/pachyderm/pachyderm/v2/src/internal/storage/fileset.(*Reader).Iterate\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/fileset/reader.go:48","github.com/pachyderm/pachyderm/v2/src/internal/storage/fileset.(*Storage).Size\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/fileset/storage.go:273","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*driver).comm
itSize\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/driver_file.go:517","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*driver).inspectCommit\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/driver.go:993","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*driver).openCommit\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/driver_file.go:128","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*driver).getFile\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/driver_file.go:176","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*apiServer).GetFileTAR.func3\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/api_server.go:514","github.com/pachyderm/pachyderm/v2/src/internal/storage/metrics.ReportRequestWithThroughput.func1\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/metrics/metrics.go:68","github.com/pachyderm/pachyderm/v2/src/internal/storage/metrics.ReportRequest\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/metrics/metrics.go:54","github.com/pachyderm/pachyderm/v2/src/internal/storage/metrics.ReportRequestWithThroughput\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/internal/storage/metrics/metrics.go:67","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*apiServer).GetFileTAR\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/api_server.go:512","github.com/pachyderm/pachyderm/v2/src/server/pfs/server.(*validatedAPIServer).GetFileTAR\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/server/pfs/server/val_server.go:154","github.com/pachyderm/pachyderm/v2/src/pfs._API_GetFileTAR_Handler\n\t/Users/avigil/Pachyderm/Pachyderm_releases/pachyderm/src/pfs/pfs.pb.go:4951"]}
2021-07-19T21:12:20Z ERROR error starting sidecar s3 gateway: sidecar s3 gateway: could not get pipeline details: could not retrieve pipeline spec file from PFS for pipeline 'regression': rpc error: code = Unknown desc = bad chunk. HAVE: 30653537353163303236653534336232653861623265623036303939646161316431653564663437373738663737383766616162343563646631326665336138 WANT: 39323733323135346664623731346339343133343861333838303063313232393334663962393731303331333062363037306365363431353838633134396535; retrying in 9459766755

As you can see, the worker here is referencing 30653537353163303236653534336232653861623265623036303939646161316431653564663437373738663737383766616162343563646631326665336138 which is a chunk of an empty object.
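
One way to check this is to decode the HAVE value from the log. The sketch below is illustrative Python, not Pachyderm code. It assumes the HAVE value is the hex encoding of an ASCII hex digest, and that chunk IDs are BLAKE2b-256 digests of the chunk data (an assumption not stated in this thread). `decode_have` is a hypothetical helper name.

```python
import hashlib

# HAVE value copied from the sidecar log above.
HAVE = (
    "3065353735316330323665353433623265386162326562303630393964616131"
    "6431653564663437373738663737383766616162343563646631326665336138"
)


def decode_have(have_hex: str) -> str:
    """Decode the doubly hex-encoded value into the digest string it spells out."""
    return bytes.fromhex(have_hex).decode("ascii")


# Assumption: chunk IDs are BLAKE2b-256 digests of the chunk data.
empty_digest = hashlib.blake2b(b"", digest_size=32).hexdigest()

digest = decode_have(HAVE)
print("decoded HAVE:", digest)
print("matches digest of empty input:", digest == empty_digest)
```

If the comparison holds, the data that was hashed had zero length, which is consistent with the object store returning nothing for the chunk.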

In this scenario, the process should not return an empty object. Instead, the error from obj.Client.Get should be checked and a user-friendly error should be returned.
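
As a rough illustration of that behavior, here is a minimal Python sketch (Pachyderm itself is written in Go, so this is not the actual fix). Every name in it (`ChunkReadError`, `chunk_id_for`, `get_chunk`, the `fetch` callable) is hypothetical, and the BLAKE2b-256 chunk ID is an assumption. The idea is to surface a storage read failure or an empty result as a readable error before hash verification reports a confusing "bad chunk".

```python
import hashlib
from typing import Callable


class ChunkReadError(Exception):
    """Raised when a chunk cannot be read or fails verification."""


def chunk_id_for(data: bytes) -> str:
    # Assumption: chunk IDs are hex BLAKE2b-256 digests of the chunk data.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def get_chunk(fetch: Callable[[str], bytes], chunk_id: str) -> bytes:
    """Fetch a chunk and fail with a readable message at the first problem."""
    try:
        data = fetch(chunk_id)
    except (OSError, KeyError) as err:
        # Report the storage error itself instead of continuing with no data.
        raise ChunkReadError(
            f"could not read chunk {chunk_id} from object storage; "
            "check the storage backend configuration"
        ) from err
    if not data:
        raise ChunkReadError(
            f"object storage returned an empty object for chunk {chunk_id}; "
            "check the storage backend configuration"
        )
    if chunk_id_for(data) != chunk_id:
        raise ChunkReadError(f"chunk {chunk_id} failed hash verification")
    return data


# Small demo with an in-memory dict standing in for object storage.
payload = b"price,rooms\n100,3\n"
store = {chunk_id_for(payload): payload}
print(get_chunk(store.__getitem__, chunk_id_for(payload)) == payload)
try:
    get_chunk(lambda _chunk_id: b"", chunk_id_for(payload))
except ChunkReadError as err:
    print(err)
```

The empty-object check runs before verification, so a misconfigured backend produces a message that points at configuration rather than at data corruption.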

Additionally, our Helm deployment process should be updated so that this scenario doesn't occur for common deployments. PR #6580 should fix the Helm part of the deployment.

There may still be undiscovered scenarios in which an empty object chunk is returned, which would cause this issue.

nitinjainsj commented 3 years ago

@echohack Can you please retest with beta.5? We just fixed a race condition in #6651.

If we don't have the same issue, can you mark this bug resolved?