privacysandbox / protected-auction-key-value-service

Protected Auction Key/Value Service
Apache License 2.0
54 stars 20 forks source link

Timeout when loading a large WASM file #58

Closed fhoering closed 3 months ago

fhoering commented 3 months ago

When loading a large WASM file (~30 MB) we are getting this error:

E0604 08:09:32.787892     513 riegeli_stream_io.h:334] Record callback failed to process some records with: INTERNAL: Timed out setting UDF code object.
I0604 08:09:32.787969     513 riegeli_stream_io.h:342] Done reading 1 records in shard: [0,8388608] in 1292.13 ms.

It prevents using this file for UDFs. Is it possible to increase this timeout somehow ?

peiwenhu commented 3 months ago

We're looking into this and will get back to you asap

peiwenhu commented 3 months ago

Hi,

If you run this locally, could you invoke it with flag --udf_timeout=120s or some large duration (currently the timeout is only 5s). Using the quick start guide as an example this flag would need to be added here like other flags.

If you run this on AWS, could you change this parameter in parameter store

peiwenhu commented 3 months ago

Sorry I meant --udf_update_timeout=120s for local

fhoering commented 3 months ago

OK. Thanks. @peiwenhu. I can try this yes. But does this actually mean that my UDF takes more than 5 seconds to reply ?

Because the error happens when the DELTA file is loaded, not when the request is executed.

Given the stack trace I was thinking it might come from the riegeli record reader for example: (https://github.com/google/riegeli/blob/914e276f53b9c3b18528cbf8908f1ff92310195e/riegeli/base/parallelism.cc#L61)

peiwenhu commented 3 months ago

The UDF is loaded when it's DELTA file is loaded so that's expected.

There are a few possibilities. It's not easy for us to conclude yet with only the information available here. By increasing the timeout we will have a better understanding. On our end, loading 30mb doesn't time out. So it might be a difference between your machine and ours or some other reason. If after increasing timeout to a sufficiently large number you still run into this issue, that'd be a strong indicator that something else is fishy.

fhoering commented 3 months ago

Ok. I will try this. I got confused at one point because this flag is only available in the main branch not in 0.16 release. But I will build a new binary.

fhoering commented 3 months ago

@peiwenhu Ok. It seems better. I'm able to run this WASM file now.

Did the logs change in the latest version ? Before I was running like this to get all logs including JS output:

--logging_verbosity_level=9 --stderrthreshold=0
lusayaa commented 3 months ago

Since you are running locally, could you build with the nonprod_mode flag? The JS logs won't show in prod mode, but will in nonprod_mode.

Here are the updated instructions for building locally: https://github.com/privacysandbox/protected-auction-key-value-service/blob/main/docs/deployment/deploying_locally.md#build-the-local-binary

fhoering commented 3 months ago

@lusayaa I tried to do compile a new version with this flag without success. Compiled main with:

./builders/tools/bazel-debian build //components/data_server/server:server \
  --config=local_instance \
  --config=local_platform \
  --config=nonprod_mode
./server --delta_directory=/tmp/deltas --realtime_directory=/tmp/realtime --port=123 --route_v1_to_v2=true --logging_verbosity_level=9 --stderrthreshold=0 --udf_update_timeout=120s --udf_timeout=120s

Those are the logs I see but a lot are missing:

Warning: server token is not set, consent log is turned off.
I0610 11:22:10.599415     101 sandbox.cc:322] CALL ENTRY: 'RunCodeFromSerializedData' with 3 argument(s)
I0610 11:22:10.599750     101 sandbox.cc:277] Synchronization (TO), ptr 0x3516ffa05ec0, Type: 3 for var: LenVal
I0610 11:22:10.599779     101 var_abstract.cc:96] TransferToSandboxee for: Structure of size: 16, local: 0x3516ffa15580, remote: 0x5573dc2f7420, size: 16
I0610 11:22:10.599793     101 var_abstract.cc:96] TransferToSandboxee for: Array, elem size: 1 B., total size: 0 B., nelems: 0, local: 0, remote: 0x5573dc2e6330, size: 0
I0610 11:22:10.599797     101 sandbox.cc:369] CALL ARG: (0), Type: Pointer, Size: 8, Val: Ptr to obj:0x3516ffa154a0 (type:'LengthValue' val:'LenVal'), local:(nil), remote:0x5573dc2f7420, size:0
I0610 11:22:10.599801     101 sandbox.cc:369] CALL ARG: (1), Type: Integer, Size: 4, Val: 296
I0610 11:22:10.599903     101 sandbox.cc:369] CALL ARG: (2), Type: Pointer, Size: 8, Val: Ptr to obj:0x7f56814f5708 (type:'Integer' val:'0'), local:0x7f56814f5748, remote:0x5573dc3a6850, size:8
I0610 11:22:12.928461     101 sandbox.cc:296] Synchronization (FROM), ptr 0x3516ffa05ec0, Type: 3 for var: LenVal
I0610 11:22:12.928508     101 var_abstract.cc:133] TransferFromSandboxee for: Structure of size: 16, local: 0x3516ffa15580, remote: 0x5573dc2f7420, size: 16
I0610 11:22:12.928524     101 var_abstract.cc:133] TransferFromSandboxee for: Array, elem size: 1 B., total size: 0 B., nelems: 0, local: 0x3516ffa02380, remote: 0x5573dc2e6330, size: 0
I0610 11:22:12.928527     101 sandbox.cc:296] Synchronization (FROM), ptr 0x3516ffa20e40, Type: 2 for var: 0
I0610 11:22:12.928531     101 var_abstract.cc:133] TransferFromSandboxee for: 0, local: 0x7f56814f5748, remote: 0x5573dc3a6850, size: 8
I0610 11:22:12.928533     101 sandbox.cc:398] CALL EXIT: Type: Integer, Size: 4, Val: 0

It worked before. So something must be broken. Can you have a look please ?

emmafu2022 commented 3 months ago

Hi @fhoering , are you able to see other server log lines such as data loading? Specifically are you able to see the log line about setting UDF object, loading UDF delta file is done and server starts watching new files, something like the following lines:

I0610 10:03:11.123586 2428735 data_orchestrator.cc:244] Loading /tmp/deltas/<your UDF delta file>
...
I0610 10:03:11.124105 2428825 data_orchestrator.cc:214] Setting UDF code snippet for version: 2, handler: HandleRequest,
...
I0610 10:03:11.134653 2428735 data_orchestrator.cc:360] Done loading /tmp/deltas/<your UDF delta file>
...
I0610 10:03:16.135478 2428826 change_notifier_local.cc:57] Watching for new files in directory: /tmp/deltas
I0610 10:03:21.135626 2428826 change_notifier_local.cc:67] No new files found within timeout, stopping.
I0610 10:03:21.135661 2428826 delta_file_notifier.cc:134] Backup poll
I0610 10:03:21.135743 2428826 delta_file_notifier.cc:233] No new file found
I0610 10:03:21.135761 2428826 change_notifier_local.cc:57] Watching for new files in directory: /tmp/deltas
fhoering commented 3 months ago

No before on branch 0.16 I saw those log lines indeed. But with the instructions above on main branch I don't see them anymore.

Did you try to recompile that exact same way I did ?

emmafu2022 commented 3 months ago

@fhoering Yes, I compiled with exact command you used and I am able to see log lines I mentioned before.

To rule out the possibility of machine difference causing the issue. Could you please try to run the server locally in docker?

Commands to compile and run:

builders/tools/bazel-debian run //production/packaging/local/data_server:copy_to_dist --config=local_instance --config=local_platform --config=nonprod_mode
docker load -i dist/server_docker_image.tar
docker run -it --network=host -entrypoint=/server --init --rm --volume=/tmp/deltas:/tmp/deltas --volume=/tmp/realtime:/tmp/realtime --security-opt=seccomp=unconfined bazel/production/packaging/local/data_server:server_docker_image --port 50051 -route_v1_to_v2=true -logging_verbosity_level=9 -stderrthreshold=0 -udf_update_timeout=120s -udf_timeout=120s -delta_directory=/tmp/deltas -realtime_directory=/tmp/realtime
fhoering commented 3 months ago

@emmafu2022 With the latest instructions it seems fine. But even the artifact published in bazel-bin/components/data_server/server/server works now with a blank ubuntu:20.04 docker image. Not sure why exactly. So I will close for now and reopen if necessary. Thanks