Closed by hcho3 3 years ago
Triage data thus far:
- Lowering a parameter in test_model.py dramatically increases the number of failures; increasing it to 12 in local testing made the problem go away entirely.
- Calling cudaDeviceSynchronize or cudaStreamSynchronize on the stream for the model instance's raft handle immediately after FIL prediction eliminates the error (a minimal sketch of this pattern follows below).
- Another note for triaging: throughput on these tests is quite high relative to V100 and RTX8000 (roughly 3 times higher). This may be revealing a race condition, and if so #77 may be related.
Tests fail when using an AWS G4 instance (T4 GPU).
Steps to reproduce:
docker build -t triton_fil -f ops/Dockerfile .
LOCAL=1 ./qa/run_tests.sh
When I switched the instance to the p3.2xlarge type (V100 GPU), the tests ran successfully.
Error messages:
lightgbm model
xgboost model