Closed by hcho3 3 years ago
Triage data thus far:
- Lowering a parameter in test_model.py dramatically increases the number of failures; increasing it to 12 in local testing made the problem go away entirely.
- Calling cudaDeviceSynchronize or cudaStreamSynchronize on the stream for the model instance's raft handle immediately after FIL prediction eliminates the error (a minimal sketch of this pattern follows below).
- Another note for triaging: throughput on these tests is quite high relative to V100 and RTX8000 (roughly 3 times higher). This may be revealing a race condition, and if so #77 may be related.
Tests fail when using an AWS G4 instance (T4 GPU).
Steps to reproduce:
docker build -t triton_fil -f ops/Dockerfile .
LOCAL=1 ./qa/run_tests.sh
When I switched the instance to the p3.2xlarge type (V100 GPU), the tests ran successfully.
Error messages:
lightgbm model
xgboost model