Closed by awf 2 years ago
RLO CI failures in Azure batch - is this expected @ryotatomioka @acl33 ?
I quote from the output thereof:
2021-09-01 10:16:11.964994: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
./test/determinism_test.sh: line 7: 40 Aborted (core dumped) python src/train_over_expressions.py ${REPEAT_ARGS} --num_generations=4 --run_id=foo --save_all_models
so no, I definitely don't expect to see that, but it could be an infrastructure issue?
Rerunning on the same commit caused it to pass
Before: vmap initialized the result with zeros, then added each element into it
After: vmap initializes the result empty, then copies each value into the return slot
TODO (not in this PR): Use DPS (destination-passing style)
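To make the Before/After difference concrete, here is a minimal Python sketch. It is not the code from the PR: the function names are hypothetical, and plain Python lists stand in for whatever buffers the real vmap uses. It only illustrates the two fill strategies described above.

```python
from typing import Callable, List, Optional


def vmap_zeros_then_add(f: Callable[[float], float], xs: List[float]) -> List[float]:
    """'Before' strategy (hypothetical name): zero-fill, then accumulate."""
    out = [0.0] * len(xs)          # every slot is written once here...
    for i, x in enumerate(xs):
        out[i] += f(x)             # ...and read and written again here
    return out


def vmap_empty_then_copy(f: Callable[[float], float], xs: List[float]) -> List[float]:
    """'After' strategy (hypothetical name): unfilled slots, then copy in."""
    # None marks a slot that has not been filled yet; a real implementation
    # would presumably leave the memory uninitialized instead.
    out: List[Optional[float]] = [None] * len(xs)
    for i, x in enumerate(xs):
        out[i] = f(x)              # each slot is written exactly once
    return out  # type: ignore[return-value]


doubled_before = vmap_zeros_then_add(lambda x: 2.0 * x, [1.0, 2.0, 3.0])
doubled_after = vmap_empty_then_copy(lambda x: 2.0 * x, [1.0, 2.0, 3.0])
```

Both functions return the same values. The second avoids the zero-fill pass and the extra read per element. A destination-passing version would go one step further and have the callee write directly into a slot supplied by the caller.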