mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

[BERT] Unable to achieve accuracy of 0.72. #659

Closed BiduCui closed 7 months ago

BiduCui commented 1 year ago

I am using the code from https://github.com/mlcommons/training_results_v2.1/tree/main/NVIDIA/benchmarks/bert/implementations/pytorch-22.09. Following the README instructions, I downloaded the dataset from the provided Google Drive link https://drive.google.com/drive/u/0/folders/1oQF4diVHNPCclykwdvQJw8n_VIWwV0PT and ran the prepare_data.sh script. I then trained BERT on 8 machines, each equipped with 8 A800 GPUs and 8 Mellanox 100G network cards. However, even after training for multiple epochs, my eval_acc consistently oscillates around 0.63.
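For reference, I track eval_acc by parsing the MLPerf result log. Here is a minimal sketch of that parsing, assuming the standard :::MLLOG JSON-line format emitted by the mlperf-logging package; the log path is just an example from my environment:

# Sketch: pull eval_accuracy records out of an MLPerf training log.
# Assumes the standard ':::MLLOG {json}' line format; LOG_PATH is hypothetical.
import json

LOG_PATH = "results/bert_train.log"

with open(LOG_PATH) as f:
    for line in f:
        if not line.startswith(":::MLLOG"):
            continue
        record = json.loads(line[len(":::MLLOG"):].strip())
        if record.get("key") == "eval_accuracy":
            # the metadata block usually carries the epoch the eval ran at
            meta = record.get("metadata", {})
            print(meta.get("epoch_num"), record["value"])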

Interestingly, when I removed the shuffle-related code from prepare_data.sh by commenting out lines 114 to 123

[screenshot: the shuffle block in prepare_data.sh, lines 114 to 123, commented out]

and instead used the old, non-shuffled commands for dataset preprocessing, eval_acc converged to 0.72.

Here is the old, non-shuffled version:

# create the output directory for variable-length shards
mkdir -p ${DATADIR}/hdf5/training-${SHARDS}/hdf5_${SHARDS}_shards_varlength
# use half of the logical CPUs for the parallel conversion
CPUS=$( ls -d /sys/devices/system/cpu/cpu[[:digit:]]* | wc -w )
CPUS=$((CPUS / 2))
# convert each fixed-length shard to variable-length, in parallel
ls -1 ${DATADIR}/hdf5/training-${SHARDS}/hdf5_${SHARDS}_shards_uncompressed | \
  xargs --max-args=1 --max-procs=${CPUS} -I{} python3 ${SCRIPT_DIR}/convert_fixed2variable.py \
    --input_hdf5_file ${DATADIR}/hdf5/training-${SHARDS}/hdf5_${SHARDS}_shards_uncompressed/{} \
    --output_hdf5_file ${DATADIR}/hdf5/training-${SHARDS}/hdf5_${SHARDS}_shards_varlength/{}
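For context, here is an illustrative sketch of what shuffling these HDF5 shards can look like. This is not the actual code from lines 114 to 123 of prepare_data.sh, and the key names are assumptions based on the common fixed-length MLPerf BERT layout:

# Illustrative sketch only: permute the sample order inside one fixed-length
# HDF5 shard with a fixed seed. NOT the repo's actual shuffle step; the key
# names are assumptions based on the usual MLPerf BERT fixed-length layout.
import h5py
import numpy as np

KEYS = ["input_ids", "input_mask", "segment_ids",
        "masked_lm_positions", "masked_lm_ids", "next_sentence_labels"]

def shuffle_shard(in_path, out_path, seed=12345):
    with h5py.File(in_path, "r") as fin, h5py.File(out_path, "w") as fout:
        n = fin["input_ids"].shape[0]
        perm = np.random.default_rng(seed).permutation(n)
        for key in KEYS:
            data = fin[key][:]  # load the full dataset for this key
            fout.create_dataset(key, data=data[perm])

shuffle_shard("part_00000.hdf5", "part_00000_shuffled.hdf5")  # example paths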

Based on my understanding, shuffling should not significantly affect BERT's final training accuracy, so this result has left me puzzled.
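To check whether the two preprocessing runs actually produce the same set of samples and only differ in order, one can compute an order-independent fingerprint over the shards. A minimal sketch, again assuming an input_ids dataset in each fixed-length shard; the directory globs are examples from my setup:

# Sketch: sanity-check that two preprocessed dataset directories contain the
# same samples overall (same count, same aggregate checksum), independent of
# order. The "input_ids" key is an assumption; the glob paths are examples.
import glob
import hashlib
import h5py

def dataset_fingerprint(shard_glob):
    count = 0
    digest = 0
    for path in sorted(glob.glob(shard_glob)):
        with h5py.File(path, "r") as f:
            rows = f["input_ids"][:]
            count += rows.shape[0]
            # XOR of per-sample hashes is order-independent
            for row in rows:
                h = hashlib.sha1(row.tobytes()).hexdigest()
                digest ^= int(h[:16], 16)
    return count, digest

print(dataset_fingerprint("shuffled/hdf5_*_shards_uncompressed/*.hdf5"))
print(dataset_fingerprint("unshuffled/hdf5_*_shards_uncompressed/*.hdf5"))

If the counts or digests disagree, the shuffle step changed or dropped data rather than merely reordering it, which would point at a preprocessing bug instead of shuffling itself.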