BiduCui closed this issue 1 year ago.
I am using the code from https://github.com/mlcommons/training_results_v2.1/tree/main/NVIDIA/benchmarks/bert/implementations/pytorch-22.09. Following the README instructions, I downloaded the dataset from the provided Google Drive link https://drive.google.com/drive/u/0/folders/1oQF4diVHNPCclykwdvQJw8n_VIWwV0PT and executed the prepare_data.sh script. I trained BERT on 8 machines, each equipped with 8 A800 GPUs and 8 Mellanox 100 Gb/s NICs. However, even after training for multiple epochs, my eval_acc consistently oscillates around 0.63.
Interestingly, I then removed the shuffle-related code from prepare_data.sh by commenting out lines 114 to 123 and instead preprocessed the dataset with the old, non-shuffled commands. Here is the old non-shuffled version:
```bash
mkdir -p ${DATADIR}/hdf5/training-${SHARDS}/hdf5_${SHARDS}_shards_varlength

CPUS=$( ls -d /sys/devices/system/cpu/cpu[[:digit:]]* | wc -w )
CPUS=$((CPUS / 2))

ls -1 ${DATADIR}/hdf5/training-${SHARDS}/hdf5_${SHARDS}_shards_uncompressed | \
  xargs --max-args=1 --max-procs=${CPUS} -I{} python3 ${SCRIPT_DIR}/convert_fixed2variable.py \
    --input_hdf5file ${DATADIR}/hdf5/training-${SHARDS}/hdf5_${SHARDS}_shards_uncompressed/{} \
    --output_hdf5file ${DATADIR}/hdf5/training-${SHARDS}/hdf5_${SHARDS}_shards_varlength/{}
```
With the non-shuffled data, eval_acc converged to 0.72. Based on my understanding, shuffling should not significantly affect BERT's training accuracy, so this result has left me puzzled.
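For context on the comparison, my rough understanding is that the fixed-to-variable conversion step only trims each padded sample back to its real token length before writing the varlength shards. The sketch below is purely illustrative of that idea, assuming padded HDF5 inputs; it is not the actual convert_fixed2variable.py, and the HDF5 key names ("input_ids", "input_mask", "segment_ids") and the helper name fixed_to_variable are my assumptions, not taken from the repo:

```python
# Illustrative sketch only: NOT the actual convert_fixed2variable.py from the NVIDIA repo.
# Assumes fixed-length (padded) samples stored under the keys below; key names are assumptions.
import sys
import numpy as np
import h5py

def fixed_to_variable(in_path, out_path):
    with h5py.File(in_path, "r") as fin, h5py.File(out_path, "w") as fout:
        input_ids = fin["input_ids"][:]      # (num_samples, max_seq_len), zero-padded
        input_mask = fin["input_mask"][:]    # 1 for real tokens, 0 for padding
        segment_ids = fin["segment_ids"][:]

        vlen_int = h5py.vlen_dtype(np.dtype("int32"))
        n = input_ids.shape[0]
        out_ids = fout.create_dataset("input_ids", (n,), dtype=vlen_int)
        out_seg = fout.create_dataset("segment_ids", (n,), dtype=vlen_int)

        for i in range(n):
            real_len = int(input_mask[i].sum())   # actual (unpadded) sequence length
            out_ids[i] = input_ids[i, :real_len]  # keep only the real tokens
            out_seg[i] = segment_ids[i, :real_len]

if __name__ == "__main__":
    fixed_to_variable(sys.argv[1], sys.argv[2])
```

If that understanding is right, the conversion itself should not change which samples the model sees, which is why I suspect the shuffle step (lines 114 to 123) rather than the varlength conversion.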