k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Identical Batches Across Multiple GPUs #1616

Closed yfyeung closed 1 day ago

yfyeung commented 2 weeks ago

branch: current master
environment: torch2.0.1 + Python3.8.19 + cuda11.8
recipe: egs/librispeech/zipformer
command:

./zipformer/train.py   --world-size 4   --num-epochs 30   --start-epoch 1   --use-fp16 1   --exp-dir zipformer/exp   --full-libri 1   --max-duration 1000  --shuffle 0  --enable-musan 0

log:

2024-05-04 08:36:29,461 INFO [train.py:943] (2/4) ['1183-133255-0004-2639', '6072-54656-0015-11450', '2570-157243-0029-58394', '5570-73846-0017-8368', '6102-56170-0057-121783', '7384-84010-0032-29225', '1636-141789-0029-146603', '6030-57827-0025-137357', '925-8140-0026-31343', '3630-11612-0025-40499', '1740-141148-0112-29396', '177-55218-0004-79755', '3744-178594-0004-119696', '5132-33409-0006-20988', '7307-276146-0042-96515', '4779-85498-0030-45921', '551-129024-0078-41444', '770-43321-0006-32993', '899-126232-0034-3616', '7046-85651-0033-71616', '4546-16813-0000-29318', '6385-220959-0032-6973', '8630-305212-0008-6055', '1296-138074-0018-51711', '441-130108-0010-28107', '4770-25951-0040-81888', '8635-295759-0012-32578', '8609-283227-0050-21705', '7258-91906-0011-25873', '6078-54007-0042-18565', '30-4447-0029-26332', '8846-305208-0030-144443', '1414-130538-0019-139813', '1031-133220-0056-96355', '5350-205002-0026-97100', '3368-170952-0006-9173', '6010-56788-0052-6431', '7383-95441-0081-7985', '1776-142744-0041-79419', '7507-100463-0004-99602', '4071-39913-0047-49654', '8152-258993-0009-31952', '782-126738-0120-84935', '1289-288044-0041-55926', '3261-154309-0075-39003', '500-125123-0114-70901', '8536-244441-0073-85073', '5588-68188-0033-90416', '5002-70998-0027-90632', '5874-52159-0049-144807', '2085-147970-0002-96999', '7285-72207-0005-74263', '8044-84200-0048-12961', '317-130248-0026-140457', '7198-80654-0017-64455', '1184-135532-0021-75772', '7484-39971-0000-50886', '5220-69519-0009-100746', '3928-10094-0020-66058', '6075-57156-0035-44850', '6743-72306-0054-22134', '3588-180957-0002-61885']
2024-05-04 08:36:29,778 INFO [train.py:943] (1/4) ['1183-133255-0004-2639', '6072-54656-0015-11450', '2570-157243-0029-58394', '5570-73846-0017-8368', '6102-56170-0057-121783', '7384-84010-0032-29225', '1636-141789-0029-146603', '6030-57827-0025-137357', '925-8140-0026-31343', '3630-11612-0025-40499', '1740-141148-0112-29396', '177-55218-0004-79755', '3744-178594-0004-119696', '5132-33409-0006-20988', '7307-276146-0042-96515', '4779-85498-0030-45921', '551-129024-0078-41444', '770-43321-0006-32993', '899-126232-0034-3616', '7046-85651-0033-71616', '4546-16813-0000-29318', '6385-220959-0032-6973', '8630-305212-0008-6055', '1296-138074-0018-51711', '441-130108-0010-28107', '4770-25951-0040-81888', '8635-295759-0012-32578', '8609-283227-0050-21705', '7258-91906-0011-25873', '6078-54007-0042-18565', '30-4447-0029-26332', '8846-305208-0030-144443', '1414-130538-0019-139813', '1031-133220-0056-96355', '5350-205002-0026-97100', '3368-170952-0006-9173', '6010-56788-0052-6431', '7383-95441-0081-7985', '1776-142744-0041-79419', '7507-100463-0004-99602', '4071-39913-0047-49654', '8152-258993-0009-31952', '782-126738-0120-84935', '1289-288044-0041-55926', '3261-154309-0075-39003', '500-125123-0114-70901', '8536-244441-0073-85073', '5588-68188-0033-90416', '5002-70998-0027-90632', '5874-52159-0049-144807', '2085-147970-0002-96999', '7285-72207-0005-74263', '8044-84200-0048-12961', '317-130248-0026-140457', '7198-80654-0017-64455', '1184-135532-0021-75772', '7484-39971-0000-50886', '5220-69519-0009-100746', '3928-10094-0020-66058', '6075-57156-0035-44850', '6743-72306-0054-22134', '3588-180957-0002-61885']
2024-05-04 08:36:29,798 INFO [train.py:943] (3/4) ['1183-133255-0004-2639', '6072-54656-0015-11450', '2570-157243-0029-58394', '5570-73846-0017-8368', '6102-56170-0057-121783', '7384-84010-0032-29225', '1636-141789-0029-146603', '6030-57827-0025-137357', '925-8140-0026-31343', '3630-11612-0025-40499', '1740-141148-0112-29396', '177-55218-0004-79755', '3744-178594-0004-119696', '5132-33409-0006-20988', '7307-276146-0042-96515', '4779-85498-0030-45921', '551-129024-0078-41444', '770-43321-0006-32993', '899-126232-0034-3616', '7046-85651-0033-71616', '4546-16813-0000-29318', '6385-220959-0032-6973', '8630-305212-0008-6055', '1296-138074-0018-51711', '441-130108-0010-28107', '4770-25951-0040-81888', '8635-295759-0012-32578', '8609-283227-0050-21705', '7258-91906-0011-25873', '6078-54007-0042-18565', '30-4447-0029-26332', '8846-305208-0030-144443', '1414-130538-0019-139813', '1031-133220-0056-96355', '5350-205002-0026-97100', '3368-170952-0006-9173', '6010-56788-0052-6431', '7383-95441-0081-7985', '1776-142744-0041-79419', '7507-100463-0004-99602', '4071-39913-0047-49654', '8152-258993-0009-31952', '782-126738-0120-84935', '1289-288044-0041-55926', '3261-154309-0075-39003', '500-125123-0114-70901', '8536-244441-0073-85073', '5588-68188-0033-90416', '5002-70998-0027-90632', '5874-52159-0049-144807', '2085-147970-0002-96999', '7285-72207-0005-74263', '8044-84200-0048-12961', '317-130248-0026-140457', '7198-80654-0017-64455', '1184-135532-0021-75772', '7484-39971-0000-50886', '5220-69519-0009-100746', '3928-10094-0020-66058', '6075-57156-0035-44850', '6743-72306-0054-22134', '3588-180957-0002-61885']
2024-05-04 08:36:29,931 INFO [train.py:943] (0/4) ['1183-133255-0004-2639', '6072-54656-0015-11450', '2570-157243-0029-58394', '5570-73846-0017-8368', '6102-56170-0057-121783', '7384-84010-0032-29225', '1636-141789-0029-146603', '6030-57827-0025-137357', '925-8140-0026-31343', '3630-11612-0025-40499', '1740-141148-0112-29396', '177-55218-0004-79755', '3744-178594-0004-119696', '5132-33409-0006-20988', '7307-276146-0042-96515', '4779-85498-0030-45921', '551-129024-0078-41444', '770-43321-0006-32993', '899-126232-0034-3616', '7046-85651-0033-71616', '4546-16813-0000-29318', '6385-220959-0032-6973', '8630-305212-0008-6055', '1296-138074-0018-51711', '441-130108-0010-28107', '4770-25951-0040-81888', '8635-295759-0012-32578', '8609-283227-0050-21705', '7258-91906-0011-25873', '6078-54007-0042-18565', '30-4447-0029-26332', '8846-305208-0030-144443', '1414-130538-0019-139813', '1031-133220-0056-96355', '5350-205002-0026-97100', '3368-170952-0006-9173', '6010-56788-0052-6431', '7383-95441-0081-7985', '1776-142744-0041-79419', '7507-100463-0004-99602', '4071-39913-0047-49654', '8152-258993-0009-31952', '782-126738-0120-84935', '1289-288044-0041-55926', '3261-154309-0075-39003', '500-125123-0114-70901', '8536-244441-0073-85073', '5588-68188-0033-90416', '5002-70998-0027-90632', '5874-52159-0049-144807', '2085-147970-0002-96999', '7285-72207-0005-74263', '8044-84200-0048-12961', '317-130248-0026-140457', '7198-80654-0017-64455', '1184-135532-0021-75772', '7484-39971-0000-50886', '5220-69519-0009-100746', '3928-10094-0020-66058', '6075-57156-0035-44850', '6743-72306-0054-22134', '3588-180957-0002-61885']

pzelasko commented 1 week ago

Can you explicitly pass rank and world_size arguments to the sampler and see if the issue persists? It attempts auto-detection but maybe it failed for some reason in your configuration (if that’s the case let’s try to find out why).

yfyeung commented 1 week ago

> Can you explicitly pass rank and world_size arguments to the sampler and see if the issue persists? It attempts auto-detection but maybe it failed for some reason in your configuration (if that’s the case let’s try to find out why).

After doing that, it's working as expected. It seems this issue occurs only on the virtual machine node, likely due to the specific environment of the machine itself.
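
For anyone hitting the same thing, here is a minimal sketch of the workaround. The helper names `sampler_dist_kwargs` and `shard` are hypothetical and only for illustration. `shard` is a toy stand-in, not lhotse's actual partitioning logic, and the "rank 0 of 1" fallback is an assumption about what failed auto-detection looks like, chosen because it matches the identical batches in the log above. The lhotse call is shown only in a comment, since it needs lhotse and a real CutSet.

```python
from typing import Any, Dict, List, Sequence


def sampler_dist_kwargs(rank: int, world_size: int) -> Dict[str, Any]:
    """Hypothetical helper: keyword arguments that pin a sampler to one process's shard."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    return {"rank": rank, "world_size": world_size}


def shard(items: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Toy stand-in for a distributed sampler: every world_size-th item, offset by rank."""
    return list(items[rank::world_size])


# With lhotse this would look roughly like the following (not executed here;
# `cuts` is assumed to be the training CutSet, `rank` and `world_size` the
# values the training process was launched with):
#
#   from lhotse.dataset import DynamicBucketingSampler
#   sampler = DynamicBucketingSampler(
#       cuts,
#       max_duration=1000,
#       shuffle=False,
#       **sampler_dist_kwargs(rank, world_size),
#   )

cut_ids = [f"cut-{i}" for i in range(8)]

# Assumed failure mode: every process believes it is rank 0 of 1,
# so all four processes iterate over exactly the same data.
fallback = [shard(cut_ids, 0, 1) for _ in range(4)]

# With explicit values, each of the four processes gets a disjoint slice.
explicit = [shard(cut_ids, r, 4) for r in range(4)]
```

Passing the values explicitly takes the environment out of the picture, which is why it is a useful first check before digging into why detection misbehaves on a particular node.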