k2-fsa / snowfall

Moved to https://github.com/k2-fsa/icefall
Apache License 2.0

Merge changes related to ddp from pre_refactor #155

Closed: danpovey closed this 3 years ago

danpovey commented 3 years ago

Seems to work from a quick test (not full run, obviously)

danpovey commented 3 years ago

BTW, I did notice that both processes were writing output to the screen, which I'm not sure is the intended behavior.

danpovey commented 3 years ago

Also, unlike the distributed setup we're using in mmi_bigram_train.py, I see this:

2021-04-12 13:18:01,201 INFO [sampling.py:523] Distributed training with world size of 2 detected (node's local rank is 1. Splitting cuts into 2 partitions (this partition has cut IDs range [(2784, 5567)].
2021-04-12 13:18:01,201 INFO [asr_datamodule.py:181] About to create dev dataloader
2021-04-12 13:18:01,202 INFO [mmi_att_transformer_train.py:399] About to create model
2021-04-12 13:18:01,424 INFO [asr_datamodule.py:168] About to create dev dataset
2021-04-12 13:18:01,472 INFO [sampling.py:523] Distributed training with world size of 2 detected (node's local rank is 0. Splitting cuts into 2 partitions (this partition has cut IDs range [(0, 2784)].

... instead of both saying their rank is 0. The output above is more like what I'd expect.

csukuangfj commented 3 years ago

> ... instead of both saying their rank is 0

Only one node has local_rank == 0. The log is correct, I believe.
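
For reference, the two ranges in the log are consistent with splitting the cuts into equal-sized contiguous chunks, one per rank. Below is a minimal sketch of that idea. The function name `partition_range` and the total of 5567 cuts are assumptions chosen to reproduce the ranges printed above; this is not necessarily how sampling.py computes them.

```python
import math


def partition_range(total, world_size, rank):
    """Return the half-open [start, end) index range owned by `rank`.

    Hypothetical helper: each rank gets ceil(total / world_size) items,
    and the last rank takes whatever is left.
    """
    per_partition = math.ceil(total / world_size)
    start = rank * per_partition
    end = min(start + per_partition, total)
    return start, end


# Assumed total, picked so the result matches the ranges in the log.
TOTAL_CUTS = 5567

for rank in range(2):
    print(f"rank {rank}: cut IDs range {partition_range(TOTAL_CUTS, 2, rank)}")
```

Under these assumptions, rank 0 gets (0, 2784) and rank 1 gets (2784, 5567), so each process reporting a different rank and a different range is what a correct two-way split should look like.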


> BTW, I did notice that both processes were writing output to the screen, which I'm not sure is the intended behavior.

Yes, the two processes execute identical code, so both of them write to the console. I can change the log message to include their rank information so you know which log is from which node.
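
One possible way to tag each log line with the rank, sketched with the standard `logging` module only. The helper name `make_rank_logger` and the `quiet_nonzero` option are hypothetical and are not taken from the snowfall code; the rank is assumed to be passed in by whatever launches the worker processes.

```python
import logging


def make_rank_logger(rank, name="train", quiet_nonzero=False):
    """Create a logger whose messages are prefixed with the process rank.

    Hypothetical helper. If `quiet_nonzero` is True, ranks other than 0
    only emit warnings and errors, so the console is not duplicated.
    """
    logger = logging.getLogger(f"{name}.rank{rank}")
    logger.propagate = False  # avoid duplicate output via the root logger
    if quiet_nonzero and rank != 0:
        logger.setLevel(logging.WARNING)
    else:
        logger.setLevel(logging.INFO)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter(
                f"%(asctime)s %(levelname)s [rank {rank}] "
                "[%(filename)s:%(lineno)d] %(message)s"
            )
        )
        logger.addHandler(handler)
    return logger


# Usage: in a real run each process would call this once with its own rank.
for rank in range(2):
    make_rank_logger(rank).info("About to create model")
```

With `quiet_nonzero=True`, only rank 0 would print INFO messages, which is one option if duplicate console output is not wanted.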

danpovey commented 3 years ago

FYI, if you get an error like this when trying to run in pdb:

TypeError: spawn() got an unexpected keyword argument '__spec__'

it seems it can be resolved by adding the following line as a workaround:

__spec__ = None
if __name__ == '__main__':
    main()

I don't know whether this will break something else. I'm running this with 1 job.