snuspl / parallax

A Tool for Automatic Parallelization of Deep Learning Training in Distributed Multi-GPU Environments.
Apache License 2.0

[Parallax-6] Add Hybrid Communication #11

Closed sj6077 closed 6 years ago

sj6077 commented 6 years ago

GitHub issue: #6

Major changes:

Minor changes to note:

Tests for the changes:

Other comments:

Hybrid throughput (M1 / M4):
ResNet-50 (images/sec): 1k / 4.0k
LM (words/sec): 99.6k / 244k
NMT (words/sec): 41.82k / 127k

Previous Hybrid throughput (M1 / M4):
ResNet-50 (images/sec): 1k / 3.9k
LM (words/sec): 91.7k / 255k
NMT (words/sec): 40.5k / 123k
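For a quick read of the throughput figures above, here is a minimal sketch that computes the relative change from the previous Hybrid numbers to the current ones. The dictionary layout and the `relative_change` helper are illustrative assumptions, not part of Parallax; the values are copied from this comment, in thousands per second.

```python
# Throughput copied from the comment, in thousands per second.
# Each tuple is (M1, M4), matching the order used in the comment.
current = {
    "ResNet-50": (1.0, 4.0),
    "LM": (99.6, 244.0),
    "NMT": (41.82, 127.0),
}
previous = {
    "ResNet-50": (1.0, 3.9),
    "LM": (91.7, 255.0),
    "NMT": (40.5, 123.0),
}


def relative_change(new, old):
    """Percent change from old to new (hypothetical helper)."""
    return (new - old) / old * 100.0


# Per-model (M1, M4) percent change relative to the previous run.
changes = {
    model: tuple(
        relative_change(new, old)
        for new, old in zip(current[model], previous[model])
    )
    for model in current
}

for model, (m1, m4) in changes.items():
    print(f"{model}: M1 {m1:+.1f}%, M4 {m4:+.1f}%")
```

On these figures the only regression is LM on M4 (244k versus 255k); every other entry is flat or slightly higher.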

LM-1B convergence with Hybrid (M6) [(0, 793384.0945), (866, 308.3383), (1732, 150.8329), (2598, 106.3734), (3464, 87.6624), (4330, 77.8357), (5196, 71.9195), (6062, 67.8373), (6928, 64.8923), (7794, 62.6418), (8660, 60.7753), (9526, 59.3207), (10392, 58.1032), (11258, 57.0601), (12124, 56.131), (12990, 55.3591), (13856, 54.66), (14722, 54.0232), (15588, 53.4404), (16454, 52.9073), (17320, 52.4379), (18186, 51.9696), (19052, 51.5936), (19918, 51.3194), (20784, 51.0416), (21650, 50.7354), (22516, 50.4531), (23382, 50.1908), (24248, 49.9236), (25114, 49.6646), (25980, 49.4715), (26846, 49.2843), (27712, 49.0637), (28578, 48.9052)]

Previous LM-1B convergence with Hybrid (M6) perplexity_HYBRID = [(866, 319.2579), (1732, 155.7456), (2598, 109.1768), (3464, 89.4128), (4330, 79.0635), (5196, 72.9961), (6062, 68.6772), (6928, 65.536), (7794, 63.1335), (8660, 61.1923), (9526, 59.5879), (10392, 58.2855), (11258, 57.1746), (12124, 56.2273), (12990, 55.3832), (13856, 54.7883), (14722, 54.1316), (15588, 53.7006), (16454, 53.1228), (17320, 52.6287), (18186, 52.1844), (19052, 51.7674), (19918, 51.5054), (20784, 51.2635), (21650, 50.9522), (22516, 50.6598), (23382, 50.3806), (24248, 50.0781), (25114, 49.8031), (25980, 49.565), (26846, 49.3686), (27712, 49.1397), (28578, 48.9546)]
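To compare the two convergence runs at matching checkpoints, a small sketch like the following may help. It uses only the last three points of each list above and assumes each pair is (training step, perplexity); the variable names `hybrid_tail`, `previous_tail`, and `diffs` are illustrative.

```python
# Last three checkpoints from each run, copied from the lists above.
# Assumption: each pair is (training step, perplexity).
hybrid_tail = [(26846, 49.2843), (27712, 49.0637), (28578, 48.9052)]
previous_tail = [(26846, 49.3686), (27712, 49.1397), (28578, 48.9546)]

# Index the previous run by step so checkpoints are matched explicitly.
previous_by_step = dict(previous_tail)

# Perplexity difference (current minus previous) at each shared step;
# a negative value means the current run has lower perplexity there.
diffs = {
    step: ppl - previous_by_step[step]
    for step, ppl in hybrid_tail
    if step in previous_by_step
}

for step, delta in diffs.items():
    print(f"step {step}: {delta:+.4f}")
```

At these three checkpoints the current run is lower by less than 0.1 perplexity, so the two curves are very close.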

resolves #6

pigbug419 commented 6 years ago

Could you update the explanation of run_option in parallax_api.md?

sj6077 commented 6 years ago

@pigbug419 I addressed your comments. Can you update the README and merge it?

pigbug419 commented 6 years ago

I already updated the figure in the README. LGTM!