Thank you, I appreciate your feedback.
1) Yes, there was a large change to the Massiv API. Day 4 was written with massiv-0.3.2.1, whereas Day 5 already relies on massiv-0.4.4.0 (see the comments in https://github.com/masterdezign/10-days-of-grad/blob/master/day5/src/NeuralNetwork.hs#L3).
There are several possible ways to resolve the compilation issue:
I don't think I will have time to upgrade the code to the latest Massiv. But pull requests are very welcome ;)
2) That is correct, Day 5 needs some profiling. Again, it depends on whether I have some extra time.
The posts are intended to be educational, so the code can be further optimized for actual applications. Also, in some future posts I will have to adapt batch normalization from Day 4 to the latest Massiv.
Thanks. I'll have a look to see how easy it would be.
You might also add https://crypto.stanford.edu/~blynn/haskell/brain.html to your bibliography. It runs fast enough for a demo using only Haskell lists.
Hmmm. I haven't found any massiv notes that explain what changes are needed to get from 0.3 to 0.4. Since one of the changes appears to alter what some of the basic multiplication and addition operators do on Arrays, it is hard to make the mechanical changes that are needed, particularly in the absence of documentation specifying what some of the functions should do (e.g., are we multiplying point-wise, or is it matrix multiplication?).
massiv seems to be a great library in terms of the sheer breadth of its functionality, but it needs a simple wrapper or wrappers for some of the main use cases (e.g., it might provide a drop-in replacement for Data.Array, leaving the detailed APIs open for those who need that last ounce of performance). That might help adoption; otherwise, the learning curve to get basic things running is quite steep.
Alexey and I discussed this issue in massiv's channel: https://gitter.im/haskell-massiv/Lobby.
The `Data.Massiv.Array.Numeric` API was not stable, so it was changed in v0.4.
@jrp2014 Just to help you, all operations like `.-`, `.+`, etc. were point-wise subtraction, addition, etc. The changes between 0.3 and 0.4:
1) Naming (e.g. `.-` becomes `.-.`)
2) Operations become monadic. Namely, you are encouraged to use the `Maybe` monad. This was done to prevent e.g. the addition of two arrays with different numbers of elements.
3) You need delayed arrays to perform the operations (for the details please ask Alexey directly).
For example, see v0.4 https://github.com/masterdezign/10-days-of-grad/blob/master/day5/src/NeuralNetwork.hs#L176:

```haskell
x + y = maybe (error $ "Dimension mismatch " ++ show (size x, size y)) compute (delay x .+. delay y)
```

This is equivalent to v0.3's `x .+ y`.
Similarly, matrix multiplication `|*|` in v0.4 would look like

```haskell
maybe (error $ "Dimension mismatch " ++ show (size x, size w)) id (x |*| w)
```

See https://github.com/masterdezign/10-days-of-grad/blob/master/day5/src/NeuralNetwork.hs#L355
You can also use a throwing monad (any `MonadThrow` instance) to sequence several operations, see this example.
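For instance, here is a minimal sketch of sequencing two point-wise operations in the `Maybe` monad, assuming the massiv-0.4 signatures quoted above (`addMul` is a hypothetical helper, not part of the repository):

```haskell
import Data.Massiv.Array

-- Sketch only: compute (x + y) * z point-wise under massiv-0.4,
-- where .+. and .*. take delayed arrays and return a MonadThrow result.
-- A shape mismatch yields Nothing instead of a runtime error.
addMul
  :: Array U Ix1 Float
  -> Array U Ix1 Float
  -> Array U Ix1 Float
  -> Maybe (Array U Ix1 Float)
addMul x y z = do
  s <- delay x .+. delay y  -- Nothing on size mismatch
  p <- s .*. delay z
  pure (compute p)
```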
The upgrade to v0.4 is a bit tedious indeed. I hope the notes above will be useful.
Thanks. That is helpful. I’ll have another look.
Thanks. Those hints got Day 4 compiling (and seemingly working). In fact, instead of using `x .+ y`, it seems I can use `delay x + delay y`. I suspect that some of the delays may be superfluous, but I'll do a bit of tidying and let you have a look.
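That works because massiv's delayed arrays (`Array D`) have a `Num` instance, so `(+)` is applied point-wise; unlike `.+.`, a shape mismatch then surfaces as a runtime error rather than a `Nothing`. A minimal sketch under that assumption:

```haskell
import Data.Massiv.Array

-- Sketch only: point-wise sum via the Num instance of delayed arrays.
-- No Maybe here; mismatched sizes raise a runtime error instead.
sumDelayed :: Array U Ix2 Float -> Array U Ix2 Float -> Array U Ix2 Float
sumDelayed x y = compute (delay x + delay y)
```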
Here's a first cut. https://github.com/jrp2014/10-days-of-grad/blob/master/day4/src/NeuralNetwork.hs
I need to remove some superfluous delays as a next step. Some results are below (I bailed out before it finished):
```
You are using an unsupported version of LLVM!
  Currently only 7 is supported. System LLVM version: 8.0.0
  We will try though...
Linking /home/jrp/Projects/neural/10-days-of-grad/day4/dist-newstyle/build/x86_64-linux/ghc-8.8.2/batchnorm-0.0.0/x/mnist/build/mnist/mnist ...
SGD + batchnorm
1 Training accuracy 96.3 Validation accuracy 95.8
2 Training accuracy 98.2 Validation accuracy 97.0
3 Training accuracy 99.0 Validation accuracy 97.6
4 Training accuracy 99.1 Validation accuracy 97.5
5 Training accuracy 99.4 Validation accuracy 97.6
6 Training accuracy 99.2 Validation accuracy 97.2
7 Training accuracy 99.6 Validation accuracy 97.8
8 Training accuracy 99.8 Validation accuracy 98.0
9 Training accuracy 99.9 Validation accuracy 98.1
10 Training accuracy 100.0 Validation accuracy 98.2
SGD
1 Training accuracy 11.6 Validation accuracy 11.8
2 Training accuracy 17.4 Validation accuracy 17.7
^C
 147,153,563,528 bytes allocated in the heap
   7,107,568,216 bytes copied during GC
     631,440,976 bytes maximum residency (123 sample(s))
      81,613,056 bytes maximum slop
             602 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     68518 colls, 68518 par   30.951s   6.288s     0.0001s    0.0159s
  Gen  1       123 colls,   122 par    1.164s   0.307s     0.0025s    0.0245s

  Parallel GC work balance: 86.64% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.001s elapsed)
  MUT     time 1560.093s  (435.363s elapsed)
  GC      time   32.115s  (  6.595s elapsed)
  EXIT    time    0.001s  (  0.002s elapsed)
  Total   time 1592.211s  (441.961s elapsed)

  Alloc rate    94,323,589 bytes per MUT second

  Productivity  98.0% of total user, 98.5% of total elapsed
```
I've now removed the superfluous delays and run with/without LLVM (on a faster machine):
With ghc only:

```
0.0.0/x/mnist/build/mnist/mnist ...
SGD + batchnorm
1 Training accuracy 94.4 Validation accuracy 93.8
2 Training accuracy 97.9 Validation accuracy 97.2
3 Training accuracy 98.5 Validation accuracy 97.4
4 Training accuracy 99.2 Validation accuracy 97.7
5 Training accuracy 99.4 Validation accuracy 97.8
6 Training accuracy 99.7 Validation accuracy 97.9
7 Training accuracy 99.6 Validation accuracy 97.7
8 Training accuracy 99.9 Validation accuracy 98.1
9 Training accuracy 100.0 Validation accuracy 98.1
10 Training accuracy 100.0 Validation accuracy 98.2
SGD
1 Training accuracy 18.9 Validation accuracy 18.9
2 Training accuracy 19.8 Validation accuracy 19.0
3 Training accuracy 29.2 Validation accuracy 27.5
4 Training accuracy 43.8 Validation accuracy 44.4
5 Training accuracy 55.1 Validation accuracy 55.3
6 Training accuracy 64.6 Validation accuracy 65.1
7 Training accuracy 77.6 Validation accuracy 78.2
8 Training accuracy 82.0 Validation accuracy 82.9
9 Training accuracy 85.5 Validation accuracy 86.0
10 Training accuracy 87.3 Validation accuracy 87.6
 220,309,387,248 bytes allocated in the heap
   7,443,232,256 bytes copied during GC
     634,843,888 bytes maximum residency (162 sample(s))
      80,073,480 bytes maximum slop
             605 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     50059 colls, 50059 par   54.752s   3.365s     0.0001s    0.0024s
  Gen  1       162 colls,   161 par    3.462s   0.294s     0.0018s    0.0541s

  Parallel GC work balance: 83.40% (serial 0%, perfect 100%)

  TASKS: 34 (1 bound, 33 peak workers (33 total), using -N16)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.006s elapsed)
  MUT     time 1952.410s  (173.649s elapsed)
  GC      time   58.214s  (  3.659s elapsed)
  EXIT    time    0.001s  (  0.006s elapsed)
  Total   time 2010.626s  (177.320s elapsed)

  Alloc rate    112,839,704 bytes per MUT second

  Productivity  97.1% of total user, 97.9% of total elapsed
```
With -fllvm:

```
⇒ ./run.sh
Up to date
SGD + batchnorm
1 Training accuracy 97.0 Validation accuracy 96.5
2 Training accuracy 97.9 Validation accuracy 97.1
3 Training accuracy 98.6 Validation accuracy 97.4
4 Training accuracy 99.1 Validation accuracy 97.6
5 Training accuracy 99.1 Validation accuracy 97.5
6 Training accuracy 99.7 Validation accuracy 97.9
7 Training accuracy 98.7 Validation accuracy 97.6
8 Training accuracy 99.6 Validation accuracy 98.0
9 Training accuracy 99.9 Validation accuracy 98.1
10 Training accuracy 100.0 Validation accuracy 98.2
SGD
1 Training accuracy 14.6 Validation accuracy 15.0
2 Training accuracy 17.8 Validation accuracy 18.1
3 Training accuracy 27.9 Validation accuracy 27.6
4 Training accuracy 36.9 Validation accuracy 36.0
5 Training accuracy 47.1 Validation accuracy 45.9
6 Training accuracy 57.1 Validation accuracy 58.5
7 Training accuracy 63.9 Validation accuracy 65.3
8 Training accuracy 72.5 Validation accuracy 73.4
9 Training accuracy 80.3 Validation accuracy 80.6
10 Training accuracy 83.1 Validation accuracy 83.4
 220,299,496,232 bytes allocated in the heap
   7,477,812,832 bytes copied during GC
     635,356,912 bytes maximum residency (162 sample(s))
      79,876,616 bytes maximum slop
             605 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     51052 colls, 51052 par   61.526s   3.782s     0.0001s    0.0026s
  Gen  1       162 colls,   161 par    3.339s   0.282s     0.0017s    0.0526s

  Parallel GC work balance: 84.17% (serial 0%, perfect 100%)

  TASKS: 34 (1 bound, 33 peak workers (33 total), using -N16)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.007s elapsed)
  MUT     time 1802.476s  (173.916s elapsed)
  GC      time   64.865s  (  4.063s elapsed)
  EXIT    time    0.000s  (  0.006s elapsed)
  Total   time 1867.342s  (177.992s elapsed)

  Alloc rate    122,220,493 bytes per MUT second

  Productivity  96.5% of total user, 97.7% of total elapsed
```
Conclusion: LLVM generates better code (as it should, with all the effort going into it), but we need a better algorithm. The one from blynn referenced above seems to get there much faster, using only lists rather than massiv. It must be doing something different.
Nice work.
Yes, it is possible that more optimizations are needed (e.g. how you stream the data). Careful profiling might reveal some bottlenecks. In my experience, sometimes a well-placed `INLINE` pragma can substantially accelerate the code.
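For illustration, a generic sketch of the idea (not a specific bottleneck in this code; `sigmoid` is just a representative hot function):

```haskell
-- A point-wise activation called inside a tight training loop.
-- Without the pragma, GHC may not inline it across module boundaries,
-- leaving a function call (and boxing) in the inner loop.
sigmoid :: Float -> Float
sigmoid x = 1 / (1 + exp (negate x))
{-# INLINE sigmoid #-}
```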
By the way, using `-N16` does not mean faster. Empirically, `-N6` often works faster. Can you compare with `-N6` and even with no parallelism?
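For example, assuming the binary was built with `-threaded -rtsopts` (the executable name follows the runs above):

```
$ ./mnist +RTS -N6 -s    # six capabilities
$ ./mnist +RTS -N1 -s    # effectively sequential, as a baseline
```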
I’ll have a play. But my first port of call would be to get an LLVM -O2 build of Massiv.
I’ll also try to figure out the difference between blynn’s and your algorithm, as that may make more difference than some micro optimisations.
I’ll see if I can lash up a version of GHC with the profiling libraries.
> I’ll also try to figure out the difference between blynn’s and your algorithm, as that may make more difference than some micro optimisations.

Blynn's network has a single hidden layer of 30 neurons, that is (784 + 1) * 30 + (30 + 1) * 10 = 23,860 trainable parameters. On the other hand, on Day 4 we use two hidden layers with 300 and 50 neurons, that is (784 + 1) * 300 + (300 + 1) * 50 + (50 + 1) * 10 = 251,060 trainable parameters. Therefore, it is not surprising that our network trains more slowly. Thus, I suggest that you compare apples to apples and report the performance of equivalent network architectures.
Thanks. That explains the difference! (Both pieces of code are intended to illustrate concepts accompanying explanatory articles, so there is no sense in "benchmarking" them against each other.)
I suppose that the question that arises is "Do you need as many trainable parameters as you are using?" or, more precisely, "how do you determine the best number of trainable parameters?". But I suspect that the answer to that is "experience".
Anyway, you are welcome to grab the code, or I can send you a PR, if you'd prefer.
With 30 neurons you will get a simpler model. I suspect that for the MNIST task your accuracy would not be as high.
You can definitely send a PR provided that your commit introduces minimal changes.
204980b80b5bdc14faa9fc3169897606a1c2a252 fixes the compilation issue.
Feel free to open a separate issue for Day 5 speed optimization if that is still relevant.
@jrp2014 You might be pleased to hear that the code runs faster (especially with ghc 8.8.1). See for example this branch for the 30-neuron version.
Can you confirm on your machine?
Hi, I've got ghc 8.8.2. On my test rig, I got, for massiv 0.4.4.0:
```
SGD + batchnorm
1 Training accuracy 96.4 Validation accuracy 96.0
2 Training accuracy 98.2 Validation accuracy 97.1
3 Training accuracy 98.9 Validation accuracy 97.6
4 Training accuracy 99.2 Validation accuracy 97.7
5 Training accuracy 98.9 Validation accuracy 97.5
6 Training accuracy 99.7 Validation accuracy 98.0
7 Training accuracy 99.8 Validation accuracy 98.1
8 Training accuracy 100.0 Validation accuracy 98.3
9 Training accuracy 100.0 Validation accuracy 98.4
10 Training accuracy 100.0 Validation accuracy 98.5
SGD
1 Training accuracy 15.2 Validation accuracy 15.8
2 Training accuracy 25.6 Validation accuracy 25.6
3 Training accuracy 26.5 Validation accuracy 26.5
4 Training accuracy 39.4 Validation accuracy 39.8
5 Training accuracy 53.4 Validation accuracy 53.1
6 Training accuracy 67.2 Validation accuracy 67.5
7 Training accuracy 74.9 Validation accuracy 75.6
8 Training accuracy 78.5 Validation accuracy 79.3
9 Training accuracy 81.8 Validation accuracy 82.4
10 Training accuracy 86.4 Validation accuracy 86.5
 212,568,448,192 bytes allocated in the heap
   7,195,483,840 bytes copied during GC
     631,374,952 bytes maximum residency (164 sample(s))
      81,860,656 bytes maximum slop
             602 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     96491 colls, 96491 par   38.410s   7.564s     0.0001s    0.0170s
  Gen  1       164 colls,   163 par    1.183s   0.310s     0.0019s    0.0202s

  Parallel GC work balance: 85.82% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.001s elapsed)
  MUT     time 2530.638s  (699.896s elapsed)
  GC      time   39.593s  (  7.874s elapsed)
  EXIT    time    0.002s  (  0.009s elapsed)
  Total   time 2570.235s  (707.781s elapsed)

  Alloc rate    83,997,967 bytes per MUT second

  Productivity  98.5% of total user, 98.9% of total elapsed
```
I then changed to massiv 0.4.5.0 and got:
```
SGD + batchnorm
1 Training accuracy 96.5 Validation accuracy 95.9
2 Training accuracy 98.3 Validation accuracy 97.4
3 Training accuracy 98.8 Validation accuracy 97.5
4 Training accuracy 99.2 Validation accuracy 97.8
5 Training accuracy 98.9 Validation accuracy 97.4
6 Training accuracy 99.2 Validation accuracy 97.7
7 Training accuracy 99.8 Validation accuracy 98.1
8 Training accuracy 100.0 Validation accuracy 98.1
9 Training accuracy 100.0 Validation accuracy 98.2
10 Training accuracy 100.0 Validation accuracy 98.3
SGD
1 Training accuracy 11.2 Validation accuracy 11.3
2 Training accuracy 11.3 Validation accuracy 11.4
3 Training accuracy 13.5 Validation accuracy 13.3
4 Training accuracy 42.5 Validation accuracy 42.7
5 Training accuracy 62.8 Validation accuracy 63.4
6 Training accuracy 64.0 Validation accuracy 64.4
7 Training accuracy 77.6 Validation accuracy 77.7
8 Training accuracy 84.6 Validation accuracy 84.9
9 Training accuracy 86.6 Validation accuracy 87.0
10 Training accuracy 87.7 Validation accuracy 88.0
 212,565,687,504 bytes allocated in the heap
   7,203,143,048 bytes copied during GC
     631,629,448 bytes maximum residency (163 sample(s))
      83,386,304 bytes maximum slop
             602 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     97080 colls, 97080 par   38.523s   7.856s     0.0001s    0.0160s
  Gen  1       163 colls,   162 par    1.353s   0.374s     0.0023s    0.0249s

  Parallel GC work balance: 85.58% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time 2538.648s  (700.728s elapsed)
  GC      time   39.876s  (  8.230s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time 2578.528s  (708.961s elapsed)

  Alloc rate    83,731,853 bytes per MUT second

  Productivity  98.5% of total user, 98.8% of total elapsed
```
2570s to 2579s seems a wash, unless I have misconfigured something.
There seem to be several different approaches to these tutorial examples:
These approaches are hard to benchmark, for the reasons that you have mentioned (the design parameters of the net). But it would be good to do some further comparisons.
Ah OK, I see that you use massiv 0.4.5.0 below. So there was no speed improvement for ghc 8.8.2 then.
Thank you for the information!
Many thanks for putting together such a great series.
I am using the latest massiv, plain cabal (v3) rather than stack in the run files, and the master branch, with ghc 8.8.1.
Day 4 doesn't compile; probably a massiv API change? (see below for a sample)
Day 5 compiles OK (after adding zlib to the Linux environment), but it takes tens of minutes to produce a result (I didn't wait beyond the first result of 30).
Looking forward to the next 5 days!
```
src/NeuralNetwork.hs:425:38: error:
    • Couldn't match expected type ‘Float’
                  with actual type ‘Array D Ix1 Float’
    • In the second argument of ‘(.-)’, namely ‘lr_scale dGamma’
      In the second argument of ‘($)’, namely ‘gamma .- lr_scale dGamma’
      In the expression: (compute $ gamma .- lr_scale dGamma)
    |
425 |       gamma' = (compute $ gamma .- lr_scale dGamma)
    |                           ^^^^^^^^^^^^^^^^^^

src/NeuralNetwork.hs:426:36: error:
    • Couldn't match expected type ‘Float’
                  with actual type ‘Array D Ix1 Float’
    • In the second argument of ‘(.-)’, namely ‘lr_scale dBeta’
      In the second argument of ‘($)’, namely ‘beta .- lr_scale dBeta’
      In the expression: (compute $ beta .- lr_scale dBeta)
    |
426 |       beta' = (compute $ beta .- lr_scale dBeta)
    |                          ^^^^^^^^^^^^^^^^^

src/NeuralNetwork.hs:525:27: error:
    • Couldn't match type ‘Array D Ix2 Float’ with ‘Float’
      Expected type: Float
        Actual type: MatrixPrim D Float
    • In the second argument of ‘(.-)’, namely ‘mu’
      In the first argument of ‘(.^)’, namely ‘(ar .- mu)’
      In the second argument of ‘($)’, namely ‘(ar .- mu) .^ 2’
    |
525 |       r0 = compute $ (ar .- mu) .^ 2
    |                           ^^
```