penkovsky / 10-days-of-grad

Neural Networks and Deep Learning
http://penkovsky.com/neural-networks

Day 4 does not compile, Day 5 v slow #1

Closed jrp2014 closed 4 years ago

jrp2014 commented 4 years ago

Many thanks for putting together such a great series.

I am using the latest massiv, plain cabal (version 3) rather than stack in the run files, and the master branch, with ghc 8.8.1.

Day 4 doesn't compile; probably a massiv API change? (see below for a sample)

Day 5 compiles OK (after adding zlib to the linux environment), but it takes tens of minutes to produce a result (I didn't wait beyond the first result of 30).

Looking forward to the next 5 days!

src/NeuralNetwork.hs:425:38: error:
    • Couldn't match expected type ‘Float’ with actual type ‘Array D Ix1 Float’
    • In the second argument of ‘(.-)’, namely ‘lr _scale dGamma’
      In the second argument of ‘($)’, namely ‘gamma .- lr _scale dGamma’
      In the expression: (compute $ gamma .- lr _scale dGamma)
    |
425 |       gamma' = (compute $ gamma .- lr _scale dGamma)
    |                                    ^^^^^^^^^^^^^^^^^^

src/NeuralNetwork.hs:426:36: error:
    • Couldn't match expected type ‘Float’ with actual type ‘Array D Ix1 Float’
    • In the second argument of ‘(.-)’, namely ‘lr _scale dBeta’
      In the second argument of ‘($)’, namely ‘beta .- lr _scale dBeta’
      In the expression: (compute $ beta .- lr _scale dBeta)
    |
426 |       beta' = (compute $ beta .- lr _scale dBeta)
    |                                  ^^^^^^^^^^^^^^^^^

src/NeuralNetwork.hs:525:27: error:
    • Couldn't match type ‘Array D Ix2 Float’ with ‘Float’
      Expected type: Float
        Actual type: MatrixPrim D Float
    • In the second argument of ‘(.-)’, namely ‘mu’
      In the first argument of ‘(.^)’, namely ‘(ar .- mu)’
      In the second argument of ‘($)’, namely ‘(ar .- mu) .^ 2’
    |
525 |       r0 = compute $ (ar .- mu) .^ 2
    |                            ^^

masterdezign commented 4 years ago

Thank you, I appreciate your feedback.

1) Yes, there was a large change to the Massiv API. Day 4 was written with massiv-0.3.2.1, whereas Day 5 already relies on massiv-0.4.4.0 (see the comments in https://github.com/masterdezign/10-days-of-grad/blob/master/day5/src/NeuralNetwork.hs#L3).

There are several possible ways to resolve the compilation issue, e.g. building against the massiv version Day 4 was written for (0.3.2.1) rather than the latest one, or porting the Day 4 code to the newer API.

I don't think I will have time to upgrade the code to the latest Massiv. But pull requests are very welcome ;)

2) That is correct, Day 5 needs some profiling. Again, it depends on whether I have some extra time.

The posts are intended to be educational, so the code can be further optimized for real applications. Also, in some future post I will have to adapt the batch normalization from Day 4 to the latest Massiv.

jrp2014 commented 4 years ago

Thanks. I'll have a look to see how easy it would be.

You might also add https://crypto.stanford.edu/~blynn/haskell/brain.html to your bibliography. It runs fast enough for a demo using only Haskell lists.

jrp2014 commented 4 years ago

Hmmm. I haven't found any massiv notes that explain what changes are needed to get from 0.3 to 0.4. Since one of the changes appears to be to what some of the basic multiplication and addition operators do between Arrays, it is hard to make purely mechanical changes (which is what is needed), particularly in the absence of documentation specifying what some of the functions should do (e.g., are we multiplying point-wise, or is it matrix multiplication?).

massiv seems to be a great library in terms of the sheer breadth of its functionality, but it needs a simple wrapper (or wrappers) for some of the main use cases; for example, it might just provide a drop-in replacement for Data.Array, leaving the detailed APIs open for those who need that last ounce of performance. That might help adoption; otherwise the learning curve to get basic stuff running is quite steep.
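
To illustrate the kind of wrapper I mean, here is a rough sketch (the type and function names are hypothetical, not anything massiv actually ships): fix the element type and representation once, and expose total element-wise operators so newcomers don't have to touch the full API.

import Data.Massiv.Array

-- Hypothetical convenience layer on top of massiv 0.4.
type M = Array U Ix2 Float

-- Total element-wise addition: calls error on a size mismatch instead of
-- returning the result in Maybe, which is closer to what a beginner expects.
plainAdd :: M -> M -> M
plainAdd x y = maybe (error "size mismatch") compute (delay x .+. delay y)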

masterdezign commented 4 years ago

We discussed this issue with Alexey in massiv's channel: https://gitter.im/haskell-massiv/Lobby. The Data.Massiv.Array.Numeric API was not stable, so it was changed in v0.4.

masterdezign commented 4 years ago

@jrp2014 Just to help you: all operations like .-, .+, etc. were point-wise subtraction, addition, and so on. The changes between 0.3 and 0.4 are:

1) Naming (e.g. .- becomes .-.).
2) Operations become monadic. Namely, you are encouraged to use the Maybe monad. This was done to prevent, e.g., the addition of two arrays with different numbers of elements.
3) You need delayed arrays to perform the operations (for the details, please ask Alexey directly).

For example, see v0.4 https://github.com/masterdezign/10-days-of-grad/blob/master/day5/src/NeuralNetwork.hs#L176:

x + y       = maybe (error $ "Dimension mismatch " ++ show (size x, size y)) compute (delay x .+. delay y)

This is equivalent to v0.3 x .+ y.

masterdezign commented 4 years ago

Similarly, matrix multiplication |*| in v0.4 would look like

maybe (error $ "Dimension mismatch " ++ show (size x, size w)) id (x |*| w)

See https://github.com/masterdezign/10-days-of-grad/blob/master/day5/src/NeuralNetwork.hs#L355

You can also use a throwing monad (MonadThrow) to sequence several operations, see this example.
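
For instance, a minimal sketch of chaining two operations in the Maybe monad (the function name and the U representation are my assumptions for illustration): both the point-wise and the matrix operations can fail, and the single Maybe result collects the dimension checks.

import Data.Massiv.Array

-- Add two matrices point-wise, then multiply the result by a weight matrix.
-- Nothing is returned if either the shapes or the inner dimensions mismatch.
affineStep :: Array U Ix2 Float -> Array U Ix2 Float -> Array U Ix2 Float -> Maybe (Array U Ix2 Float)
affineStep x y w = do
  s <- delay x .+. delay y   -- point-wise addition
  compute s |*| w            -- matrix multiplication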

The upgrade to v0.4 is a bit tedious indeed. I hope the notes above will be useful.

jrp2014 commented 4 years ago

Thanks. That is helpful. I’ll have another look.

jrp2014 commented 4 years ago

Thanks. Those hints got Day 4 compiling (and seemingly working). In fact, instead of using x .+ y, it seems I can use delay x + delay y. I suspect that some of the delays may be superfluous, but I'll do a bit of tidying and then let you have a look.
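
For the record, a tiny sketch of that alternative (the function name is mine): it relies on the Num instance for delayed arrays, which is what makes delay x + delay y work, but unlike .+. it returns no Maybe, so a shape mismatch only shows up at run time.

import Data.Massiv.Array

-- Element-wise addition via the Num instance on delayed arrays.
addD :: Array U Ix2 Float -> Array U Ix2 Float -> Array U Ix2 Float
addD x y = compute (delay x + delay y)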

jrp2014 commented 4 years ago

Here's a first cut. https://github.com/jrp2014/10-days-of-grad/blob/master/day4/src/NeuralNetwork.hs

I need to remove some superfluous delays as a next step. Some results are below (I bailed out before it finished).

You are using an unsupported version of LLVM!
Currently only 7 is supported. System LLVM version: 8.0.0
We will try though...
Linking /home/jrp/Projects/neural/10-days-of-grad/day4/dist-newstyle/build/x86_64-linux/ghc-8.8.2/batchnorm-0.0.0/x/mnist/build/mnist/mnist ...
SGD + batchnorm
1 Training accuracy 96.3 Validation accuracy 95.8
2 Training accuracy 98.2 Validation accuracy 97.0
3 Training accuracy 99.0 Validation accuracy 97.6
4 Training accuracy 99.1 Validation accuracy 97.5
5 Training accuracy 99.4 Validation accuracy 97.6
6 Training accuracy 99.2 Validation accuracy 97.2
7 Training accuracy 99.6 Validation accuracy 97.8
8 Training accuracy 99.8 Validation accuracy 98.0
9 Training accuracy 99.9 Validation accuracy 98.1
10 Training accuracy 100.0 Validation accuracy 98.2
SGD
1 Training accuracy 11.6 Validation accuracy 11.8
2 Training accuracy 17.4 Validation accuracy 17.7
^C
 147,153,563,528 bytes allocated in the heap
   7,107,568,216 bytes copied during GC
     631,440,976 bytes maximum residency (123 sample(s))
      81,613,056 bytes maximum slop
             602 MB total memory in use (0 MB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause

Gen  0     68518 colls, 68518 par   30.951s   6.288s     0.0001s    0.0159s
Gen  1       123 colls,   122 par    1.164s   0.307s     0.0025s    0.0245s

Parallel GC work balance: 86.64% (serial 0%, perfect 100%)

TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.002s  (  0.001s elapsed)
MUT     time 1560.093s  (435.363s elapsed)
GC      time   32.115s  (  6.595s elapsed)
EXIT    time    0.001s  (  0.002s elapsed)
Total   time 1592.211s  (441.961s elapsed)

Alloc rate 94,323,589 bytes per MUT second

Productivity 98.0% of total user, 98.5% of total elapsed

jrp2014 commented 4 years ago

I've now removed the superfluous delays and run with and without LLVM (on a faster machine):

With ghc only:

0.0.0/x/mnist/build/mnist/mnist ...
SGD + batchnorm
1 Training accuracy 94.4 Validation accuracy 93.8
2 Training accuracy 97.9 Validation accuracy 97.2
3 Training accuracy 98.5 Validation accuracy 97.4
4 Training accuracy 99.2 Validation accuracy 97.7
5 Training accuracy 99.4 Validation accuracy 97.8
6 Training accuracy 99.7 Validation accuracy 97.9
7 Training accuracy 99.6 Validation accuracy 97.7
8 Training accuracy 99.9 Validation accuracy 98.1
9 Training accuracy 100.0 Validation accuracy 98.1
10 Training accuracy 100.0 Validation accuracy 98.2
SGD
1 Training accuracy 18.9 Validation accuracy 18.9
2 Training accuracy 19.8 Validation accuracy 19.0
3 Training accuracy 29.2 Validation accuracy 27.5
4 Training accuracy 43.8 Validation accuracy 44.4
5 Training accuracy 55.1 Validation accuracy 55.3
6 Training accuracy 64.6 Validation accuracy 65.1
7 Training accuracy 77.6 Validation accuracy 78.2
8 Training accuracy 82.0 Validation accuracy 82.9
9 Training accuracy 85.5 Validation accuracy 86.0
10 Training accuracy 87.3 Validation accuracy 87.6
 220,309,387,248 bytes allocated in the heap
   7,443,232,256 bytes copied during GC
     634,843,888 bytes maximum residency (162 sample(s))
      80,073,480 bytes maximum slop
             605 MB total memory in use (0 MB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause

Gen  0     50059 colls, 50059 par   54.752s   3.365s     0.0001s    0.0024s
Gen  1       162 colls,   161 par    3.462s   0.294s     0.0018s    0.0541s

Parallel GC work balance: 83.40% (serial 0%, perfect 100%)

TASKS: 34 (1 bound, 33 peak workers (33 total), using -N16)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.001s  (  0.006s elapsed)
MUT     time 1952.410s  (173.649s elapsed)
GC      time   58.214s  (  3.659s elapsed)
EXIT    time    0.001s  (  0.006s elapsed)
Total   time 2010.626s  (177.320s elapsed)

Alloc rate 112,839,704 bytes per MUT second

Productivity 97.1% of total user, 97.9% of total elapsed

With -fllvm:

⇒ ./run.sh
Up to date
SGD + batchnorm
1 Training accuracy 97.0 Validation accuracy 96.5
2 Training accuracy 97.9 Validation accuracy 97.1
3 Training accuracy 98.6 Validation accuracy 97.4
4 Training accuracy 99.1 Validation accuracy 97.6
5 Training accuracy 99.1 Validation accuracy 97.5
6 Training accuracy 99.7 Validation accuracy 97.9
7 Training accuracy 98.7 Validation accuracy 97.6
8 Training accuracy 99.6 Validation accuracy 98.0
9 Training accuracy 99.9 Validation accuracy 98.1
10 Training accuracy 100.0 Validation accuracy 98.2
SGD
1 Training accuracy 14.6 Validation accuracy 15.0
2 Training accuracy 17.8 Validation accuracy 18.1
3 Training accuracy 27.9 Validation accuracy 27.6
4 Training accuracy 36.9 Validation accuracy 36.0
5 Training accuracy 47.1 Validation accuracy 45.9
6 Training accuracy 57.1 Validation accuracy 58.5
7 Training accuracy 63.9 Validation accuracy 65.3
8 Training accuracy 72.5 Validation accuracy 73.4
9 Training accuracy 80.3 Validation accuracy 80.6
10 Training accuracy 83.1 Validation accuracy 83.4
 220,299,496,232 bytes allocated in the heap
   7,477,812,832 bytes copied during GC
     635,356,912 bytes maximum residency (162 sample(s))
      79,876,616 bytes maximum slop
             605 MB total memory in use (0 MB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause

Gen  0     51052 colls, 51052 par   61.526s   3.782s     0.0001s    0.0026s
Gen  1       162 colls,   161 par    3.339s   0.282s     0.0017s    0.0526s

Parallel GC work balance: 84.17% (serial 0%, perfect 100%)

TASKS: 34 (1 bound, 33 peak workers (33 total), using -N16)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.001s  (  0.007s elapsed)
MUT     time 1802.476s  (173.916s elapsed)
GC      time   64.865s  (  4.063s elapsed)
EXIT    time    0.000s  (  0.006s elapsed)
Total   time 1867.342s  (177.992s elapsed)

Alloc rate 122,220,493 bytes per MUT second

Productivity 96.5% of total user, 97.7% of total elapsed

Conclusion: LLVM generates better code (as it should, with all the effort going into it), but we need a better algorithm. The one from blynn referenced above seems to get there much faster, using only lists rather than massiv. It must be doing something different.

masterdezign commented 4 years ago

Nice work.

Yes, it is possible that more optimizations are needed (e.g. in how you stream the data). Careful profiling might reveal some bottlenecks. In my experience, a well-placed INLINE pragma can sometimes substantially accelerate the code.
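
For example (a generic illustration rather than a known bottleneck in this code), marking a small, frequently applied function with an INLINE pragma lets GHC inline it at call sites and fuse it with the surrounding array code:

-- A hypothetical hot spot: a scalar activation mapped over every element.
sigmoid :: Float -> Float
sigmoid x = 1 / (1 + exp (negate x))
{-# INLINE sigmoid #-}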

masterdezign commented 4 years ago

By the way, using -N16 does not necessarily mean faster. Empirically, -N6 often works faster. Can you compare with -N6 and even with no parallelism?

jrp2014 commented 4 years ago

I’ll have a play. But my first port of call would be to get an LLVM -O2 build of Massiv.

I’ll also try to figure out the difference between blynn’s and your algorithm, as that may make more difference than some micro optimisations.

I’ll see if I can lash up a version of GHC with the profiling libraries.

masterdezign commented 4 years ago

I’ll also try to figure out the difference between blynn’s and your algorithm, as that may make more difference than some micro optimisations.

  1. Well, they use 30 neurons and have only a single hidden layer. That is (784 + 1) * 30 + (30 + 1) * 10 = 23,860 trainable parameters. On the other hand, on Day 4 we use two hidden layers with 300 and 50 neurons, that is (784 + 1) * 300 + (300 + 1) * 50 + (50 + 1) * 10 = 251,060 trainable parameters (see the sketch below). Therefore, it is not surprising that our network trains more slowly.
  2. In one of two cases we also apply batch normalization, which adds a bit of complexity.

Thus, I suggest comparing apples to apples and reporting the performance of equivalent network architectures.
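
To make the comparison concrete, here is a small helper (my own, not taken from either code base) that counts the trainable parameters of a fully connected network with one bias per neuron:

-- Layer sizes include the input layer; a layer with nin inputs and nout
-- outputs contributes (nin + 1) * nout parameters (weights plus biases).
paramCount :: [Int] -> Int
paramCount sizes = sum [ (nin + 1) * nout | (nin, nout) <- zip sizes (tail sizes) ]

-- paramCount [784, 30, 10]      == 23860    (the single-hidden-layer network)
-- paramCount [784, 300, 50, 10] == 251060   (the Day 4 network)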

jrp2014 commented 4 years ago

Thanks. That explains the difference! (Both pieces of code are intended to illustrate concepts, accompanying explanatory articles, so there is no sense in "benchmarking" them against each other.)

I suppose that the question that arises is "Do you need as many trainable parameters as you are using?" or, more precisely, "how do you determine the best number of trainable parameters?". But I suspect that the answer to that is "experience".

Anyway, you are welcome to grab the code, or I can send you a PR, if you'd prefer.

masterdezign commented 4 years ago

With 30 neurons you will get a simpler model. I suspect that for the MNIST task your accuracy would not be as high.

You can definitely send a PR, provided that your commit introduces minimal changes.

masterdezign commented 4 years ago

Commit 204980b80b5bdc14faa9fc3169897606a1c2a252 fixes the compilation issue.

Feel free to open a separate issue for Day 5 speed optimization if that is still relevant.

masterdezign commented 4 years ago

@jrp2014 You might be pleased to hear that the code now runs faster (especially with ghc 8.8.1). See, for example, this branch for the 30-neuron version.

Can you confirm on your machine?

jrp2014 commented 4 years ago

Hi, I've got ghc 8.8.2. On my test rig, with massiv 0.4.4.0, I got:

SGD + batchnorm
1 Training accuracy 96.4 Validation accuracy 96.0
2 Training accuracy 98.2 Validation accuracy 97.1
3 Training accuracy 98.9 Validation accuracy 97.6
4 Training accuracy 99.2 Validation accuracy 97.7
5 Training accuracy 98.9 Validation accuracy 97.5
6 Training accuracy 99.7 Validation accuracy 98.0
7 Training accuracy 99.8 Validation accuracy 98.1
8 Training accuracy 100.0 Validation accuracy 98.3
9 Training accuracy 100.0 Validation accuracy 98.4
10 Training accuracy 100.0 Validation accuracy 98.5
SGD
1 Training accuracy 15.2 Validation accuracy 15.8
2 Training accuracy 25.6 Validation accuracy 25.6
3 Training accuracy 26.5 Validation accuracy 26.5
4 Training accuracy 39.4 Validation accuracy 39.8
5 Training accuracy 53.4 Validation accuracy 53.1
6 Training accuracy 67.2 Validation accuracy 67.5
7 Training accuracy 74.9 Validation accuracy 75.6
8 Training accuracy 78.5 Validation accuracy 79.3
9 Training accuracy 81.8 Validation accuracy 82.4
10 Training accuracy 86.4 Validation accuracy 86.5
 212,568,448,192 bytes allocated in the heap
   7,195,483,840 bytes copied during GC
     631,374,952 bytes maximum residency (164 sample(s))
      81,860,656 bytes maximum slop
             602 MB total memory in use (0 MB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause

Gen  0     96491 colls, 96491 par   38.410s   7.564s     0.0001s    0.0170s
Gen  1       164 colls,   163 par    1.183s   0.310s     0.0019s    0.0202s

Parallel GC work balance: 85.82% (serial 0%, perfect 100%)

TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.002s  (  0.001s elapsed)
MUT     time 2530.638s  (699.896s elapsed)
GC      time   39.593s  (  7.874s elapsed)
EXIT    time    0.002s  (  0.009s elapsed)
Total   time 2570.235s  (707.781s elapsed)

Alloc rate 83,997,967 bytes per MUT second

Productivity 98.5% of total user, 98.9% of total elapsed

I then changed to massiv 0.4.5.0 and got:

SGD + batchnorm
1 Training accuracy 96.5 Validation accuracy 95.9
2 Training accuracy 98.3 Validation accuracy 97.4
3 Training accuracy 98.8 Validation accuracy 97.5
4 Training accuracy 99.2 Validation accuracy 97.8
5 Training accuracy 98.9 Validation accuracy 97.4
6 Training accuracy 99.2 Validation accuracy 97.7
7 Training accuracy 99.8 Validation accuracy 98.1
8 Training accuracy 100.0 Validation accuracy 98.1
9 Training accuracy 100.0 Validation accuracy 98.2
10 Training accuracy 100.0 Validation accuracy 98.3
SGD
1 Training accuracy 11.2 Validation accuracy 11.3
2 Training accuracy 11.3 Validation accuracy 11.4
3 Training accuracy 13.5 Validation accuracy 13.3
4 Training accuracy 42.5 Validation accuracy 42.7
5 Training accuracy 62.8 Validation accuracy 63.4
6 Training accuracy 64.0 Validation accuracy 64.4
7 Training accuracy 77.6 Validation accuracy 77.7
8 Training accuracy 84.6 Validation accuracy 84.9
9 Training accuracy 86.6 Validation accuracy 87.0
10 Training accuracy 87.7 Validation accuracy 88.0
 212,565,687,504 bytes allocated in the heap
   7,203,143,048 bytes copied during GC
     631,629,448 bytes maximum residency (163 sample(s))
      83,386,304 bytes maximum slop
             602 MB total memory in use (0 MB lost due to fragmentation)

                                 Tot time (elapsed)  Avg pause  Max pause

Gen  0     97080 colls, 97080 par   38.523s   7.856s     0.0001s    0.0160s
Gen  1       163 colls,   162 par    1.353s   0.374s     0.0023s    0.0249s

Parallel GC work balance: 85.58% (serial 0%, perfect 100%)

TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.002s  (  0.002s elapsed)
MUT     time 2538.648s  (700.728s elapsed)
GC      time   39.876s  (  8.230s elapsed)
EXIT    time    0.001s  (  0.001s elapsed)
Total   time 2578.528s  (708.961s elapsed)

Alloc rate 83,731,853 bytes per MUT second

Productivity 98.5% of total user, 98.8% of total elapsed

2570s to 2579s seems a wash, unless I have misconfigured something.

There seem to be several different approaches to these tutorial examples.

These approaches are hard to benchmark, for the reasons that you have mentioned (the design parameters of the net). But it would be good to do some further comparisons.

masterdezign commented 4 years ago

Ah OK, I see that you use massiv 0.4.5.0 below. So there was no speed improvement for ghc 8.8.2 then.

Thank you for the information!