Improvements suggestion

geor-kasapidi commented 2 years ago

Hi @madebyollin! In the article you asked about the level1 optimisation flag; please take a look at this code. I recommend compiling your MPSGraph instances using the suggested approach. The compiled version is usually faster :)

geor-kasapidi commented 2 years ago

Also, in my experience, NCHW image tensors are slightly faster than the NHWC data layout. But be careful: converting an MPSImage to MPSGraphTensorData requires a double tensor transposition. These transpositions are better performed as a separate graph; I've seen a performance hit when I insert the transpositions right after the placeholder.
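For readers unfamiliar with the layouts, here is a minimal sketch of what a "double transposition" from NHWC to NCHW could look like, assuming it means two pairwise axis swaps. It is plain Python on a flat row-major list, not MPSGraph code, and the helper names `swap_axes` and `nhwc_to_nchw` are made up for illustration.

```python
from itertools import product

def swap_axes(data, shape, a, b):
    """Exchange axes a and b of a row-major tensor stored as a flat list."""
    new_shape = list(shape)
    new_shape[a], new_shape[b] = new_shape[b], new_shape[a]
    out = [None] * len(data)
    for idx in product(*(range(s) for s in shape)):
        # Row-major offset of idx in the source tensor.
        src = 0
        for i, s in zip(idx, shape):
            src = src * s + i
        # Same element with the two coordinates exchanged, in the destination.
        new_idx = list(idx)
        new_idx[a], new_idx[b] = new_idx[b], new_idx[a]
        dst = 0
        for i, s in zip(new_idx, new_shape):
            dst = dst * s + i
        out[dst] = data[src]
    return out, tuple(new_shape)

def nhwc_to_nchw(data, shape):
    # First swap: NHWC -> NHCW (exchange W and C).
    data, shape = swap_axes(data, shape, 2, 3)
    # Second swap: NHCW -> NCHW (exchange H and C).
    return swap_axes(data, shape, 1, 2)
```

Each swap rewrites the whole buffer, so doing the conversion on every run has a cost regardless of where it is placed; the sketch says nothing about which placement is faster.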

madebyollin commented 2 years ago

Thanks for the tips! Turning on the level1 flag seems to have mysterious results (I tried it just on part 3 of the UNet to start):

- A huge delay at the start of the level1-compiled executable's run
- An eventual segfault (Thread 8: EXC_BAD_ACCESS (code=1, address=0x440404410c010880)) during the level1-compiled executable's run

I can try to narrow down further and see if it works on some smaller subgraph, I guess.

I also do need to try NCHW - thanks for the suggestion! Checking my notes, I think I had assumed NHWC would be faster without ever verifying. I don't need to do any conversion to MPSImage AFAIK - but I will need to change the permutes used during self/cross-attention.
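To illustrate why the attention permutes depend on the layout, here is a small sketch in plain Python. It assumes attention consumes a sequence of H*W tokens with C features each, for a single image; the function names are hypothetical and this is not the project's actual code.

```python
def tokens_from_nhwc(flat, h, w, c):
    # NHWC is already token-major: each consecutive run of c values is one token,
    # so a reshape is enough and no data has to move.
    return [flat[t * c:(t + 1) * c] for t in range(h * w)]

def tokens_from_nchw(flat, c, h, w):
    # NCHW is channel-major: token t gathers one value from each channel plane,
    # which corresponds to a reshape to (C, H*W) followed by a permute.
    hw = h * w
    return [[flat[ch * hw + t] for ch in range(c)] for t in range(hw)]
```

Both functions return the same token list for the same image, so under these assumptions switching to NCHW moves the permute rather than removing it.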

madebyollin / maple-diffusion

Improvements suggestion #6