millardjn / rusty_sr

Deep learning superresolution in pure rust

Hosted rusty_sr #7

Closed anowell closed 7 years ago

anowell commented 7 years ago

So I finally managed to get this live on Algorithmia.

If you're interested in owning/maintaining it, I'm happy to help make that happen. Our process for transferring algorithms needs a bit of work, but for now I have a mirror of the repo on GitHub, so it should be possible to:

  1. Create a new Rust algorithm yourself from the website
  2. Clone it: git clone https://git.algorithmia.com/git/USERNAME/ALGONAME.git
  3. Merge in my mirror (my git-fu here isn't beautiful):
git remote add original https://github.com/anowell/enhance_resolution.git
git fetch original
git merge -s recursive -X theirs original/master --allow-unrelated-histories
# fix algorithmia.conf and Cargo.toml to have your username & algoname
  4. Make sure it builds, and then push it
  5. Update the description, publish it, add sample input, etc.
  6. Let me know, and I'll effectively hide my implementation from the marketplace and point it at your version

(and feel free to just close this if you're not interested in owning/maintaining it on Algorithmia)

millardjn commented 7 years ago

Hi, thanks for all the help. I've got the copy up: https://algorithmia.com/algorithms/zza/EnhanceResolution I already have some extensions in the works!

A couple of questions regarding optimizing compilation though: What minimum hardware does this generally run on (single/multicore, SSE2/AVX/AVX2 availability)? Is it possible to set compiler flags? Are any already set in the RUSTFLAGS env variable?

Cheers!

anowell commented 7 years ago

Sweet! I've effectively de-listed my implementation from the marketplace (still exists, but soon it will stop showing up in listings).

We don't currently allow setting compiler flags yourself, but your questions reminded me that while we did set some CFLAGS/CXXFLAGS, I neglected to set RUSTFLAGS, which, if I understand correctly, defaults to the pentium4 arch for x86_64, getting you SSE2. I can easily set -C target-cpu=core2 (matching our CFLAGS -march), which adds SSE3. The public marketplace does currently run on multicore Haswell/Broadwell CPUs, though, so I'll check with the team on the feasibility of bumping that to haswell (getting you AVX/AVX2).
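
For anyone wanting to check what a given build and host actually provide, here is a minimal sketch (not part of rusty_sr; the function name `simd_report` is made up for illustration). It compares the features the compiler was allowed to use, which depend on flags such as RUSTFLAGS="-C target-cpu=haswell", against what the CPU reports at runtime. It assumes an x86 or x86_64 target and returns an empty list elsewhere.

```rust
/// Returns (feature name, enabled at compile time, supported by the running CPU).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn simd_report() -> Vec<(&'static str, bool, bool)> {
    vec![
        // cfg!(target_feature = ...) reflects -C target-cpu / -C target-feature.
        // is_x86_feature_detected! queries the CPU the binary is running on.
        ("sse2", cfg!(target_feature = "sse2"), is_x86_feature_detected!("sse2")),
        ("sse3", cfg!(target_feature = "sse3"), is_x86_feature_detected!("sse3")),
        ("avx", cfg!(target_feature = "avx"), is_x86_feature_detected!("avx")),
        ("avx2", cfg!(target_feature = "avx2"), is_x86_feature_detected!("avx2")),
        ("fma", cfg!(target_feature = "fma"), is_x86_feature_detected!("fma")),
    ]
}

/// Fallback for non-x86 targets: nothing to report.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn simd_report() -> Vec<(&'static str, bool, bool)> {
    Vec::new()
}

fn main() {
    for (name, compiled, detected) in simd_report() {
        println!("{:5} compiled-in: {:5} runtime: {}", name, compiled, detected);
    }
}
```

A feature that shows up as available at runtime but not compiled in is headroom that a higher target-cpu could unlock, provided every machine in the fleet supports it.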

anowell commented 7 years ago

Question: what sort of improvements do you see with SSE2/AVX/AVX2? Using my i5 Broadwell laptop and a 1MP photo with rusty_sr, I can't see any noticeable difference in perf between the default, pentium4, haswell, or native (broadwell) target-cpu. Perhaps the perf benefits really only affect training?

millardjn commented 7 years ago

Thanks for the marketplace hardware/config info!

Going from target-cpu=snb to no flags I'm getting a ~50% reduction in matrixmultiply throughput (AVX to SSE2), but only a ~15% reduction in rusty_sr throughput. Going from AVX to AVX2 is theoretically another doubling in matmult performance, but the LLVM codegen for FMA instructions has been a bit flaky.
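
As a rough back-of-the-envelope reading of those two numbers, the sketch below applies Amdahl's law in reverse to estimate what share of the runtime the matrix multiply accounts for. It rests on assumptions not stated in the thread: that a ~50% throughput drop means the kernel takes twice as long, that the rest of the program is unaffected by the flag change, and that the two percentages were measured on comparable workloads. The function name is hypothetical.

```rust
/// Fraction of baseline runtime spent in a kernel, inferred from how much the
/// whole program slows down when only that kernel slows down.
///
/// `kernel_throughput_ratio`: kernel throughput after / before (0.5 for a 50% drop).
/// `overall_throughput_ratio`: program throughput after / before (0.85 for a 15% drop).
fn kernel_time_fraction(kernel_throughput_ratio: f64, overall_throughput_ratio: f64) -> f64 {
    // Runtime scales inversely with throughput.
    let kernel_slowdown = 1.0 / kernel_throughput_ratio;
    let overall_slowdown = 1.0 / overall_throughput_ratio;
    // Amdahl's law: overall_slowdown = (1 - f) + f * kernel_slowdown; solve for f.
    (overall_slowdown - 1.0) / (kernel_slowdown - 1.0)
}

fn main() {
    let f = kernel_time_fraction(0.5, 0.85);
    println!("implied matmul share of AVX runtime: {:.1}%", f * 100.0);
}
```

Under these assumptions the matrix multiply would account for only about 18% of the AVX runtime, which is consistent with speedups in the kernel having a muted effect on the whole program.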

There is some inverse Amdahl's going on here, which I think is partially down to the lack of adaptive lowering from convolution to matmult, which currently produces very long, thin (inefficient) matrices. Anyway, I can measure a difference, but it's not going to be huge for rusty_sr until I fix a bunch of other stuff.
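
To illustrate why lowering a convolution to a matrix multiply can give long, thin matrices, here is a small sketch that computes the GEMM shape for an im2col-style lowering. This is an assumption about the general technique, not a description of what rusty_sr actually does, and the layer sizes in `main` are hypothetical.

```rust
/// GEMM dimensions (m, k, n) for an im2col-style lowering of a 2D convolution:
/// an (m x k) patch matrix multiplied by a (k x n) filter matrix.
fn im2col_gemm_shape(
    out_h: usize,
    out_w: usize,
    in_ch: usize,
    kernel: usize,
    out_ch: usize,
) -> (usize, usize, usize) {
    let m = out_h * out_w; // one row per output pixel
    let k = in_ch * kernel * kernel; // one column per input value under the kernel window
    let n = out_ch; // one column per filter
    (m, k, n)
}

fn main() {
    // Hypothetical first layer on a ~1MP image: 3 input channels, 3x3 kernel, 32 filters.
    let (m, k, n) = im2col_gemm_shape(1000, 1000, 3, 3, 32);
    println!("patch matrix: {} x {}, filter matrix: {} x {}", m, k, k, n);
}
```

With these illustrative sizes the patch matrix has a million rows but only 27 columns, so the multiply does little arithmetic per byte moved, which is the kind of shape a blocked matmul kernel handles poorly.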