Add basic support for multithreading

KyleMaas commented 4 years ago

First attempt at supporting multithreading for https://github.com/piderman314/bardecoder/issues/3 using a feature called "multithreaded". Currently hardcoded at 4 threads - not sure how this would be best configured.

piderman314 commented 4 years ago

Hey, thanks for looking into this. I'm not sure if you noticed, I have included some rudimentary benchmarks. You can run them with the nightly build of rust:

rustup install nightly cargo +nightly bench --features="benchmark" cargo +nightly bench --features="benchmark,multithreaded"

Some preliminary results:

Without the multithreaded feature test bench::needs_alignment ... bench: 3,408,080 ns/iter (+/- 10,782) test bench::version1_example ... bench: 2,970,022 ns/iter (+/- 47,668) test bench::version3_example2 ... bench: 2,405,380 ns/iter (+/- 29,229)
Multithreaded, 4 threads test bench::needs_alignment ... bench: 4,337,285 ns/iter (+/- 314,294) test bench::version1_example ... bench: 4,288,120 ns/iter (+/- 170,416) test bench::version3_example2 ... bench: 3,057,277 ns/iter (+/- 112,588)
Multithreaded, 16 threads (running a Ryzen 3700X so I do have 8 cores+ht) test bench::needs_alignment ... bench: 5,434,150 ns/iter (+/- 154,244) test bench::version1_example ... bench: 5,290,420 ns/iter (+/- 774,591) test bench::version3_example2 ... bench: 3,863,680 ns/iter (+/- 133,465)

Pretty much as I expected. I see that you are using thread safe structures (Arc, Mutex). For loads of very small tasks this will incur a huge overhead in reference counting, waiting on that mutex etc.

Probably a better way to handle this is to chop the image up in a number of parts equal to the number of threads and have each thread process a chunk of data. Not 100% sure it will be faster than non-threaded, but for sure it will be faster than using a pool ;)

Pools are cool if each task takes a long time (like sending network requests). For this kind of small work it's not well suited.

It's a good idea to keep this behind a feature until it does actually improve performance ;)

P.S. Can you please include a line in the .travis.yml in the script section,

- cargo check --features "multithreaded"

KyleMaas commented 4 years ago

Ah, that's cool, but somewhat unfortunate that it only works on nightly. I'm on Gentoo Linux and cargo and rust are both OS packages. I struggled for a while with trying to get rustup going without running into collisions, but was never successful.

I'll see what I can do to restructure things. I think minimizing the time the mutex is locked would probably help, too - thinking back on it, I think I left it locked for pretty much the entirety of a line's processing.

KyleMaas commented 4 years ago

I should say: cargo and rust are both installed via OS package - I think they're actually from the same system package.

KyleMaas commented 4 years ago

The main reason why this would be beneficial for me is that my current use case is for a project which processes scans of documents with multiple QR codes on a page. As such, it deals with enormous input images. In the process of trying to tackle this, I've learned a couple things:

linescan is not where the bulk of the processing time is spent, so my current efforts aren't actually making that much of a difference.
linescan's locator candidates are apparently dependent on the order in which pixels are processed. I don't know if the refinement process runs differently when started from a different row of the image or what's going on, but when run as multithreaded, sometimes tests will fail. This seems to depend on the number of threads, which leads me to believe it's based on which slice of the image the locator candidate is found in and which slice the locator candidate is thrown out in.

KyleMaas commented 4 years ago

There. That now passes all tests and is slightly faster for the multithreaded case, at least for me. It uses crossbeam directly instead of rayon, and now uses slices and vector extends rather than mutexes.

Still doesn't make as much of a performance difference as I'd like, but it does work now.

KyleMaas commented 4 years ago

Looking through extract, since it's searching from most likely to least likely, I would think the overhead of multithreading would be unwise for images which are mostly flat.

blockedmean is iterating over mutable pixels, which I would think would be faster than trying to work with individual pixels on an image across multiple threads - better to have pixels be mutable than to keep locking and unlocking mutexes.

Anywhere else you see benefiting from multithreading?

piderman314 commented 4 years ago

Yeah so once a locator is found, if it finds another candidate that is very close by (at most the size of the locator away), it will discard it and only keep the first. I can imagine that if you split the image across a locator, then it will find it twice and things will mess up? Not 100% sure, I haven't looked into it very closely.

I did notice that without the multithreaded feature, the second benchmark (version1_example) is consistently slower in this PR. Unfortunately, all benchmarks are still significantly slower with the multithreaded feature enabled.

Can you post an example of an image that you are trying to process? There might be other ways to speed things up, like resizing the image.

I'm thinking about where to add multithreading, but honestly I wouldn't even know. The smaller examples all run in 2-3ms for the entire process, it would be extremely difficult to get some improvement because of the threading overhead. Which in itself also speaks to the speed of Rust ;)

piderman314 commented 4 years ago

In fact if you're trying to process multiple scans of documents, I would highly recommend multithreading the documents since bardecoder should be perfectly thread safe to use.

KyleMaas commented 4 years ago

Yeah, threading across pages is what I had originally planned on doing. But then I saw your issue you posted and figured: "eh, maybe I can try to be helpful here - it might help". :)

Since I don't have Rust nightly, I've been running time over a cargo test. Without multithreading, it's around 30s on my machine. With multithreading, it's around 27s. So, like I said, the difference is very small. It may not even be worth the bother. But I figure it was still worth trying.

KyleMaas commented 4 years ago

If nothing else, it got me more accustomed to the goofy static lifetime issues that come about from trying to use std::thread directly. :)

KyleMaas commented 4 years ago

In testing this again on one of my original test cases, this is currently about 5% slower. This is on a huge image with several QR codes buried in it. There are surely other/better ways to do it, but I'm not sure at this point that multithreading linescan is worth it anyway.

If you want to close this, you can. Otherwise, feel free to grab it and see if you can optimize it further. I think I'll leave the branch, but I don't think I can do too much more to it to make it better with the way it's currently built. The two-step duplicate finding is probably not helping, but I couldn't get it to work properly and consistently with varying thread counts without either doing it that way or spending way more time refining everything before duplicate elimination.

KyleMaas commented 1 year ago

Realized this was still open. I'm going to go ahead and close this.

piderman314 / bardecoder

Add basic support for multithreading #23