prajnak / img_preprocess


How do you find bottlenecks in current code at scale? #1

Open prats226 opened 6 years ago

prats226 commented 6 years ago

Suppose you scaled the current solution to a large number of datasets and images. How would you find the bottlenecks in the current solution?

prajnak commented 6 years ago

We'd need to profile the program to figure out which of the three stages (download, resize, upload) takes the most time. Python has a timeit module that can be very handy for timing small pieces of code like this. Anyway, since we use ThreadPoolExecutor, which is well suited to I/O-heavy workloads, I expect the bottleneck in our case to be the image processing steps. Those can be optimized further by moving away from an easy-to-use library like Pillow to something like OpenCV or ImageMagick, to be decided after benchmarking each of these libraries for our use case.
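A minimal sketch of the per-stage timing described above. The `download`, `resize`, and `upload` functions here are hypothetical stand-ins that just sleep, not the project's real code, and the `timed` helper and `stage_totals` dictionary are names made up for illustration. The idea is only to show how wall-clock time could be accumulated per stage so the slowest one stands out.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per stage name.
stage_totals = defaultdict(float)
# Protects stage_totals when process() is called from several worker threads.
totals_lock = threading.Lock()


@contextmanager
def timed(stage):
    """Add the elapsed wall-clock time of the enclosed block to stage_totals[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        with totals_lock:
            stage_totals[stage] += elapsed


# Hypothetical stand-ins for the real pipeline steps (assumptions, not project code).
def download(url):
    time.sleep(0.02)
    return b"raw-bytes"


def resize(data):
    time.sleep(0.05)
    return data


def upload(data):
    time.sleep(0.01)


def process(url):
    with timed("download"):
        data = download(url)
    with timed("resize"):
        data = resize(data)
    with timed("upload"):
        upload(data)


for i in range(3):
    process(f"image-{i}")

# Report stages from slowest to fastest.
slowest = max(stage_totals, key=stage_totals.get)
for stage, seconds in sorted(stage_totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage}: {seconds:.3f}s")
```

With real stage functions swapped in, the same wrapper works unchanged inside ThreadPoolExecutor workers, since the totals are updated under a lock.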

prajnak commented 6 years ago

The biggest bottleneck when we use Python (CPython) is the GIL, the Global Interpreter Lock, which can really hurt performance when multiple threads contend for the lock. However, as I mentioned earlier, CPython releases the GIL while a thread is blocked on I/O, so I/O-heavy loads are not held back by it, and thus ThreadPoolExecutor is a nice choice for our use case.
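To illustrate the point about blocking calls and the GIL, here is a small sketch that uses `time.sleep` as a stand-in for a network request (an assumption for the demo; `time.sleep` releases the GIL much like a blocking socket read does). The function name `fake_io` and the task counts are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

TASKS = 8
DELAY = 0.1  # seconds each simulated request spends waiting


def fake_io(_):
    # Stand-in for a download or upload: the thread blocks without holding the GIL.
    time.sleep(DELAY)


# Run the simulated requests one after another.
start = time.perf_counter()
for i in range(TASKS):
    fake_io(i)
sequential = time.perf_counter() - start

# Run the same requests on a thread pool so the waits overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=TASKS) as pool:
    list(pool.map(fake_io, range(TASKS)))
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

The threaded run should take roughly one delay rather than eight, because the waiting threads do not compete for the interpreter lock.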

It also makes sense to test the performance of ProcessPoolExecutor on the same workload, but we'd need a beefier EC2 instance (more cores) to get that to work well. ProcessPoolExecutor doesn't suffer from GIL performance issues, but that doesn't matter much for our I/O-heavy workload.
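One way to run that comparison is a small helper that times the same workload on whichever executor class is passed in. This is only a sketch: `time_executor` and `busy_work` are hypothetical names, and `busy_work` is a toy CPU-bound function standing in for the resize step.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def busy_work(n):
    # Toy CPU-bound task standing in for image resizing (an assumption for the demo).
    return sum(i * i for i in range(n))


def time_executor(executor_cls, func, items, max_workers):
    """Run func over items on the given executor class; return (seconds, results)."""
    start = time.perf_counter()
    with executor_cls(max_workers=max_workers) as pool:
        results = list(pool.map(func, items))
    return time.perf_counter() - start, results


# Same call shape works for ProcessPoolExecutor; only the class changes.
elapsed, results = time_executor(ThreadPoolExecutor, busy_work, [100_000] * 4, max_workers=4)
print(f"ThreadPoolExecutor: {elapsed:.3f}s")
```

To try processes, import `ProcessPoolExecutor` from `concurrent.futures` and pass it as the first argument. The function being mapped must then be defined at module top level so it can be pickled, and the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes.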