Introduce multithreading

satyamtg commented 4 years ago

We can easily support multithreading here by running the download method of the xblock_extractor objects in multiple threads. However, videos fetched through youtube_dl need to go in a separate queue, as that's throttled. I think we need to handle that properly, since multithreading drastically improves the performance of this scraper. Maybe we can have a main multithreaded process for everything else (it makes many HTTP requests) and handle YouTube separately.
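
A minimal sketch of the split described above, not the scraper's actual code: regular downloads share a thread pool while throttled ones go through a single-worker pool, so they run one at a time. The `download_all` helper, the `is_throttled` predicate and the no-argument `download()` signature are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def download_all(extractors, is_throttled, max_workers=8):
    """Run every extractor's download(), keeping throttled ones sequential.

    Assumptions for this sketch: each extractor exposes a no-argument
    download() method, and is_throttled(extractor) tells us whether it
    must go through the throttled (youtube) queue.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as general, \
            ThreadPoolExecutor(max_workers=1) as throttled:
        futures = [
            (throttled if is_throttled(x) else general).submit(x.download)
            for x in extractors
        ]
        # Block until every download has finished or failed.
        wait(futures)
    # result() re-raises any exception raised inside a worker thread.
    return [f.result() for f in futures]
```

Results come back in the same order as `extractors`, and a failure in any worker surfaces as an exception in the caller rather than being silently dropped.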

rgaudin commented 4 years ago

Agreed. Thanks for your experiments with multiprocessing.

This is very similar to other scrapers in that we have concurrent usages:

- long CPU-intensive stuff we don't want to supervise (ffmpeg)
- CPU-intensive stuff we want to supervise (image optimization)
- unthrottled downloads
- throttled downloads
- unthrottled uploads

That's a lot of requirements, which calls for flexibility. Also, we definitely want to assess our S3 performance before getting into this, as we need to know where the bottlenecks are and which methods perform best for those download/upload use cases.
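
One way to picture the flexibility those categories call for is one executor per workload, each sized independently. This is only a sketch: the `Pools` class, the category names and the worker counts are all hypothetical, and it assumes the CPU-heavy work runs in external processes (as ffmpeg does), so threads are enough to drive it.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker counts per workload category; tune after measuring.
POOL_SIZES = {
    "ffmpeg": 1,              # long CPU-bound jobs, not supervised
    "image_optimization": 2,  # CPU-bound jobs whose results we check
    "download": 8,            # unthrottled downloads
    "throttled_download": 1,  # e.g. youtube, one at a time
    "upload": 4,              # unthrottled uploads
}


class Pools:
    """One ThreadPoolExecutor per workload category."""

    def __init__(self, sizes=POOL_SIZES):
        self._pools = {
            name: ThreadPoolExecutor(max_workers=n, thread_name_prefix=name)
            for name, n in sizes.items()
        }

    def submit(self, category, fn, *args, **kwargs):
        # Returns a Future; callers that want supervision keep it,
        # fire-and-forget callers can simply discard it.
        return self._pools[category].submit(fn, *args, **kwargs)

    def shutdown(self):
        # Wait for all pending work in every pool to finish.
        for pool in self._pools.values():
            pool.shutdown(wait=True)
```

Keeping the sizes in a single mapping means the numbers can be adjusted once the S3 measurements are in, without touching the calling code.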

All of this makes it quite complex, which is why I think we should attempt to solve it on a less fragile scraper (youtube?) first, then document it and replicate it onto the others.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

openzim / openedx

Introduce multithreading #63