openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

Increase overall speed by bringing back non-sequential media files scraping #836

Open kelson42 opened 5 years ago

kelson42 commented 5 years ago

By introducing node-libzim, we made text and media retrieval sequential: we first retrieve all the article texts and then all the media files. This was done because we need to know the maximum resolution required for each picture, and we thought we had no way to know that before scraping the last article.

This works fine, but it slows down the overall process, because image retrieval and optimisation take a lot of CPU (and text retrieval almost none). That part takes at least twice as long as the text retrieval (in my experience). If it could run at the same time as the text retrieval, the overall scraping time would be shorter.

To improve this, we could retrieve information about linked pictures at the same time as we retrieve redirects, coordinates, etc. about articles, that is, at the beginning of the process. We could do that without making more API requests, just by adding prop=images like in this example.
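As a rough illustration of the idea, here is a minimal sketch of such a combined request against the MediaWiki Action API. The helper name `buildArticleDetailsUrl`, the exact set of props and the example wiki URL are assumptions made for illustration, not what mwoffliner actually sends. Note that prop=images only returns the titles of the files used on each page, not the size at which they are displayed.

```typescript
// Hypothetical helper (not part of mwoffliner): builds one Action API query
// that asks for image links together with redirects and coordinates.
function buildArticleDetailsUrl(apiUrl: string, titles: string[]): string {
  const params = new URLSearchParams({
    action: "query",
    format: "json",
    // "images" is simply appended to the props that are requested anyway.
    prop: "redirects|coordinates|images",
    rdlimit: "max",
    colimit: "max",
    imlimit: "max",
    // Several titles can be batched in one request, separated by "|".
    titles: titles.join("|"),
  });
  return `${apiUrl}?${params.toString()}`;
}

// Example wiki endpoint and titles, chosen only for illustration.
const exampleUrl = buildArticleDetailsUrl(
  "https://en.wikipedia.org/w/api.php",
  ["Paris", "London"],
);
```

Long result lists are paginated by the API, so a real implementation would also have to follow the `continue` values returned in the response.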

By doing so, we know at the start of the process which images are used in which articles. That way, once an article has been scraped, we know for each image it includes not only the size at which it is displayed, but also whether another article still has to be scraped before the maximum resolution needed for that picture is known. We just need to check whether we have scraped all the articles that include that picture. If so, we know the maximum resolution and can start scraping the media file right away.
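The bookkeeping described above could look roughly like the following sketch. The class `ImageResolutionTracker`, its method names and the use of a pixel width as the "resolution" are all hypothetical and only illustrate the idea; they are not taken from the mwoffliner code base.

```typescript
// Hypothetical sketch: tracks, for each image, which articles still have to
// be scraped before its maximum display width is known.
class ImageResolutionTracker {
  // image title -> articles using it that are not scraped yet
  private pending = new Map<string, Set<string>>();
  // image title -> largest width seen so far
  private maxWidth = new Map<string, number>();

  // imagesByArticle is the article -> images mapping known up front.
  constructor(private imagesByArticle: Map<string, string[]>) {
    imagesByArticle.forEach((images, article) => {
      images.forEach((image) => {
        if (!this.pending.has(image)) this.pending.set(image, new Set<string>());
        this.pending.get(image)!.add(article);
      });
    });
  }

  // Call once an article has been scraped, with the widths found in its HTML.
  // Returns the images whose maximum width is now final.
  articleScraped(
    article: string,
    widths: Map<string, number>,
  ): Array<{ image: string; width: number }> {
    const ready: Array<{ image: string; width: number }> = [];
    (this.imagesByArticle.get(article) ?? []).forEach((image) => {
      const waiting = this.pending.get(image);
      if (!waiting) return; // already released
      const seen = widths.get(image) ?? 0; // 0 if absent from the rendered HTML
      this.maxWidth.set(image, Math.max(this.maxWidth.get(image) ?? 0, seen));
      waiting.delete(article);
      if (waiting.size === 0) {
        // No other article can raise the width: the download can start now.
        this.pending.delete(image);
        ready.push({ image, width: this.maxWidth.get(image)! });
      }
    });
    return ready;
  }
}

// Usage with made-up article and file names.
const tracker = new ImageResolutionTracker(
  new Map([
    ["Paris", ["File:Eiffel.jpg", "File:Map.png"]],
    ["France", ["File:Eiffel.jpg"]],
  ]),
);
const afterParis = tracker.articleScraped(
  "Paris",
  new Map([["File:Eiffel.jpg", 300], ["File:Map.png", 200]]),
);
const afterFrance = tracker.articleScraped(
  "France",
  new Map([["File:Eiffel.jpg", 800]]),
);
```

In this example `File:Map.png` is released as soon as "Paris" is done, while `File:Eiffel.jpg` has to wait for "France", where it turns out to be needed at a larger width.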

That way, reaching the last article to scrape (text) will also mean that there are no more pictures left to scrape.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 commented 1 year ago

@uriesk Now that you are more familiar with the overall workflow, its requirements and challenges, this old ticket might be interesting reading, in particular in relation to #1199.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.