openzim / mindtouch

libretexts.org to ZIM scraper
GNU General Public License v3.0

K-12 library has 688 missing images from flexbooks.ck12.org #82

Open benoit74 opened 4 days ago

benoit74 commented 4 days ago

When the scraper tries to download images from flexbooks.ck12.org, it is denied access because of a CloudFront WAF.

E.g. https://flexbooks.ck12.org/flx/show/THUMB_POSTCARD/image/user%3AY2sxMnNjaWVuY2VAY2sxMi5vcmc./98045-1359163835-22-2-IntPhysC-05-03-Weather-satellite.jpg redirects to https://dr282zn36sxxg.cloudfront.net/datastreams/f-d%3A0e28b5bb5ad0f030c1a8be7f2a189afc410f6a7e4f7ddd541706304e%2BIMAGE_THUMB_POSTCARD_TINY%2BIMAGE_THUMB_POSTCARD_TINY.1

As a consequence, some images are missing from the ZIM (688 out of ~15k, roughly 4%, which is not negligible).

In local tests with curl, passing the header User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:132.0) Gecko/20100101 Firefox/132.0 appears to be sufficient to avoid triggering the CloudFront protections (at least not immediately).
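A minimal Python sketch of the same idea, for illustration only. It uses the standard library rather than whatever HTTP client the scraper actually relies on, and the helper names build_image_request and fetch_image are hypothetical. It only shows the browser-like User-Agent from the curl test being attached to the request; whether this keeps working against the WAF over a full run is untested.

```python
import urllib.request

# User-Agent string taken from the curl test described above.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:132.0) "
    "Gecko/20100101 Firefox/132.0"
)


def build_image_request(url: str) -> urllib.request.Request:
    """Hypothetical helper: build a GET request with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_USER_AGENT})


def fetch_image(url: str, timeout: float = 30.0) -> bytes:
    """Hypothetical helper: download an asset and return its raw bytes.

    urllib follows HTTP redirects by default, so the redirect from
    flexbooks.ck12.org to the cloudfront.net host is handled here.
    """
    with urllib.request.urlopen(build_image_request(url), timeout=timeout) as resp:
        return resp.read()
```

Calling fetch_image with one of the failing image URLs would be a quick way to check whether the header alone is enough outside of curl.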

benoit74 commented 4 days ago

There are also 85 assets from the www.ck12.org domain, and 1 from img2.ck12.org.