openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

Crawling issue on courses.lumenlearning.com_en_all #1008

Open benoit74 opened 7 months ago

benoit74 commented 7 months ago

Zimfarm recipe: https://farm.openzim.org/recipes/courses.lumenlearning.com_en_all

We have a crawling issue on this recipe: the crawl retrieves only a handful of pages, and it looks like all courses are missing.

benoit74 commented 7 months ago

As suspected, this is a scope issue.

The current configuration scraped only these pages:

https://courses.lumenlearning.com/catalog/boundlesscourses
https://courses.lumenlearning.com/wp-login.php?redirect_to=https%3A%2F%2Fcourses.lumenlearning.com%2Fcatalog%2Fboundlesscourses
https://courses.lumenlearning.com/
https://courses.lumenlearning.com/boundless-accounting
https://courses.lumenlearning.com/boundless-algebra
https://courses.lumenlearning.com/boundless-ap
https://courses.lumenlearning.com/boundless-arthistory
https://courses.lumenlearning.com/boundless-biology
https://courses.lumenlearning.com/boundless-business
https://courses.lumenlearning.com/boundless-calculus
https://courses.lumenlearning.com/boundless-chemistry
https://courses.lumenlearning.com/boundless-communications
https://courses.lumenlearning.com/boundless-economics
https://courses.lumenlearning.com/boundless-finance
https://courses.lumenlearning.com/boundless-management
https://courses.lumenlearning.com/boundless-marketing
https://courses.lumenlearning.com/boundless-microbiology
https://courses.lumenlearning.com/boundless-physics
https://courses.lumenlearning.com/boundless-politicalscience
https://courses.lumenlearning.com/boundless-psychology
https://courses.lumenlearning.com/boundless-sociology
https://courses.lumenlearning.com/boundless-statistics
https://courses.lumenlearning.com/boundless-ushistory
https://courses.lumenlearning.com/boundless-worldhistory
https://courses.lumenlearning.com/boundless-writing
https://courses.lumenlearning.com/wp-login.php?action=lostpassword
https://courses.lumenlearning.com/wp-login.php?redirect_to=https%3A%2F%2Fcourses.lumenlearning.com
https://courses.lumenlearning.com/lumencollegesuccessxtraining2/
https://courses.lumenlearning.com/lumencollegesuccessxtraining3/
https://courses.lumenlearning.com/wp-login.php

If we go to https://courses.lumenlearning.com/catalog/boundlesscourses and open any course, we can see that all courses are hosted on domains other than courses.lumenlearning.com.

In the past we included coursehero.com, but it looks like no courses are hosted there anymore.

The current full list of micro websites for all courses is:

https://quillbot.com/courses/introduction-to-college-level-writing/
https://www.collegesidekick.com/study-guides/boundless-arthistory
https://www.collegesidekick.com/study-guides/boundless-chemistry
https://www.collegesidekick.com/study-guides/boundless-physics
https://www.collegesidekick.com/study-guides/boundless-politicalscience
https://www.collegesidekick.com/study-guides/boundless-psychology
https://www.collegesidekick.com/study-guides/boundless-ushistory
https://www.collegesidekick.com/study-guides/boundless-worldhistory
https://www.coursesidekick.com/accounting/study-guides/boundless-accounting
https://www.coursesidekick.com/business/study-guides/boundless-business
https://www.coursesidekick.com/communications/study-guides/boundless-communications
https://www.coursesidekick.com/economics/study-guides/boundless-economics
https://www.coursesidekick.com/finance/study-guides/boundless-finance
https://www.coursesidekick.com/management/study-guides/boundless-management
https://www.coursesidekick.com/marketing/study-guides/boundless-marketing
https://www.coursesidekick.com/mathematics/study-guides/boundless-algebra
https://www.coursesidekick.com/sociology/study-guides/boundless-sociology
https://www.coursesidekick.com/statistics/study-guides/boundless-statistics
https://www.nursinghero.com/study-guides/boundless-ap
https://www.nursinghero.com/study-guides/boundless-biology
https://www.nursinghero.com/study-guides/boundless-microbiology
https://www.symbolab.com/study-guides/boundless-calculus
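The hosts that any new scope has to cover can be read off this list. As a minimal sketch (plain Python, not part of the Zimfarm recipe; the helper name `distinct_hosts` is made up for illustration and only one sample URL per platform is copied from the list above):

```python
from urllib.parse import urlsplit

# One sample URL per platform, copied from the list above.
course_urls = [
    "https://quillbot.com/courses/introduction-to-college-level-writing/",
    "https://www.collegesidekick.com/study-guides/boundless-arthistory",
    "https://www.coursesidekick.com/accounting/study-guides/boundless-accounting",
    "https://www.nursinghero.com/study-guides/boundless-ap",
    "https://www.symbolab.com/study-guides/boundless-calculus",
]


def distinct_hosts(urls):
    """Return the sorted set of host names found in the given URLs."""
    return sorted({urlsplit(url).hostname for url in urls})


hosts = distinct_hosts(course_urls)
for host in hosts:
    print(host)
```

Five distinct hosts come out of this, none of which is courses.lumenlearning.com or coursehero.com.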

So I will update the include and exclude regexps.

Current scopes:

New tentative scopes:

It looks like these new scopes will be permissive enough to capture all expected courses and their subpages, and may automatically pick up new boundless courses added on the same platforms, while not including every other course on these platforms (we only want the boundless ones).

benoit74 commented 7 months ago

The include should obviously also match the source website URLs (which then redirect to the external ones).

New include: ^https:\/\/.*(lumenlearning|quillbot|collegesidekick|coursesidekick|nursinghero|symbolab)\.com\/.*(boundless|introduction-to-college-level-writing).*$
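A quick way to sanity-check an include pattern like this one is to run it against a few URLs from the lists above. The sketch below is only an illustration: it uses Python's `re` module rather than the crawler's own regex engine (for this pattern the two should behave the same, but that is an assumption), it spells the domain `quillbot` as in the course list, and the URLs marked "hypothetical" are invented paths, not real pages.

```python
import re

# Include pattern from this thread, with the domain spelled "quillbot"
# as it appears in the course URL list.
INCLUDE = re.compile(
    r"^https:\/\/.*(lumenlearning|quillbot|collegesidekick|coursesidekick"
    r"|nursinghero|symbolab)\.com\/.*"
    r"(boundless|introduction-to-college-level-writing).*$"
)


def in_scope(url):
    """Return True when the URL matches the include pattern."""
    return INCLUDE.match(url) is not None


checks = [
    # Source website URLs (they redirect to the external platforms).
    "https://courses.lumenlearning.com/catalog/boundlesscourses",
    "https://courses.lumenlearning.com/boundless-physics",
    # Course landing pages on the external platforms.
    "https://www.coursesidekick.com/accounting/study-guides/boundless-accounting",
    "https://quillbot.com/courses/introduction-to-college-level-writing/",
    # Hypothetical subpage of a boundless course.
    "https://www.symbolab.com/study-guides/boundless-calculus/some-chapter",
    # Hypothetical non-boundless course on the same platform.
    "https://www.symbolab.com/study-guides/some-other-course",
    # Hypothetical URL on a domain that is no longer listed.
    "https://www.coursehero.com/study-guides/boundless-physics",
]

for url in checks:
    print(in_scope(url), url)
```

The first five URLs match and the last two do not, which is the behavior described above: boundless courses and their subpages are in scope, other courses on the same platforms are not.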

kelson42 commented 7 months ago

@benoit74 Working fine at https://library.kiwix.org/viewer#courses.lumenlearning.com_en_all_2021-03/A/courses.lumenlearning.com/catalog/boundlesscourses. Not sure what it means...

benoit74 commented 7 months ago

At that time, all courses were hosted on the courses.lumenlearning.com domain; this is no longer the case, and courses are now hosted on www.coursesidekick.com, www.collegesidekick.com, ...

That being said, even with the proper configuration above, the scrape still fails to proceed correctly.

We are now blocked by an anti-bot protection, "Imperva", on all "subsites".

This is a screenshot of what browsertrix crawler ends up getting: image

However, even if we will probably never manage to scrape courses.lumenlearning.com again because of this problem, it looks to me like we also have a warc2zim issue: I cannot see what this screenshot shows while browsing the ZIM, although it is supposed to be captured in the WARCs (the screenshot was taken directly from the screencasting functionality of browsertrix). I keep getting errors saying that the content is not inside the ZIM. Could it be because all these pages issue 301 redirect responses? Or because the pages are on a different domain? @mgautierfr could you have a look with this task configuration: https://farm.openzim.org/pipeline/5a1ec390-df9e-430b-bffa-f0e684a5bb1d ?

benoit74 commented 7 months ago

Edit: for quillbot.com, we are blocked by Cloudflare instead of Imperva 😭

image

benoit74 commented 7 months ago

Any chance we might contact someone from Lumen/Boundless who can configure Imperva + Cloudflare to whitelist our IP?

benoit74 commented 3 months ago

I've moved this to zim-requests since this is not a new scraper problem.

Popolechien commented 3 months ago

Any chance we might contact someone from Lumen/Boundless who can configure Imperva + Cloudflare to whitelist our IP?

I missed that one. Does not hurt to ask I guess. @RavanJAltaie can you handle it please?

benoit74 commented 3 months ago

I just ran it again and it failed again, meaning the ZIM is mostly empty due to Imperva blocking: https://farm.openzim.org/pipeline/70cfac17-a663-4bbf-94e1-6ba36af86c87