openzim / openedx

Open edX (to zim) scraper
GNU General Public License v3.0

Fail on missing resources #160

Closed rgaudin closed 1 year ago

rgaudin commented 4 years ago

As seen in #159, there are cases where we failed to download resources yet the scraper run still succeeded. We should fail on missing resources.

satyamtg commented 4 years ago

@rgaudin should we fail on all types of resources (which might be a bit tricky to implement, as we have some invalid URLs too), or should we fail only on major resources like videos?

rgaudin commented 4 years ago

Can you describe the failures we currently get? Which URLs fail, and for what reasons?

satyamtg commented 4 years ago

Yep. Some of them happen during the subtitle download for some videos and fail with a 404 because of invalid links in the HTML (which links are broken can be very random). We currently do check whether each download was successful and rewrite the links only for downloads that succeeded.

One solution would be to handle this explicitly for different xblocks and types of assets. A better solution would be to fail when we get errors and have exhausted all retry attempts. But then we need to make sure that the URL actually exists, so that a random invalid URL does not make the whole scraper fail.
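A minimal sketch of the second option, assuming the download goes through a single helper. All names here (`download_with_retries`, `PermanentlyMissing`, `RetriesExhausted`, the injected `fetch` callable) are hypothetical and not taken from the scraper's code. The idea is that a link the server reports as nonexistent is recorded and skipped, while an error that persists across all attempts fails the run:

```python
import time
from typing import Callable, List, Optional


class PermanentlyMissing(Exception):
    """Hypothetical: the server reports the resource does not exist (e.g. HTTP 404)."""


class RetriesExhausted(Exception):
    """Hypothetical: every attempt failed with a transient error."""


def download_with_retries(
    fetch: Callable[[str], bytes],
    url: str,
    attempts: int = 3,
    delay: float = 0.0,
    missing: Optional[List[str]] = None,
) -> Optional[bytes]:
    """Return the content, or None if the URL is an invalid link.

    `fetch` is an assumed callable that raises PermanentlyMissing for
    nonexistent resources and OSError (timeouts, connection errors)
    for transient failures.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except PermanentlyMissing:
            # Invalid link in the course HTML: record it, do not fail the run.
            if missing is not None:
                missing.append(url)
            return None
        except OSError as exc:
            # Possibly transient: wait and try again unless this was the last attempt.
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    # The URL was never reported as nonexistent, yet it never downloaded: fail loudly.
    raise RetriesExhausted(f"{url}: gave up after {attempts} attempts") from last_error
```

The caller would rewrite the link only when the return value is not `None`, and could let `RetriesExhausted` propagate to abort the scraper. A video removed from YouTube would have to be mapped to `PermanentlyMissing` by whatever wraps the video downloader, which is left out of this sketch.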

Moreover, for some links, the content might not be available. An example would be video 8 on https://mooc.phzh.ch/courses/course-v1:PHZH+W-IB+2019_E/9a122b295d484793bbf1a33ab0217a69/ , which has been removed from YouTube, and hence youtube_dl would throw an error.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

benoit74 commented 1 year ago

Would it make sense to allow some failed resources, like I did on iFixit? I mean we should probably not fail on the first missing resource, but maybe an absolute and/or a relative threshold would make sense, e.g. if more than 10% of resources are missing, it means that we have a significant bug which should fail the scraper run. Does that make sense?
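A rough sketch of such a check, to be run once at the end of the scrape. The function name `should_fail` and its parameters are made up for illustration and do not come from the iFixit or Open edX scrapers; the 10% figure is the example from the comment above:

```python
from typing import Optional


def should_fail(
    total: int,
    missing: int,
    max_missing: Optional[int] = None,
    max_ratio: Optional[float] = None,
) -> bool:
    """Decide whether the run should fail given how many resources are missing.

    max_missing: absolute threshold, fail if more than this many are missing.
    max_ratio: relative threshold, fail if more than this share is missing.
    Either threshold may be left as None to disable it.
    """
    if total < 0 or missing < 0 or missing > total:
        raise ValueError("expected 0 <= missing <= total")
    if max_missing is not None and missing > max_missing:
        return True
    # Guard against division by zero when nothing was requested at all.
    if max_ratio is not None and total > 0 and missing / total > max_ratio:
        return True
    return False


# 11 missing out of 100 is more than 10%, so the run should fail.
print(should_fail(total=100, missing=11, max_ratio=0.10))
```

Using both thresholds together covers small and large courses: the ratio catches a systematic bug, while the absolute count catches a large number of lost resources in a course big enough that the ratio stays low.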