openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0
EatingWell zimit attempts fail #224

Closed DaveD5501 closed 12 months ago

DaveD5501 commented 1 year ago

I double-checked the URL and zimit gave a failure on two attempts.

I was using the default options / settings.

URL: https://www.eatingwell.com/

benoit74 commented 1 year ago

I will have a look at it. Sorry about that. Do you know if the site is "protected" by a CDN or something like that?

DaveD5501 commented 1 year ago

Thank you. I appreciate what you are doing.

"Do you know if the site is "protected" by a CDN or something like that?" I do not know - I had to look up "CDN".

Each time, Zimit took a long time working on the site before issuing the Failure.

Dave

benoit74 commented 1 year ago

Hello,

I confirm there is an issue most probably linked to some form of protection against denial of service on the website, and this prevents us from proceeding with the content retrieval.

I tested on a server and on my desktop and got the following result:

I suspect this is linked to some kind of ASN / IP filtering.

I will confirm with the team what we usually do in such a situation.

benoit74 commented 1 year ago

It is indeed worse than expected. It does not work anymore on my desktop machine. Looks like protection is pretty aggressive.

DaveD5501 commented 1 year ago

Ok, thanks for looking into this for me.

I appreciate your time.

Dave

benoit74 commented 1 year ago

I diagnosed this a bit further.

This morning I can request the website via curl again, so clearly there is some kind of active protection.

I managed to start the website crawl (which is the first phase before creating the ZIM) successfully by passing a User Agent.

@DaveD5501: could you please try requesting the ZIM again on youzim.it, this time using the "advanced option"?

In the option named "User Agent", input a valid User Agent from a browser. You might, for instance, use mine, which is "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15" (without the quotes).

This should be enough to make the crawl start successfully. I will monitor this and, should it fail, I will keep you informed here.
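For readers wondering what supplying a User Agent changes at the HTTP level, here is a minimal sketch using only the Python standard library. It simply attaches the User Agent string quoted above to a request for the site root. The names `BROWSER_UA` and `build_request` are hypothetical and chosen for illustration; this is not how zimit or youzim.it passes the option internally.

```python
import urllib.request

# User Agent string quoted in the comment above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) "
    "Version/17.0 Safari/605.1.15"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build (but do not send) a request carrying a browser-like User-Agent.

    Hypothetical helper for illustration only.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("https://www.eatingwell.com/")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Nothing is sent over the network here; passing `request` to `urllib.request.urlopen` would perform the actual fetch with that header.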

DaveD5501 commented 1 year ago

I tried it again using your User Agent setting. Rather quickly, I got a failure "Youzim.it task 84262 failed".

Thanks for your efforts.

Dave

benoit74 commented 1 year ago

I'm very sorry about that; there is a bug in the handling of parameter values with spaces.

We will have to fix this before you can proceed. You can track this upstream here: https://github.com/openzim/zimfarm/issues/847

benoit74 commented 1 year ago

Oh, no, I'm wrong: a parameter without quotes is processed properly.

It is just that the User-Agent value is still not working. I will have a look at it on Monday; I suspect I already know the reason.

DaveD5501 commented 1 year ago

Ok, thanks for the update.

benoit74 commented 1 year ago

By fixing #226, I finally found the real issue. The server does not like our check of the root URL:

failed to connect to https://www.eatingwell.com/: 406 Client Error: Not Acceptable for url: https://www.eatingwell.com/

Fixing #227 will solve the issue.

rgaudin commented 1 year ago

While doing this, we may also consider making a GET request, stopping after the first bytes received. Some web servers don't implement HEAD and the scrape would fail for an invalid reason.
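The idea above (check the root URL with GET and stop after the first bytes, rather than relying on HEAD) could be sketched as follows with the Python standard library. This is only an illustration under stated assumptions: `probe_url` and its parameters are hypothetical names, and the code is not taken from zimit.

```python
import urllib.error
import urllib.request


def probe_url(url, user_agent=None, max_bytes=1, timeout=10.0):
    """Return the HTTP status of `url`, using GET and reading few bytes.

    Hypothetical helper for illustration; not zimit's actual check.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers, method="GET")
    try:
        # urlopen returns once the status line and headers have arrived.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Read only the first bytes; leaving the `with` block closes
            # the connection without reading the rest of the body.
            response.read(max_bytes)
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises 4xx/5xx responses (for example 406) as HTTPError.
        return error.code
```

A caller could then treat any status below 400 as "reachable", for example `probe_url("https://www.eatingwell.com/", user_agent="...") < 400`. Connection-level failures (DNS errors, timeouts) are deliberately left to propagate as exceptions in this sketch.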

benoit74 commented 1 year ago

"While doing this, we may also consider making a GET request, stopping after the first bytes received. Some web servers don't implement HEAD and the scrape would fail for an invalid reason."

Tracked as https://github.com/openzim/zimit/issues/230 now

benoit74 commented 1 year ago

@DaveD5501 now that code has been merged, could you please try to request the ZIM again (no additional parameter needed anymore, just provide the URL)? I tried to start the task and it worked well, not sure it will complete successfully but at least it made much more progress than before.

Thank you for your patience (and thank you for helping us fix all these issues, we made good progress)

DaveD5501 commented 12 months ago

Success! I have my ZIM. I clicked around enough to verify the results.

It took 12 hours and finished overnight. I watched it wait for a slot for several hours - so the long time to complete might have been due to slot availability.

THANK YOU very much.

benoit74 commented 12 months ago

Great it finally worked, and thank you again for reporting this and for your patience, it helped us solve an issue probably faced by other users.

benoit74 commented 12 months ago

And btw, yes, the 12 hours is just "bad luck". There were other requests in the pipe, and a lot of them also took significant time to complete; if you requested it now, it would start processing immediately ^^