Fix missing links due to delayed parser events

rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.

https://rajatomar788.github.io/pywebcopy/

Fix missing links due to delayed parser events #105

Closed: monim67 closed this 1 year ago

monim67 commented 1 year ago

Sometimes the HTMLPullParser does not dispatch all events until it is closed (lxml bug report). Because pywebcopy does not process those delayed events, some HTML links are often never crawled, which leads to missing resources when downloading webpages. In my case, no linked HTML pages were downloaded (#63).

This PR includes:

- A fix for this issue: the delayed events are processed after the parser is closed.
- Test cases that reproduce the exact scenario.
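
The idea behind the fix can be illustrated with a minimal sketch. This is not the pywebcopy code: it uses the standard library's `xml.etree.ElementTree.XMLPullParser` as a stand-in for lxml's `HTMLPullParser` (both expose `feed`, `read_events`, and `close`), and the `collect_links` helper and the sample markup are hypothetical. The point it shows is that events are drained once more after `close()`, so anything the parser held back until then is still processed.

```python
from xml.etree.ElementTree import XMLPullParser


def collect_links(chunks):
    """Collect href values from <a> elements in a chunked document.

    Hypothetical helper for illustration only; XMLPullParser stands in
    for lxml's HTMLPullParser.
    """
    parser = XMLPullParser(events=("start",))
    links = []

    def drain():
        # Consume every event the parser has made available so far.
        for _event, elem in parser.read_events():
            href = elem.get("href")
            if elem.tag == "a" and href:
                links.append(href)

    for chunk in chunks:
        parser.feed(chunk)
        drain()

    parser.close()
    # Events that were only dispatched at close() would be lost
    # without this final drain.
    drain()
    return links


if __name__ == "__main__":
    sample = [
        '<html><body><a href="a.html">A</a>',
        '<a href="b.html">B</a></body></html>',
    ]
    print(collect_links(sample))
```

Whether a given event arrives during `feed` or only at `close` depends on the parser's internal buffering, so the final drain makes the result independent of that timing.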

rajatomar788 commented 1 year ago

@monim67 what exactly does it do, other than uncommenting already existing code? It was obviously commented out for some reason.

monim67 commented 1 year ago

The commented-out code parses the other elements besides the html and body elements added by the parser. You can verify this with the test I added: it fails if you comment the code out again.

I couldn't find out why this code was commented out; please share the reason if you know it.

monim67 commented 1 year ago

This change is not unnecessary: without it, pywebcopy fails to download a few pages during a website download. Have you checked the test I added? It reproduces this issue.

rajatomar788 commented 1 year ago

OK, I will add this to the testing routine for the next deployment.

monim67 commented 1 year ago

Is there any plan to publish the fix soon?