rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

Unreliable iterator based incremental parsing #123

Open TLCFEM opened 6 months ago

TLCFEM commented 6 months ago

https://github.com/rajatomar788/pywebcopy/blob/9f35b4b6a4da2125e70d8f7a21100de1f09012f4/pywebcopy/parsers.py#L104

If a chunk boundary falls inside the value of an href attribute, nothing after that point is parsed.

See example:

Wrong:

    from lxml import etree

    parser = etree.HTMLPullParser()
    for data in (b'<root><a href="2011-03-13_',  b'135411/">2011-03-13_135411/</a></root>',):
        parser.feed(data)
        for _, elem in parser.read_events():
            print(elem.tag) # nothing
    parser.close()

Expected:

    from lxml import etree

    parser = etree.HTMLPullParser()
    for data in (b'<root><a href="2011-03-13_135411/">2011-03-13_135411/</a></root>',):
        parser.feed(data)
        for _, elem in parser.read_events():
            print(elem.tag) # a root
    parser.close()

It may be better just to feed all at once.

        parser.feed(source.fp.data)
        for event, element in parser.read_events():
            for child in links(element):
                if child is None:
                    continue
                yield child
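
A minimal sketch of the "feed all at once" idea, separate from the pywebcopy code above. The helper name `feed_all_at_once` is hypothetical, and it only assumes a parser object exposing `feed()`, `read_events()` and `close()`, which is the interface `etree.HTMLPullParser` provides:

```python
def feed_all_at_once(parser, chunks):
    """Join every chunk, feed the parser a single time, and yield its events.

    `parser` is assumed to behave like lxml.etree.HTMLPullParser
    (feed / read_events / close); `chunks` is any iterable of bytes.
    """
    # One feed() call means no chunk boundary can land inside an attribute value.
    parser.feed(b"".join(chunks))
    yield from parser.read_events()
    parser.close()
    # Pick up any events that were only produced while closing.
    yield from parser.read_events()
```

With lxml this would presumably be called as `feed_all_at_once(etree.HTMLPullParser(), chunks)`. The trade-off is that the whole document is held in memory before parsing starts.
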

rajatomar788 commented 6 months ago

@TLCFEM there is another loop outside the while loop for cases like this which iterates any leftover tags. And the chunks being fed are large enough to avoid this problem.

TLCFEM commented 6 months ago

> @TLCFEM there is another loop outside the while loop for cases like this which iterates any leftover tags. And the chunks being fed are large enough to avoid this problem.

Please try this link: https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/

The saved html:

<html>
<!--
* PyWebCopy Engine [version 7.0.2]
* Copyright 2020; Raja Tomar
* File mirrored from [https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/]
* At UTC datetime: [2024-03-24 17:40:10.070531]
--><head><title>Index of /seismic-products/strong-motion/volume-products/2011/</title></head>
<body>
<h1>Index of /seismic-products/strong-motion/volume-products/2011/</h1><hr><pre><a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/">../</a>
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/01_Jan/">01_Jan/</a>                                            24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/02_Christchurch_mainshock_extended_pass_band/">02_Christchurch_mainshock_extended_pass_band/</a>      24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/02_Feb/">02_Feb/</a>                                            24-Mar-2024 17:19                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/03_Mar/">03_Mar/</a>                                            24-Mar-2024 17:20                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/04_Apr/">04_Apr/</a>                                            24-Mar-2024 17:26                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/05_May/">05_May/</a>                                            24-Mar-2024 17:05                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/06_Christchurch_13_June_extended%20pass%20band/">06_Christchurch_13_June_extended pass band/</a>        24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/06_Jun/">06_Jun/</a>                                            24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/07_Jul/">07_Jul/</a>                                            24-Mar-2024 17:26                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/08_Aug/">08_Aug/</a>                                            24-Mar-2024 17:29                   -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/09_Sep/">09_Sep/</a>                                            24-Mar-2024 17:29                   -
<a href="10_Oct/">10_Oct/</a>                                            24-Mar-2024 17:22                   -
<a href="11_Nov/">11_Nov/</a>                                            24-Mar-2024 17:29                   -
<a href="12_Dec/">12_Dec/</a>                                            24-Mar-2024 17:16                   -
</pre><hr></body>
</html>

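The last three links (10_Oct/, 11_Nov/ and 12_Dec/) are broken in this example: their href values were left as relative paths, while every link before them was rewritten to an absolute URL. As far as I can tell, many links end up broken this way on directory-listing sites like this one.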
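
One way to spot the affected links in a saved page is to list every `<a href>` that has no URL scheme. This is only a sketch using the standard library; the names `RelativeHrefFinder` and `find_relative_hrefs` are made up for illustration, and it assumes that a link the parser missed is exactly one that was left relative, as in the output above:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class RelativeHrefFinder(HTMLParser):
    """Collect href values on <a> tags that carry no URL scheme."""

    def __init__(self):
        super().__init__()
        self.relative = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # An empty scheme means the link was not rewritten to an absolute URL.
            if name == "href" and value and not urlsplit(value).scheme:
                self.relative.append(value)


def find_relative_hrefs(html_text):
    """Return the scheme-less hrefs found in html_text, in document order."""
    finder = RelativeHrefFinder()
    finder.feed(html_text)
    finder.close()
    return finder.relative
```

Run against the saved page above, this should report only the links that were skipped.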

TLCFEM commented 6 months ago

> there is another loop outside the while loop for cases like this which iterates any leftover tags.

And just to be clear, the problem is not caused by incomplete data that was never fully fed, so the iterator itself is fine.

If the chunk boundary falls outside the href value, parsing works fine:

    from lxml import etree

    parser = etree.HTMLPullParser()
    # Note the difference: the split now falls just before the opening quote of the href value.
    for data in (b'<root><a href=', b'"2011-03-13_135411/">2011-03-13_135411/</a></root>',):
        parser.feed(data)
        for _, elem in parser.read_events():
            print(elem.tag)  # a root
    parser.close()

rajatomar788 commented 6 months ago

Alright, then I will change it to feed everything at once.

TLCFEM commented 6 months ago

It turns out to be a bug in libxml2; see https://bugs.launchpad.net/lxml/+bug/2058828

Maybe check etree.LIBXML_VERSION, and provide the one-off alternative for versions < 2.11.
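
A rough sketch of such a check. The helper name `should_feed_once` is hypothetical, and the `(2, 11)` threshold is simply the version mentioned above, not something verified against the libxml2 changelog:

```python
def should_feed_once(libxml_version, fixed_in=(2, 11)):
    """Return True when the libxml2 version predates the assumed fix.

    `libxml_version` is a tuple of ints such as (2, 10, 3);
    only the major and minor parts are compared.
    """
    return tuple(libxml_version[:2]) < tuple(fixed_in)


try:
    from lxml import etree  # third-party; only needed for the live check
except ImportError:
    etree = None

if etree is not None:
    # etree.LIBXML_VERSION reports the libxml2 version in use at runtime.
    print("feed once:", should_feed_once(etree.LIBXML_VERSION))
```

The parser code could then fall back to a single `feed()` call when this returns True and keep incremental feeding otherwise.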

rajatomar788 commented 6 months ago

Will try to do it.