Open TLCFEM opened 6 months ago
@TLCFEM there is another loop outside the while loop for cases like this which iterates any leftover tags. And the chunks being fed are large enough to avoid this problem.
Please try this link: https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/
The saved html:
<html>
<!--
* PyWebCopy Engine [version 7.0.2]
* Copyright 2020; Raja Tomar
* File mirrored from [https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/]
* At UTC datetime: [2024-03-24 17:40:10.070531]
--><head><title>Index of /seismic-products/strong-motion/volume-products/2011/</title></head>
<body>
<h1>Index of /seismic-products/strong-motion/volume-products/2011/</h1><hr><pre><a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/">../</a>
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/01_Jan/">01_Jan/</a> 24-Mar-2024 17:29 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/02_Christchurch_mainshock_extended_pass_band/">02_Christchurch_mainshock_extended_pass_band/</a> 24-Mar-2024 17:29 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/02_Feb/">02_Feb/</a> 24-Mar-2024 17:19 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/03_Mar/">03_Mar/</a> 24-Mar-2024 17:20 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/04_Apr/">04_Apr/</a> 24-Mar-2024 17:26 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/05_May/">05_May/</a> 24-Mar-2024 17:05 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/06_Christchurch_13_June_extended%20pass%20band/">06_Christchurch_13_June_extended pass band/</a> 24-Mar-2024 17:29 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/06_Jun/">06_Jun/</a> 24-Mar-2024 17:29 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/07_Jul/">07_Jul/</a> 24-Mar-2024 17:26 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/08_Aug/">08_Aug/</a> 24-Mar-2024 17:29 -
<a href="https://data.geonet.org.nz/seismic-products/strong-motion/volume-products/2011/09_Sep/">09_Sep/</a> 24-Mar-2024 17:29 -
<a href="10_Oct/">10_Oct/</a> 24-Mar-2024 17:22 -
<a href="11_Nov/">11_Nov/</a> 24-Mar-2024 17:29 -
<a href="12_Dec/">12_Dec/</a> 24-Mar-2024 17:16 -
</pre><hr></body>
</html>
The last three are broken in this example. As far as I can tell, many links are broken due to this issue for sites like this.
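To make "broken" concrete: in the saved page above, the first nine links were rewritten to absolute URLs, while the last three kept their original relative form. A minimal sketch of a check that lists such unrewritten links, using only the standard library (the names `HrefCollector` and `unrewritten_links` are hypothetical and not part of pywebcopy):

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def unrewritten_links(html_text):
    """Return hrefs that have no URL scheme, i.e. were left relative."""
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    return [h for h in collector.hrefs if not urlsplit(h).scheme]
```

Run against the saved HTML, this should report only the three relative entries, assuming the rest of the page parses as shown.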
there is another loop outside the while loop for cases like this which iterates any leftover tags.
And just to be clear, it is not caused by incomplete data that has not been fully fed. So the iterator itself is fine.
If the break is outside href, then it is working fine.
from lxml import etree
parser = etree.HTMLPullParser()
#                            | here please note the difference
for data in (b'<root><a href=', b'"2011-03-13_135411/">2011-03-13_135411/</a></root>',):
    parser.feed(data)
    for _, elem in parser.read_events():
        print(elem.tag)  # a root
parser.close()
Alright then, will change to feeding everything at once.
It turns out to be a bug in libxml, see this: https://bugs.launchpad.net/lxml/+bug/2058828
Maybe check etree.LIBXML_VERSION, and provide the one-off alternative for versions < 2.11.
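A minimal sketch of what that check could look like. The 2.11 threshold is taken from this thread and has not been verified independently. The helper name `iter_hrefs` and its `one_shot` parameter are hypothetical and not part of pywebcopy.

```python
from lxml import etree

# Threshold suggested in the discussion; treat it as an assumption.
MIN_SAFE_LIBXML = (2, 11)


def iter_hrefs(chunks, one_shot=None):
    """Yield the href of each <a> element found in the byte chunks.

    If one_shot is None, it is decided from the libxml2 version lxml
    is running against: older versions get all the data in one feed.
    """
    if one_shot is None:
        one_shot = etree.LIBXML_VERSION < MIN_SAFE_LIBXML
    if one_shot:
        # Join the chunks so no attribute is ever split across feeds.
        chunks = [b"".join(chunks)]

    parser = etree.HTMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for _, elem in parser.read_events():
            if elem.tag == "a":
                yield elem.get("href")
    parser.close()
    # Drain events that only become available after close().
    for _, elem in parser.read_events():
        if elem.tag == "a":
            yield elem.get("href")
```

Joining the chunks gives up incremental parsing on older libxml2 versions, but it keeps a single code path for reading events.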
Will try to do it.
https://github.com/rajatomar788/pywebcopy/blob/9f35b4b6a4da2125e70d8f7a21100de1f09012f4/pywebcopy/parsers.py#L104
Here, if it breaks between a href, nothing will be further parsed. See example:
Wrong:
Expected:
It may be better just to feed all at once.