Open holta opened 1 year ago
I have seen it fail, but not hang, so the question is whether this can be reproduced.
The XML warning, which I also have not seen, is very strange because the document is parsed with the line
bs_content = BeautifulSoup(utf8_xlm, "lxml")
which is exactly what the warning says to do.
I think the Kiwix XML is in flux because they are adding a UTF-8 meta clause, which is still in nightlies.
I just ran it and got:
root@box:~# /usr/bin/iiab-get-kiwix-cat
Starting xml download from Kiwix
Reading Our Catalog
Parsing xml downloads from Kiwix
Starting of processing xml download from Kiwix to /etc/iiab/kiwix_catalog.json
Starting of processing our zim catalog into /etc/iiab/kiwix_catalog.json
Ready to write /etc/iiab/kiwix_catalog.json
Finished writing to /etc/iiab/kiwix_catalog.json
I ran it twice, once with
#bs_content = BeautifulSoup(r.content, "lxml")
utf8_xlm = r.content.decode("utf-8")
bs_content = BeautifulSoup(utf8_xlm, "lxml")
and once with
bs_content = BeautifulSoup(r.content, "lxml")
#utf8_xlm = r.content.decode("utf-8")
#bs_content = BeautifulSoup(utf8_xlm, "lxml")
Could lxml not be installed on your test system?
python3-lxml/jammy,now 4.8.0-1build1 amd64 [installed]
/var/log/apt/history.log shows python3-bs4 also installed python3-lxml, presumably during Admin Console's install:
Start-Date: 2023-04-30 20:13:43
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold install python3-bs4=4.11.2-2
Install: python3-lxml:amd64 (4.9.2-1+b1, automatic), python3-soupsieve:amd64 (2.3.2-1, automatic), python3-bs4:amd64 (4.11.2-2), python3-webencodings:amd64 (0.5.1-5, automatic), python3-html5lib:amd64 (1.1-3, automatic)
End-Date: 2023-04-30 20:13:46
As reconfirmed by:
root@box:~# apt list python3-bs4 python3-lxml
Listing... Done
python3-bs4/testing,now 4.11.2-2 all [installed]
python3-lxml/testing,now 4.9.2-1+b1 amd64 [installed]
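The same check can be made from inside Python, which rules out apt and the interpreter disagreeing about what is installed. This is only a minimal sketch using the standard library: the helper names are made up for illustration, and it assumes the distributions register themselves as "beautifulsoup4" and "lxml" (the usual names, but not verified on these systems).

```python
from importlib import metadata, util


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_importable(module_name):
    """Return True if the interpreter can locate the top-level module."""
    return util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Distribution name on the left, import name on the right (assumed names).
    for dist, module in (("beautifulsoup4", "bs4"), ("lxml", "lxml")):
        print(dist, installed_version(dist), "importable:", is_importable(module))
```

Run with the same python3 that iiab-get-kiwix-cat uses, otherwise the result may describe a different environment.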
Possibly this warning was introduced in the newer version(s) above?
Certainly the XML/HTML warning (very long line below) still appears FWIW:
root@box:~# iiab-get-kiwix-cat
Starting xml download from Kiwix
Reading Our Catalog
Parsing xml downloads from Kiwix
/usr/lib/python3/dist-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
warnings.warn(
Starting of processing xml download from Kiwix to /etc/iiab/kiwix_catalog.json
Starting of processing our zim catalog into /etc/iiab/kiwix_catalog.json
Ready to write /etc/iiab/kiwix_catalog.json
Finished writing to /etc/iiab/kiwix_catalog.json
SUCCESSroot@box:~#
What platform? I found a thread that mentions arm64: https://groups.google.com/g/beancount/c/axgz8LNrYbM?pli=1
what platform?
Debian 12 RC2+ on x86_64.
Not sure if the details at the top of this ticket help, but they are: http://sprunge.us/IiIYRG?en
I'll check back in June more or less.
Ubuntu 23.04 (released April 20th) has the identical issue.
Which is no surprise, as the two OSes (Ubuntu 23.04 and Debian 12) share nearly identical versions of many/most apt packages, including:
root@u23:~# apt list python3-bs4 python3-lxml
Listing... Done
python3-bs4/lunar,now 4.11.2-2 all [installed]
python3-lxml/lunar,now 4.9.2-1build1 amd64 [installed]
This might be the answer: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
BeautifulSoup(markup, "lxml") selects lxml's HTML parser, while BeautifulSoup(markup, "lxml-xml") selects lxml's XML parser.
But if I make that change, the Python code breaks.
I think the issue is that the XML parser preserves case, whereas the HTML parser lowercases tag names, so articlecount has to be articleCount.
Where does the lowercase articlecount appear? I'm not following the details, but just for the record:
root@box:~# wget https://library.kiwix.org/catalog/v2/entries?count=-1
root@box:~# grep -c articlecount entries\?count\=-1
0
root@box:~# grep -c articleCount entries\?count\=-1
4117
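The grep counts fit the case explanation: the feed itself only contains articleCount, so a lowercase articlecount would presumably exist only in the parsed tree that an HTML parser produces. The sketch below illustrates that behaviour with the standard library alone. It does not use BeautifulSoup; html.parser and xml.etree.ElementTree stand in for an HTML and an XML parser, and the sample markup and helper names are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up fragment; the real Kiwix feed is far larger and structured differently.
SAMPLE = "<entry><articleCount>42</articleCount></entry>"


class TagCollector(HTMLParser):
    """Record every start tag name exactly as the HTML parser reports it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tag_names(markup):
    """Tag names as seen by an HTML parser (lowercased)."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def xml_tag_names(markup):
    """Tag names as seen by an XML parser (case preserved)."""
    return [element.tag for element in ET.fromstring(markup).iter()]


if __name__ == "__main__":
    print(html_tag_names(SAMPLE))
    print(xml_tag_names(SAMPLE))
```

Beautiful Soup's HTML parsers lowercase tag names in the same way, so code written against "lxml" would look up articlecount, while the same lookup after switching to an XML parser would need articleCount.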
ASIDE: If intermittent ~5min freezes are perhaps (?) due to Internet slowness / hosting overload issues while trying to download the ~4.8MB https://library.kiwix.org/catalog/v2/entries?count=-1 — then presumably there's nothing we can do about that.
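Nothing client-side can make a slow server faster, but a timeout can at least turn silent dead air into a prompt, visible failure. The following is a minimal sketch under assumptions, not the script's actual code: fetch_catalog and its opener parameter are hypothetical names, and the 30-second value is arbitrary. Note that urlopen's timeout applies to individual blocking operations such as the connection attempt, not to the total download time.

```python
import urllib.request

CATALOG_URL = "https://library.kiwix.org/catalog/v2/entries?count=-1"


def fetch_catalog(url=CATALOG_URL, timeout=30.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request fails.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError, socket timeouts and TimeoutError are all OSError subclasses.
        print(f"Catalog download failed (timeout={timeout}s): {exc}")
        return None


if __name__ == "__main__":
    body = fetch_catalog()
    print("no data" if body is None else f"downloaded {len(body)} bytes")
```

If the script downloads with requests (the r.content lines earlier in the thread suggest it does), requests.get() accepts a timeout= argument that serves the same purpose.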
It also occurred to me that a package is compiled to .pyc the first time it is used.
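If first-use byte-compilation were a suspect, compiling ahead of time would rule it out. This is only an illustrative sketch with the standard library's py_compile; the precompile helper and the throwaway demo module are hypothetical, and nothing here measures whether compilation actually contributes to the delay.

```python
import py_compile
import tempfile
from pathlib import Path


def precompile(source_path):
    """Byte-compile one source file and return the path of the resulting .pyc."""
    # doraise=True raises py_compile.PyCompileError instead of printing on failure.
    return py_compile.compile(str(source_path), doraise=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "demo_module.py"
        source.write_text("VALUE = 1\n")
        print("compiled to:", precompile(source))
```

For a whole installed package, `python3 -m compileall <directory>` does the same job recursively.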
Reconfirming that @tim-moody's PR has now eliminated the XML/HTML warning message on Ubuntu 23.04.
Output is also confirmed to now be cleaner on Debian 12.
(1) Here's where it froze:
[It froze here for about 5 min.]
(2) Running iiab-get-kiwix-cat later worked, with the same ugly XML/HTML warning. The warning is presumably harmless, but it might be clarified or put in context so that "~5 min of dead air" delays aren't so confusing.
iiab-diagnostics: http://sprunge.us/IiIYRG?en
Possibly related: #3545