iiab / iiab

(1) IIAB install froze for ~5min at "Parsing xml downloads from Kiwix" (2) Should we ask everyone to ignore the XML warning? #3561

Open holta opened 1 year ago

holta commented 1 year ago

(1) Here's where it froze:

Downloading Catalogs and Building Local Data Files.
Starting xml download from Kiwix
Reading Our Catalog
Parsing xml downloads from Kiwix
/usr/lib/python3/dist-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  warnings.warn(

[It froze here for ~5 min.]

/usr/bin/iiab-get-kiwix-cat: line 4: 76747 Killed $CMDSRV_SCRIPTS/get_kiwix_catalog -v
Getting the Kiwix Catalog Failed. Please run iiab-get-kiwix-cat again later.

(2) Running iiab-get-kiwix-cat again later worked, with the same ugly XML/HTML warning. The warning is presumably harmless, but it could be clarified or put in context, so that delays like this "~5 min of dead air" are less confusing.
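On question (2): the warning text itself says it can be ignored or filtered. If one did want to silence it rather than change the parser, a minimal sketch using only Python's standard `warnings` module could look like the following. The helper name `silence_xml_as_html_warning` is hypothetical, and matching on the message prefix (instead of importing the warning class from bs4) is an assumption made here to keep the example self-contained.

```python
import re
import warnings

# Start of the message emitted by bs4, copied from the log in this report.
XML_AS_HTML_PREFIX = (
    "It looks like you're parsing an XML document using an HTML parser"
)


def silence_xml_as_html_warning():
    """Ignore only warnings whose message starts with the bs4 text above.

    Hypothetical helper: the filter matches on the message prefix, so other
    warnings still get through.
    """
    warnings.filterwarnings("ignore", message=re.escape(XML_AS_HTML_PREFIX))


if __name__ == "__main__":
    silence_xml_as_html_warning()
    # Simulate the bs4 warning; nothing should be printed for it.
    warnings.warn(XML_AS_HTML_PREFIX + ". If this really is an HTML document...")
    # An unrelated warning is still shown.
    warnings.warn("some other warning")
```

Filtering hides the symptom only; it does not change which parser is used.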

iiab-diagnostics: http://sprunge.us/IiIYRG?en

Possibly related:

tim-moody commented 1 year ago

I have seen it fail, but not hang, so the question is can this be reproduced.

The warning about xml, which I also have not seen, is very strange because it is parsed with the line

bs_content = BeautifulSoup(utf8_xlm, "lxml")

which is exactly what the warning says to do.

I think the kiwix xml is in flux because they are adding a utf-8 meta clause which is still in nightlies.

I just ran it and got:

root@box:~# /usr/bin/iiab-get-kiwix-cat
Starting xml download from Kiwix
Reading Our Catalog
Parsing xml downloads from Kiwix
Starting of processing xml download from Kiwix to /etc/iiab/kiwix_catalog.json
Starting of processing our zim catalog into /etc/iiab/kiwix_catalog.json
Ready to write /etc/iiab/kiwix_catalog.json
Finished writing to /etc/iiab/kiwix_catalog.json

I ran it twice, once with

#bs_content = BeautifulSoup(r.content, "lxml")
utf8_xlm = r.content.decode("utf-8")
bs_content = BeautifulSoup(utf8_xlm, "lxml")

and once with

bs_content = BeautifulSoup(r.content, "lxml")
#utf8_xlm = r.content.decode("utf-8")
#bs_content = BeautifulSoup(utf8_xlm, "lxml")

tim-moody commented 1 year ago

Could lxml not be installed on your test system?

tim-moody commented 1 year ago

python3-lxml/jammy,now 4.8.0-1build1 amd64 [installed]

holta commented 1 year ago

/var/log/apt/history.log shows python3-bs4 also installed python3-lxml, presumably during Admin Console's install:

Start-Date: 2023-04-30 20:13:43
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold install python3-bs4=4.11.2-2
Install: python3-lxml:amd64 (4.9.2-1+b1, automatic), python3-soupsieve:amd64 (2.3.2-1, automatic), python3-bs4:amd64 (4.11.2-2), python3-webencodings:amd64 (0.5.1-5, automatic), python3-html5lib:amd64 (1.1-3, automatic)
End-Date: 2023-04-30 20:13:46

As reconfirmed by:

root@box:~# apt list python3-bs4 python3-lxml
Listing... Done
python3-bs4/testing,now 4.11.2-2 all [installed]
python3-lxml/testing,now 4.9.2-1+b1 amd64 [installed]

Possibly this warning was introduced in the newer version(s) above?

Certainly the XML/HTML warning (very long line below) still appears FWIW:

root@box:~# iiab-get-kiwix-cat
Starting xml download from Kiwix
Reading Our Catalog
Parsing xml downloads from Kiwix
/usr/lib/python3/dist-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  warnings.warn(
Starting of processing xml download from Kiwix to /etc/iiab/kiwix_catalog.json
Starting of processing our zim catalog into /etc/iiab/kiwix_catalog.json
Ready to write /etc/iiab/kiwix_catalog.json
Finished writing to /etc/iiab/kiwix_catalog.json
SUCCESSroot@box:~#

tim-moody commented 1 year ago

what platform? I found https://groups.google.com/g/beancount/c/axgz8LNrYbM?pli=1 that mentions arm64

holta commented 1 year ago

what platform?

Debian 12 RC2+ on x86_64.

Not sure if the details at the top of this ticket help, but they are: http://sprunge.us/IiIYRG?en

tim-moody commented 1 year ago

I'll check back in June more or less.

holta commented 1 year ago

Ubuntu 23.04 (released April 20th) has the identical issue.

Which is no surprise, as the two OSes (Ubuntu 23.04 and Debian 12) share nearly identical versions of most apt packages, including:

root@u23:~# apt list python3-bs4 python3-lxml
Listing... Done
python3-bs4/lunar,now 4.11.2-2 all [installed]
python3-lxml/lunar,now 4.9.2-1build1 amd64 [installed]

tim-moody commented 1 year ago

This might be the answer https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

BeautifulSoup(markup, "lxml") selects lxml's HTML parser, while "lxml-xml" (or "xml") selects lxml's XML parser.

But if I make that change, the Python code breaks.

tim-moody commented 1 year ago

I think the issue is that with xml case is enforced, so articlecount has to be articleCount.
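The case point can be illustrated without bs4 or lxml. The sketch below uses Python's standard-library `html.parser` and `xml.etree.ElementTree` as stand-ins for an HTML-mode and an XML-mode parser; that substitution is an assumption, and the `SAMPLE` markup is made up rather than taken from the Kiwix feed. HTML parsing lowercases tag names, while XML parsing preserves them, so code that looks up `articlecount` after an HTML parse has to look up `articleCount` after an XML parse.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Made-up snippet for illustration; not copied from the real Kiwix catalog.
SAMPLE = "<entry><articleCount>42</articleCount></entry>"


class TagCollector(HTMLParser):
    """Records the name of every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag names already converted to lower case.
        self.tags.append(tag)


def html_tag_names(markup):
    """Tag names as seen by an HTML parser (case is folded)."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def xml_tag_names(markup):
    """Tag names as seen by an XML parser (case is preserved)."""
    return [element.tag for element in ET.fromstring(markup).iter()]


if __name__ == "__main__":
    print(html_tag_names(SAMPLE))
    print(xml_tag_names(SAMPLE))
```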

holta commented 1 year ago

I think the issue is that with xml case is enforced, so articlecount has to be articleCount.

Where does the lowercase articlecount appear? I'm not following the details, but just for the record:

root@box:~# wget https://library.kiwix.org/catalog/v2/entries?count=-1
root@box:~# grep -c articlecount entries\?count\=-1
0
root@box:~# grep -c articleCount entries\?count\=-1
4117

ASIDE: If intermittent ~5min freezes are perhaps (?) due to Internet slowness / hosting overload issues while trying to download the ~4.8MB https://library.kiwix.org/catalog/v2/entries?count=-1 — then presumably there's nothing we can do about that.
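If slow downloads were the cause, one partial mitigation would be to fail fast and retry rather than wait silently. The following is only a sketch under that assumption: `download` and `fetch_with_retries` are hypothetical helpers, not functions from iiab-admin-console, and the timeout, attempt count and delay are arbitrary. Note that the `timeout` passed to `urlopen` applies to individual blocking socket operations, not to the total download time.

```python
import time
import urllib.request

CATALOG_URL = "https://library.kiwix.org/catalog/v2/entries?count=-1"


def download(url, timeout=60):
    """Fetch a URL and return the body as bytes (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retries(fetch, attempts=3, delay=5.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times, pausing `delay` seconds between tries.

    Network failures from urllib (URLError, socket timeouts) are OSError
    subclasses, so that is what gets retried; the last error is re-raised
    if every attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as exc:
            last_error = exc
            print(f"Attempt {attempt} of {attempts} failed: {exc}")
            if attempt < attempts:
                sleep(delay)
    raise last_error


if __name__ == "__main__":
    data = fetch_with_retries(lambda: download(CATALOG_URL))
    print(f"Downloaded {len(data)} bytes")
```

Printing a line per failed attempt also addresses the "dead air" concern, since the user sees that something is still happening.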

tim-moody commented 1 year ago

It also occurred to me that a package gets compiled to .pyc the first time it is used.

tim-moody commented 1 year ago

https://github.com/iiab/iiab-admin-console/pull/540

holta commented 1 year ago

Here's reconfirming that @tim-moody's PR has now eliminated the XML/HTML warning message on Ubuntu 23.04:

holta commented 1 year ago

@tim-moody's PR has now eliminated the XML/HTML warning message on Ubuntu 23.04

Output is also confirmed to now be cleaner on Debian 12.