openzim / sotoki

StackExchange websites to ZIM scraper
https://library.kiwix.org/?category=stack_exchange
GNU General Public License v3.0

Invalid characters found in XML #311

Closed benoit74 closed 4 months ago

benoit74 commented 5 months ago

At least the apple.stackexchange.com domain, and probably others, seems to have invalid characters in its XML dumps.

One sample SO answer causing the issue (as originally present in the Posts.xml found in the 7z archive, i.e. without any re-encoding, merging of answers with posts, ...):

<row Id="208513" PostTypeId="2" ParentId="1621" CreationDate="2015-10-01T21:06:37.420" Score="3" Body="&lt;p&gt;Safari 9.0, released today for 10.9 Mavericks and 10.10 Yosemite, along side with the 10.11 El Capitan release, has introduced this natively. &lt;/p&gt;&#x0A;&#x0A;&lt;p&gt;&lt;a href=&quot;https://i.stack.imgur.com/iUF2I.png&quot; rel=&quot;nofollow noreferrer&quot;&gt;&lt;img src=&quot;https://i.stack.imgur.com/iUF2I.png&quot; alt=&quot;enter image description here&quot;&gt;&lt;/a&gt;&lt;/p&gt;&#x0A;&#x0A;&lt;p&gt;There are two options. If checked, &#x0E;⌘-1 through ⌘-9 will switch to tab 1 through 9, respectfully. This will change the original behavior of mapping to the favorites bar bookmark 1 through 9. Those will be changed to Command-Option-1 through Command-Option-9.&lt;/p&gt;&#x0A;&#x0A;&lt;p&gt;&lt;strong&gt;If unchecked, while it does not say it&lt;/strong&gt;, the behavior is reversed. Command-Number will still be the favorites bar bookmark, while Command-Option-Number will be the tab. This flipping of behavior matches the Command-Click and Command-Option click behavior for opening a link.&lt;/p&gt;&#x0A;&#x0A;&lt;p&gt;It's undocumented.&lt;/p&gt;&#x0A;" OwnerUserId="79973" LastActivityDate="2015-10-01T21:06:37.420" CommentCount="1" ContentLicense="CC BY-SA 3.0"/>

The invalid character seems to be &#x0E;, which is supposed to be a Shift Out control character.

A reference to it seems to be valid only in XML 1.1.
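
For illustration, here is a minimal Python sketch of the problem and of one possible workaround. It is not what sotoki does: it only shows that the expat-based parser in the standard library rejects a reference such as &#x0E; in an XML 1.0 document, and that removing the forbidden references before parsing lets the row through. The helper names (`is_xml10_char`, `strip_invalid_char_refs`) are hypothetical, and the sample row is an abbreviated version of the one above.

```python
import re
import xml.etree.ElementTree as ET

# Numeric character references, decimal (&#14;) or hexadecimal (&#x0E;).
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 `Char` production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_char_refs(text: str) -> str:
    """Drop numeric character references that XML 1.0 forbids (hypothetical helper)."""

    def _replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep valid references untouched, remove the others entirely.
        return match.group(0) if is_xml10_char(cp) else ""

    return CHAR_REF.sub(_replace, text)


# Abbreviated sample row, keeping the offending &#x0E; reference.
sample = '<row Id="208513" Body="If checked, &#x0E;⌘-1 through ⌘-9&#x0A;"/>'

try:
    ET.fromstring(sample)
    parsed_raw = True
except ET.ParseError:
    # expat refuses the reference to U+000E in an XML 1.0 document.
    parsed_raw = False

row = ET.fromstring(strip_invalid_char_refs(sample))
```

Dropping the character silently is a design choice made for this sketch; replacing it with a space or U+FFFD would work the same way.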

benoit74 commented 5 months ago

See https://www.w3.org/TR/2006/REC-xml11-20060816/ for details

rgaudin commented 5 months ago

Is there a version attr on the files? Maybe we could add it or modify it prior to parsing

benoit74 commented 5 months ago

There is none, but adding one with version 1.1 unfortunately does not help sax ... I tried with lxml without much success either.

benoit74 commented 5 months ago

I will release 2.1.1 as-is, since it might help anyway, and this could even be considered an upstream bug.

https://farm.openzim.org/pipeline/a6ff9d01-b7ba-463c-a399-243cecbe2c7f worked just fine for instance.

benoit74 commented 5 months ago

Looks like the files produced by SO were corrupted and of course we didn't grab the updated version.

When I manually download the 7z archive from https://archive.org/download/stackexchange/apple.stackexchange.com.7z (on the https://archive.org/download/stackexchange/ page), dated 06-Apr-2024 22:25, I do not have the invalid characters on post ID 208513, and other chars are escaped differently. And the file is again encoded as UTF-8.
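
A check like the one described here could be scripted roughly as follows. This is only a sketch under assumptions: `scan_dump` is a hypothetical helper, it expects the path of a Posts.xml already extracted from the 7z archive, and it only looks at numeric references to control characters, not at every possible well-formedness problem.

```python
import re

# Numeric character references in the raw bytes, decimal or hexadecimal.
CHAR_REF = re.compile(rb"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

# The only control characters below U+0020 that XML 1.0 accepts.
ALLOWED_CONTROLS = {0x9, 0xA, 0xD}


def scan_dump(path):
    """Return (is_utf8, hits) for an extracted dump file (hypothetical helper).

    is_utf8 is False as soon as one line does not decode as UTF-8.
    hits lists (line_number, reference) pairs for references to control
    characters that XML 1.0 forbids, such as &#x0E;.
    """
    is_utf8 = True
    hits = []
    with open(path, "rb") as fh:
        # Reading line by line keeps memory usage flat on large dumps.
        for line_number, line in enumerate(fh, start=1):
            try:
                line.decode("utf-8")
            except UnicodeDecodeError:
                is_utf8 = False
            for match in CHAR_REF.finditer(line):
                hex_digits, dec_digits = match.groups()
                cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
                if cp < 0x20 and cp not in ALLOWED_CONTROLS:
                    hits.append((line_number, match.group(0).decode("ascii")))
    return is_utf8, hits
```

Calling `scan_dump("Posts.xml")` on the old and on the re-downloaded file would then give a quick way to compare them: a clean UTF-8 dump should yield `True` and an empty list.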

I've renamed apple.stackexchange.com.7z to apple.stackexchange.com.7z.old on wasabi, and apple.meta.stackexchange.com.7z to apple.meta.stackexchange.com.7z.old.

I've restarted the watcher to redownload these archives. As expected they are different.

rgaudin commented 5 months ago

Ah, interesting. I think we already had this case once, where they uploaded crap and fixed it afterwards, but we were still using our copy of the bad version.

benoit74 commented 5 months ago

With the new archive, the task failed with sotoki 2.1.1: https://farm.openzim.org/pipeline/e3eb4040-c7ce-45e4-969f-eb4252e22a14

But it seems to be totally ok with 2.1.0 (still not finished ATM, but progressing normally) : https://farm.openzim.org/pipeline/ca1a0dda-04ba-453f-a080-c8efe66fe8c0

So far I have failed to find information on SE meta about whether we should simply drop all changes of 2.1.1, or whether the corresponding changes have just been rolled back at SE level but will come back soon.

I hope I did not lose 1 day of development for nothing ...

benoit74 commented 4 months ago

See #313, this is not an expected change.