Closed benoit74 closed 4 months ago
See https://www.w3.org/TR/2006/REC-xml11-20060816/ for details
Is there a version attr on the files? Maybe we could add it or modify it prior to parsing
There is none, but adding one with version 1.1 unfortunately does not help sax. I tried with lxml without much success either.
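A minimal sketch of the behaviour described here, assuming the offending content is a character reference to U+000E (Shift Out). The one-row sample and the `try_parse` helper are hypothetical and not taken from the dump. Python's `xml.sax` is backed by expat, which implements XML 1.0, so declaring version 1.1 is not expected to make such a document parse:

```python
import xml.sax


def try_parse(document: bytes):
    """Return None if xml.sax parses the document, else the error message."""
    try:
        xml.sax.parseString(document, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return str(exc)
    return None


# Hypothetical one-row sample; &#xE; is a reference to the Shift Out character.
BODY = b'<posts><row Id="1" Body="before &#xE; after" /></posts>'

results = {
    "no declaration": try_parse(BODY),
    "version 1.0": try_parse(b'<?xml version="1.0" encoding="utf-8"?>' + BODY),
    "version 1.1": try_parse(b'<?xml version="1.1" encoding="utf-8"?>' + BODY),
}

for label, error in results.items():
    status = "parsed" if error is None else "rejected: " + error
    print(f"{label}: {status}")
```

The exact error message depends on the expat version bundled with Python, so it is printed rather than assumed.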
I will release 2.1.1 as-is since it might help anyway, and this could even be considered an upstream bug.
https://farm.openzim.org/pipeline/a6ff9d01-b7ba-463c-a399-243cecbe2c7f worked just fine for instance.
It looks like the files produced by SO were corrupted, and we of course didn't grab the updated version.
When I manually download the 7z archive from https://archive.org/download/stackexchange/apple.stackexchange.com.7z (on the https://archive.org/download/stackexchange/ page), dated 06-Apr-2024 22:25, I do not have the invalid characters in post ID 208513, and other characters are escaped differently. The file is again encoded as UTF-8.
I've renamed apple.stackexchange.com.7z to apple.stackexchange.com.7z.old on wasabi, and apple.meta.stackexchange.com.7z to apple.meta.stackexchange.com.7z.old.
I've restarted the watcher to redownload these archives. As expected they are different.
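One way to confirm that two copies of an archive differ is to compare checksums. This is only a sketch, assuming both files are available locally; `sha256_of` and `archives_differ` are hypothetical helpers, not part of the watcher:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large archives are not loaded in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archives_differ(old_path, new_path):
    """Return True when the two files do not have the same SHA-256 digest."""
    return sha256_of(old_path) != sha256_of(new_path)


# Example call, assuming local copies with these names exist:
# archives_differ("apple.stackexchange.com.7z.old", "apple.stackexchange.com.7z")
```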
Ah, interesting. I think we already had this case once, where they uploaded crap and fixed it afterwards, but we were still using our copy of the bad version.
With the new archive, the task failed with sotoki 2.1.1: https://farm.openzim.org/pipeline/e3eb4040-c7ce-45e4-969f-eb4252e22a14
But it seems to be totally OK with 2.1.0 (not finished yet, but progressing normally): https://farm.openzim.org/pipeline/ca1a0dda-04ba-453f-a080-c8efe66fe8c0
So far I have failed to find information on SE meta about whether we should simply drop all the changes of 2.1.1, or whether the corresponding changes have just been rolled back at SE level but will come back soon.
I hope I did not lose 1 day of development for nothing ...
See #313, this is not an expected change.
At least the `apple.stackexchange.com` domain, but probably others, seems to have invalid characters in its XML dumps.

One sample SO answer causes the issue (as originally presented in the `Posts.xml` found in the 7z archive, i.e. without any reencoding, merging of answers with posts ...).

The invalid character seems to be what is supposed to be a Shift Out control character (U+000E).

It seems to be valid only in XML 1.1.
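A minimal sketch of how such characters could be located, or dropped before parsing, assuming the dump is read as text. The helper names and the sample row are hypothetical, and silently dropping characters is an assumption about what would be acceptable, not what sotoki does. The character class is the complement of the `Char` production of XML 1.0:

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML10_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)
# Numeric character references, hexadecimal (&#xE;) or decimal (&#14;).
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_valid_xml10_codepoint(cp):
    """Return True if the code point is allowed by XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def _ref_codepoint(match):
    """Code point designated by a CHAR_REF match."""
    hex_digits, dec_digits = match.groups()
    return int(hex_digits, 16) if hex_digits else int(dec_digits)


def find_invalid_xml10(text):
    """Return sorted (offset, code point) pairs for raw characters and
    character references that XML 1.0 does not allow."""
    hits = [(m.start(), ord(m.group())) for m in INVALID_XML10_CHAR.finditer(text)]
    for m in CHAR_REF.finditer(text):
        cp = _ref_codepoint(m)
        if not is_valid_xml10_codepoint(cp):
            hits.append((m.start(), cp))
    return sorted(hits)


def strip_invalid_xml10(text):
    """Drop raw characters and character references invalid in XML 1.0."""
    text = INVALID_XML10_CHAR.sub("", text)
    return CHAR_REF.sub(
        lambda m: m.group(0) if is_valid_xml10_codepoint(_ref_codepoint(m)) else "",
        text,
    )


# Hypothetical row: one escaped and one raw Shift Out, plus a valid reference.
sample = '<row Id="1" Body="a &#xE; b \x0e c &#10; d" />'
for offset, cp in find_invalid_xml10(sample):
    print(f"offset {offset}: U+{cp:04X}")
```

Reporting offsets first, rather than stripping right away, keeps it possible to check which posts are affected before deciding how to handle them.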