openzim / sotoki

StackExchange websites to ZIM scraper
https://library.kiwix.org/?category=stack_exchange
GNU General Public License v3.0

ERROR:b'UserId' is not in list #305

kelson42 closed 2 months ago

kelson42 commented 3 months ago

Unable to scrape 3dprinting: https://farm.openzim.org/recipes/3dprinting.stackexchange.com_en

[ThreadPoolExecutor-0_0::2024-04-08 21:21:41,161] INFO:Extracting 3dprinting.stackexchange.com.7z
[MainThread::2024-04-08 21:21:43,073] INFO:removed badges headers
[MainThread::2024-04-08 21:21:43,073] ERROR:FAILED. An error occurred: b'UserId' is not in list
[MainThread::2024-04-08 21:21:43,074] ERROR:b'UserId' is not in list
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/entrypoint.py", line 348, in main
    sys.exit(scraper.run())
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/scraper.py", line 164, in run
    ark_manager.check_and_prepare_dumps()
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/archives.py", line 160, in check_and_prepare_dumps
    merge_users_with_badges(workdir=self.build_dir, delete_src=self.delete_src)
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/utils/preparation.py", line 511, in merge_users_with_badges
    sort_dump_by_id(
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/utils/preparation.py", line 94, in sort_dump_by_id
    func(src=src, dst=dst, field_num=get_index_in(src, id_attr), delete_src=delete_src)
  File "/usr/local/lib/python3.8/site-packages/sotoki-2.1.0-py3.8.egg/sotoki/utils/preparation.py", line 51, in get_index_in
    return re.split(rb'\s([a-zA-Z]+)="', line).index(id_attr.encode(UTF8))
ValueError: b'UserId' is not in list
[MainThread::2024-04-08 21:21:43,075] DEBUG:Removing /3dprinting.stackexchange.com_7i2zdhz6

benoit74 commented 2 months ago

This issue in fact seems to affect all StackExchange sites. At least, all new tasks seem to be failing. I'm investigating.

benoit74 commented 2 months ago

It looks like the issue is linked to the fact that the XML dumps are stored in UTF-16-LE, while most of the code seems to expect UTF-8 files.
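For illustration, here is a minimal sketch of why a UTF-16-LE line would produce exactly this error. The split expression is the one shown in the traceback; the wrapper name `get_index_in_line` and the sample row are made up for this example and are not sotoki code.

```python
import re

UTF8 = "utf-8"


def get_index_in_line(line: bytes, id_attr: str) -> int:
    """Hypothetical wrapper around the expression seen in the traceback."""
    # Splits on ` name="` and returns the position of the attribute name.
    return re.split(rb'\s([a-zA-Z]+)="', line).index(id_attr.encode(UTF8))


# Made-up row, only meant to resemble a badges entry.
row = '  <row Id="1" UserId="2" Name="Autobiographer" />'

# UTF-8 bytes: the attribute name is found among the split parts.
utf8_index = get_index_in_line(row.encode("utf-8"), "UserId")

# UTF-16-LE bytes: every ASCII character is followed by a NUL byte, so the
# byte pattern never matches, the line is not split, and .index() raises.
try:
    get_index_in_line(row.encode("utf-16-le"), "UserId")
    utf16_error = None
except ValueError as exc:
    utf16_error = exc
```

With the UTF-8 bytes the lookup succeeds; with the UTF-16-LE bytes it raises `ValueError`, which matches the failure in the log above.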

@rgaudin does this ring a bell?

rgaudin commented 2 months ago

No; what's happening exactly? Does nothing get parsed at all?

benoit74 commented 2 months ago

Yup, nothing is parsed at all. Re-encoding lets it go a little further, but there are still many issues to fix. Obviously the SO dumper has been updated, and there are "maybe" too many magic values in sotoki ^^
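As a rough sketch of the re-encoding step mentioned above (not the fix that was actually applied), a dump file could be transcoded to UTF-8 before the existing byte-level code reads it. The function name `reencode_to_utf8` and its parameters are hypothetical. It assumes the source really is UTF-16-LE, and it does not touch any XML declaration that might name the encoding.

```python
import shutil
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, src_encoding: str = "utf-16-le") -> None:
    """Hypothetical helper: stream-transcode a text file to UTF-8."""
    # newline="" disables newline translation so line endings are preserved.
    with open(src, "r", encoding=src_encoding, newline="") as fin, open(
        dst, "w", encoding="utf-8", newline=""
    ) as fout:
        # The utf-16-le codec keeps a leading byte order mark as U+FEFF,
        # so drop it if present rather than copying it into the output.
        first = fin.read(1)
        if first != "\ufeff":
            fout.write(first)
        # Copy the remainder in chunks to avoid loading the whole dump.
        shutil.copyfileobj(fin, fout)
```

Usage would be along the lines of `reencode_to_utf8(Path("Badges.xml"), Path("Badges.utf8.xml"))`, where both file names are placeholders.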