openzim / sotoki

StackExchange websites to ZIM scraper
https://library.kiwix.org/?category=stack_exchange
GNU General Public License v3.0

ZIM creation fails with KeyError related to tag ids #293

Closed IMayBeABitShy closed 6 months ago

IMayBeABitShy commented 7 months ago

For roughly two weeks I've been getting a KeyError related to tag ids when trying to build ZIM files. I've been waiting to see if the build also fails on the zimfarm, but so far no scheduled sotoki run has occurred since then.

Traceback:

Traceback (most recent call last):
  File "sotoki/scraper.py", line 236, in start
    self.process_tags()
  File "sotoki/scraper.py", line 343, in process_tags
    TagGenerator().run()
  File "sotoki/tags.py", line 99, in run
    page_content = self.renderer.get_tag_for_page(tag_name, page)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "sotoki/renderer.py", line 186, in get_tag_for_page
    return self.env.get_template("tag.html").render(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "jinja2/environment.py", line 1301, in render
    self.environment.handle_exception()
  File "jinja2/environment.py", line 936, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "sotoki/templates/tag.html", line 1, in top-level template code
    {% extends "base.html" %}
  File "sotoki/templates/base.html", line 97, in top-level template code
    {% block content %}
  File "sotoki/templates/tag.html", line 19, in block 'content'
    {% for question in questions %}
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "sotoki/renderer.py", line 57, in extend_questions
    yield Global.database.get_question_details(post_id=post_id, score=score)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "sotoki/utils/database/posts.py", line 160, in get_question_details
    item["tags"] = [self.get_tag_name_for(t) for t in item["tags"]]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "sotoki/utils/database/posts.py", line 160, in <listcomp>
    item["tags"] = [self.get_tag_name_for(t) for t in item["tags"]]
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "sotoki/utils/database/tags.py", line 122, in get_tag_name_for
    return self.tags_ids[tag_id]
           ~~~~~~~~~~~~~^^^^^^^^
  File "bidict/_base.py", line 523, in __getitem__
    return self._fwdm[key]
           ~~~~~~~~~~^^^^^
KeyError: 443

The error occurs relatively late during the build:

[MainThread::2024-02-22 16:03:37,865] INFO:PROGRESS: 99.7% – Step 6/7: Tags -- 0/237 -- Images: 722
[MainThread::2024-02-22 16:03:37,873] ERROR:Interrupting process due to error: 443
[MainThread::2024-02-22 16:03:37,873] ERROR:443
Traceback (most recent call last):
<see above>

Command used:

python3 -m sotoki --debug --domain="sustainability.stackexchange.com" --output="./output" --threads="28" --keep-redis --stats-filename="./output/task_progress.json" --publisher="openZIM" --tmp-dir tmpdir

During debugging, I've:

The ZIM creation worked properly until quite recently. I believe this bug may be caused by a change in the Stack Exchange dumps, perhaps a deleted but still referenced tag?
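The hypothesis above (a tag id that questions still reference but that has no entry in the id-to-name mapping) can be checked with a small diagnostic. This is a minimal sketch in plain Python, not sotoki code: the function name `find_dangling_tag_ids`, the two dictionaries, and the sample data are all hypothetical stand-ins for whatever the real database holds.

```python
def find_dangling_tag_ids(questions, tag_names_by_id):
    """Map each unknown tag id to the ids of the questions that reference it.

    Both arguments are hypothetical plain dicts, not sotoki's real structures:
    questions maps question id -> list of tag ids,
    tag_names_by_id maps tag id -> tag name.
    """
    dangling = {}
    for question_id, tag_ids in questions.items():
        for tag_id in tag_ids:
            if tag_id not in tag_names_by_id:
                dangling.setdefault(tag_id, []).append(question_id)
    return dangling


# Made-up sample data: question 7 references tag id 443, which has no name.
questions = {5: [1, 2], 7: [2, 443]}
tag_names_by_id = {1: "recycling", 2: "energy"}

print(find_dangling_tag_ids(questions, tag_names_by_id))
```

An empty result would mean every referenced tag id resolves, which points away from the dump itself and toward the local state.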

We may want to confirm and fix this before the next wave of sotoki builds starts, or a lot of resources may be wasted.

benoit74 commented 7 months ago

Thanks a lot for everything: great attention to detail, a very precise bug report, and nice suggestions.

I will look into it tomorrow, but I will probably just disable all sotoki recipes for now; there is no need to waste resources. Your bug report and investigation seem quite clear to me, and unfortunately I do not expect any stackoverflow domains to still work.

benoit74 commented 7 months ago

I've started two small domains as well: sustainability.stackexchange.com (your suggestion) and tezos.stackexchange.com (my pick, even smaller).

I ran them with the docker image we use in production:

docker run -v $(pwd):/output --name sotoki_sustainability.stackexchange.com_en --detach --rm ghcr.io/openzim/sotoki:2.0.2 sotoki --debug --domain="sustainability.stackexchange.com" --mirror="https://org-kiwix-stackexchange.s3.us-west-1.wasabisys.com" --output="/output" --threads="8" --redis-url="unix:///var/run/redis.sock" --stats-filename="/output/task_progress.json" --keep-redis --publisher="openZIM"
docker run -v $(pwd):/output --name sotoki_tezos.stackexchange.com_en --detach --rm ghcr.io/openzim/sotoki:2.0.2 sotoki --debug --domain="tezos.stackexchange.com" --mirror="https://org-kiwix-stackexchange.s3.us-west-1.wasabisys.com" --output="/output" --threads="8" --redis-url="unix:///var/run/redis.sock" --stats-filename="/output/task_progress.json" --keep-redis --publisher="openZIM"

Both succeeded and produced a working ZIM (I didn't run many tests, but at least they open and you can browse them), so the Zimfarm seems safe to continue. No hurry, that's good news. I also now doubt that something changed in the Stack Exchange dumps, since the tests worked.

I wanted to run tezos.stackexchange.com with the dev docker image to confirm whether the issue is linked to recent changes on the main branch, but the image fails to start (see https://github.com/openzim/sotoki/issues/294 if needed).

Could you please try to use the code from the 2.0.2 tag to confirm what is happening?

I suspect two possibilities:

IMayBeABitShy commented 7 months ago

OK, I found the problem: it was a stupid mistake on my side, caused by my inexperience with Redis. Put simply, I forgot to empty the Redis server between runs. After manually issuing a flushall, the problem no longer occurs.

I apologize for wasting your time.
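For anyone hitting the same KeyError, the fix described above (emptying Redis between runs) can also be scripted. This is a minimal sketch, assuming the redis-py package is installed and a Redis server is reachable; the helper names `reset_redis` and `connect` and the default URL are illustrative assumptions, not part of sotoki.

```python
def reset_redis(client):
    """Empty the Redis server behind `client` and report what was there.

    Returns the number of keys found in the currently selected database
    before the flush. Note that FLUSHALL removes keys from every database
    on the server, not only the selected one.
    """
    stale_keys = client.dbsize()
    client.flushall()
    return stale_keys


def connect(url="redis://localhost:6379/0"):
    """Build a redis-py client. The default URL is an assumption; adjust it."""
    import redis  # assumption: the redis-py package is installed

    return redis.Redis.from_url(url)
```

Calling `reset_redis(connect())` before a new build has the same effect as issuing `flushall` by hand. It wipes everything on that server, so it should only be pointed at a Redis instance dedicated to the scraper.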

benoit74 commented 7 months ago

No worries, it is not as if you did not investigate at all before raising this issue, and it was quite easy (and useful) to confirm that everything is fine.