openzim / sotoki

StackExchange websites to ZIM scraper
https://library.kiwix.org/?category=stack_exchange
GNU General Public License v3.0

Fix saxutils #299

Closed benoit74 closed 3 months ago

benoit74 commented 3 months ago

Fix #298

Changes:

benoit74 commented 3 months ago

@rgaudin do you have any idea why I had to make this change / how it worked before?

benoit74 commented 3 months ago

Tested locally and it works way better (inside Docker):

docker run -it --rm -v $(pwd):/data -p 8888:8888 --entrypoint /usr/local/bin/kiwix-serve ghcr.io/rgaudin/kiwix-tools:nightly --port=8888 "/data/output/beer_meta.zim"

rgaudin commented 3 months ago

@rgaudin do you have any idea why I had to make this change / how it worked before?

I have no idea. The debug logs for all the recipes are gone, so I started another task to get them.

[ThreadPoolExecutor-0_0::2024-03-27 11:02:32,911] INFO:Extracting 3dprinting.stackexchange.com.7z
[MainThread::2024-03-27 11:02:36,470] INFO:removed badges headers
[MainThread::2024-03-27 11:02:36,543] INFO:sorted Badges by UserId
[MainThread::2024-03-27 11:02:36,605] INFO:removed users headers
[MainThread::2024-03-27 11:02:36,797] INFO:merged both sets
[MainThread::2024-03-27 11:02:36,847] INFO:removed comments headers
[MainThread::2024-03-27 11:02:36,935] INFO:sorted Comments by UserId
[MainThread::2024-03-27 11:02:37,024] INFO:removed posts headers
[MainThread::2024-03-27 11:02:37,267] INFO:merged Posts and Comments
[MainThread::2024-03-27 11:02:37,429] INFO:split Posts-Comments by PostType
[MainThread::2024-03-27 11:02:37,495] INFO:Extracted Post IDs and titles into CSV
[MainThread::2024-03-27 11:02:37,695] INFO:sorted Posts-Comments (questions) by Id
[MainThread::2024-03-27 11:02:37,789] INFO:sorted Posts-Comments (answers) by ParentId
[MainThread::2024-03-27 11:02:37,792] INFO:removed postlinks headers
[MainThread::2024-03-27 11:02:37,805] INFO:sorted PostLinks by PostId
[MainThread::2024-03-27 11:02:37,828] INFO:sorted named post links by RelatedPostId
[MainThread::2024-03-27 11:02:38,024] INFO:Prepared dumps completed.

The line [MainThread::2024-03-27 11:02:37,495] INFO:Extracted Post IDs and titles into CSV indicates the process did not crash.
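To make that kind of check repeatable, here is a minimal sketch that parses log lines in the format shown above and reports whether a given message was reached. The names LOG_LINE, parse_line and reached are hypothetical helpers for illustration, not part of sotoki, and the pattern is inferred only from the sample lines in this thread.

```python
import re
from typing import Optional

# Pattern inferred from the log excerpt above (an assumption, not sotoki's
# documented format): [<thread>::<timestamp>] <LEVEL>:<message>
LOG_LINE = re.compile(
    r"^\[(?P<thread>[^:\]]+)::(?P<timestamp>[^\]]+)\] "
    r"(?P<level>[A-Z]+):(?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Split one log line into thread, timestamp, level and message."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def reached(log_text: str, message: str) -> bool:
    """Return True if any parsed line carries exactly this message."""
    for line in log_text.splitlines():
        entry = parse_line(line)
        if entry and entry["message"] == message:
            return True
    return False


# Two lines copied from the excerpt above.
SAMPLE = """\
[ThreadPoolExecutor-0_0::2024-03-27 11:02:32,911] INFO:Extracting 3dprinting.stackexchange.com.7z
[MainThread::2024-03-27 11:02:37,495] INFO:Extracted Post IDs and titles into CSV
"""

print(reached(SAMPLE, "Extracted Post IDs and titles into CSV"))
```

Seeing the message only shows that the step logging it completed; it says nothing about which import made that possible.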

When running this, a ton of stuff has already been imported. Is it possible that this module was imported by another one?
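A quick way to test that hypothesis, sketched under the assumption that the module in question is xml.sax.saxutils (suggested by the PR title, not confirmed in this thread): look in sys.modules before importing the submodule explicitly. The helper names submodule_loaded and load_explicitly are hypothetical.

```python
import importlib
import sys


def submodule_loaded(name: str) -> bool:
    """Return True if the dotted module name is already in sys.modules."""
    return name in sys.modules


def load_explicitly(name: str):
    """Import a submodule by its full dotted name and return it."""
    return importlib.import_module(name)


# Assumption: the module under discussion is xml.sax.saxutils.
# This value depends on what the process imported earlier, which is
# exactly the fragility being asked about.
loaded_as_side_effect = submodule_loaded("xml.sax.saxutils")
print("already loaded by something else:", loaded_as_side_effect)

# An explicit import works regardless of import order elsewhere.
saxutils = load_explicitly("xml.sax.saxutils")
escaped = saxutils.escape("a < b & c")
print(escaped)
```

If the first print shows True at the point where the code previously worked, some other import pulled the submodule in, and code relying on that would break as soon as that other import changes.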

Tested locally and it works way better (inside Docker):

Why are you sharing this kiwix-serve command? How is it related?

benoit74 commented 3 months ago

Why are you sharing this kiwix-serve command? How is it related?

Because, you know, sometimes, copy-paste is not that easy ^^

Proper test command:

sotoki --domain "beer.meta.stackexchange.com" --threads 20 --output /output/ --zim-file beer_meta.zim --mirror "https://org-kiwix-stackexchange.s3.us-west-1.wasabisys.com" --redis-url "unix:///var/run/redis.sock" --debug