mediawiki-client-tools / mediawiki-dump-generator

Python 3 tools for downloading and preserving wikis
https://github.com/mediawiki-client-tools/mediawiki-scraper
GNU General Public License v3.0

TypeError: cannot use a string pattern on a bytes-like object #29

Closed yzqzss closed 1 year ago

yzqzss commented 2 years ago

Python 3.10.6 Linux Mint

```
$ dumpgenerator http://wiki.othing.xyz --xml
Checking API... https://wiki.othing.xyz/api.php
API is OK: https://wiki.othing.xyz/api.php
Checking index.php... https://wiki.othing.xyz/index.php
index.php is OK
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./wikiothingxyz-20221023-wikidump
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3) #
# More info at: https://github.com/elsiehupp/wikiteam3 #
#########################################################################

#########################################################################
# Copyright (C) 2011-2022 WikiTeam developers #
# This program is free software: you can redistribute it and/or modify #
# it under the terms of the GNU General Public License as published by #
# the Free Software Foundation, either version 3 of the License, or #
# (at your option) any later version. #
# #
# This program is distributed in the hope that it will be useful, #
# but WITHOUT ANY WARRANTY; without even the implied warranty of #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the #
# GNU General Public License for more details. #
# #
# You should have received a copy of the GNU General Public License #
# along with this program. If not, see . #
#########################################################################

Analysing https://wiki.othing.xyz/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Sleeping... 0.50 seconds...
22 namespaces found
Retrieving titles in the namespace 0
22 titles retrieved in the namespace 0
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 1
0 titles retrieved in the namespace 1
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 2
1 titles retrieved in the namespace 2
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 3
0 titles retrieved in the namespace 3
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 4
0 titles retrieved in the namespace 4
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 5
0 titles retrieved in the namespace 5
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 6
9 titles retrieved in the namespace 6
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 7
0 titles retrieved in the namespace 7
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 8
2 titles retrieved in the namespace 8
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 9
0 titles retrieved in the namespace 9
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 10
51 titles retrieved in the namespace 10
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 11
0 titles retrieved in the namespace 11
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 12
0 titles retrieved in the namespace 12
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 13
0 titles retrieved in the namespace 13
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 14
14 titles retrieved in the namespace 14
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 15
0 titles retrieved in the namespace 15
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 828
40 titles retrieved in the namespace 828
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 829
0 titles retrieved in the namespace 829
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 2300
0 titles retrieved in the namespace 2300
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 2301
0 titles retrieved in the namespace 2301
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 2302
0 titles retrieved in the namespace 2302
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 2303
0 titles retrieved in the namespace 2303
Sleeping... 0.50 seconds...
Titles saved at... wikiothingxyz-20221023-titles.txt
139 page titles loaded
https://wiki.othing.xyz/api.php
Retrieving the XML for every page from "start"
Sleeping... 0.50 seconds...
/usr/lib/python3/dist-packages/apport/report.py:13: DeprecationWarning: the imp module is deprecated in favour of importlib and slated for removal in Python 3.12; see the module's documentation for alternative uses
  import fnmatch, glob, traceback, errno, sys, atexit, locale, imp, stat
Traceback (most recent call last):
  File "/home/yzqzss/.local/bin/dumpgenerator", line 8, in <module>
    sys.exit(main())
  File "/home/yzqzss/.local/lib/python3.10/site-packages/wikiteam3/dumpgenerator/__init__.py", line 26, in main
    DumpGenerator()
  File "/home/yzqzss/.local/lib/python3.10/site-packages/wikiteam3/dumpgenerator/generator.py", line 87, in __init__
    DumpGenerator.createNewDump(config=config, other=other)
  File "/home/yzqzss/.local/lib/python3.10/site-packages/wikiteam3/dumpgenerator/generator.py", line 100, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other["session"])
  File "/home/yzqzss/.local/lib/python3.10/site-packages/wikiteam3/dumpgenerator/xml_dump.py", line 92, in generateXMLDump
    xml = cleanXML(xml=xml)
  File "/home/yzqzss/.local/lib/python3.10/site-packages/wikiteam3/dumpgenerator/util.py", line 73, in cleanXML
    if re.search(r"", xml):
  File "/usr/lib/python3.10/re.py", line 200, in search
    return _compile(pattern, flags).search(string)
TypeError: cannot use a string pattern on a bytes-like object
```
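
For anyone hitting the same traceback: `re.search` raises this `TypeError` whenever a `str` pattern is applied to a `bytes` object, which suggests `cleanXML` received undecoded response bytes. Below is a minimal sketch of the failure and one possible guard. The pattern `</page>`, the helper name `contains_marker`, and the UTF-8 decoding are assumptions for illustration; the real pattern in `util.py` is not visible in the log above, and this is not necessarily how the project fixed it.

```python
import re

# Hypothetical stand-in pattern; the actual pattern used by cleanXML
# does not appear in the traceback.
PATTERN = r"</page>"


def contains_marker(xml):
    """Return True if PATTERN occurs in xml, accepting either str or bytes."""
    if isinstance(xml, bytes):
        # Decode first so the str pattern can be applied.
        # UTF-8 is an assumption about the wiki's export encoding.
        xml = xml.decode("utf-8", errors="replace")
    return re.search(PATTERN, xml) is not None


raw = b"<mediawiki><page><title>Example</title></page></mediawiki>"

try:
    # A str pattern applied to bytes input reproduces the error.
    re.search(PATTERN, raw)
except TypeError as exc:
    print(f"TypeError: {exc}")

# The guarded helper handles the same bytes without raising.
print(contains_marker(raw))
```

An alternative would be to decode once at the point where the HTTP response is read, so that every later function only ever sees `str`.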
Dss0 commented 2 years ago

Bumping: same issue here on Windows with Python 3.8.10.

RedSparr0w commented 1 year ago

Same issue on the following OSes with a Miraheze wiki:

Python 3.10.8 - Alpine 3.17
Python 3.9.6 - Debian 11

RedSparr0w commented 1 year ago

Note: for me, the error only happens when not using `--xmlrevisions`.

elsiehupp commented 1 year ago

I'm getting this again (on https://github.com/mediawiki-client-tools/mediawiki-scraper/commit/a0aa9b301ad9ec3c15d4e9ee0e993c12347d44ac):

```bash
% dumpgenerator --failfast --xml https://elinux.org
Checking API... https://elinux.org/api.php
API is OK: https://elinux.org/api.php
Checking index.php... https://elinux.org/index.php
index.php is OK
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./elinuxorg-20230217-wikidump
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3) #
# More info at: https://github.com/elsiehupp/wikiteam3 #
#########################################################################

#########################################################################
# Copyright (C) 2011-2023 WikiTeam developers #
# This program is free software: you can redistribute it and/or modify #
# it under the terms of the GNU General Public License as published by #
# the Free Software Foundation, either version 3 of the License, or #
# (at your option) any later version. #
# #
# This program is distributed in the hope that it will be useful, #
# but WITHOUT ANY WARRANTY; without even the implied warranty of #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the #
# GNU General Public License for more details. #
# #
# You should have received a copy of the GNU General Public License #
# along with this program. If not, see . #
#########################################################################

Analysing https://elinux.org/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Sleeping... 0.50 seconds...
32 namespaces found
Retrieving titles in the namespace 0
4957 titles retrieved in the namespace 0
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 1
346 titles retrieved in the namespace 1
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 2
6829 titles retrieved in the namespace 2
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 3
6394 titles retrieved in the namespace 3
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 4
17 titles retrieved in the namespace 4
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 5
1 titles retrieved in the namespace 5
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 6
8589 titles retrieved in the namespace 6
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 7
3 titles retrieved in the namespace 7
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 8
10 titles retrieved in the namespace 8
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 9
1 titles retrieved in the namespace 9
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 10
468 titles retrieved in the namespace 10
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 11
7 titles retrieved in the namespace 11
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 12
65 titles retrieved in the namespace 12
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 13
3 titles retrieved in the namespace 13
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 14
313 titles retrieved in the namespace 14
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 15
3 titles retrieved in the namespace 15
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 828
23 titles retrieved in the namespace 828
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 829
0 titles retrieved in the namespace 829
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 90
722 titles retrieved in the namespace 90
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 91
0 titles retrieved in the namespace 91
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 92
1 titles retrieved in the namespace 92
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 93
0 titles retrieved in the namespace 93
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 502
45 titles retrieved in the namespace 502
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 503
3 titles retrieved in the namespace 503
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 504
112 titles retrieved in the namespace 504
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 505
23 titles retrieved in the namespace 505
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 506
26 titles retrieved in the namespace 506
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 507
1 titles retrieved in the namespace 507
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 2300
0 titles retrieved in the namespace 2300
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 2301
0 titles retrieved in the namespace 2301
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 2302
0 titles retrieved in the namespace 2302
Sleeping... 0.50 seconds...
Retrieving titles in the namespace 2303
0 titles retrieved in the namespace 2303
Sleeping... 0.50 seconds...
Titles saved at... elinuxorg-20230217-titles.txt
28962 page titles loaded
https://elinux.org/api.php
Retrieving the XML for every page from "start"
Sleeping... 0.50 seconds...
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/bin/dumpgenerator", line 8, in <module>
    sys.exit(main())
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/dumpgenerator/__init__.py", line 26, in main
    DumpGenerator()
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/dumpgenerator/generator.py", line 87, in __init__
    DumpGenerator.createNewDump(config=config, other=other)
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/dumpgenerator/generator.py", line 100, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other["session"])
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/dumpgenerator/xml_dump.py", line 92, in generateXMLDump
    xml = cleanXML(xml=xml)
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/dumpgenerator/util.py", line 73, in cleanXML
    if re.search(r"", xml):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/re.py", line 201, in search
    return _compile(pattern, flags).search(string)
TypeError: cannot use a string pattern on a bytes-like object
```

(I was trying to reproduce the animation issue on https://github.com/mediawiki-client-tools/mediawiki-scraper/issues/121, but (a) my ZSH terminal on macOS doesn't seem to support the animation in the first place, and (b) this crash is happening before the process gets to downloading images.)

yzqzss commented 1 year ago

> I'm getting this again (on a0aa9b3):

Strange: I can run the same command successfully...

...
Retrieving the XML for every page from the beginning

Retrieving the XML for every page

    (Embedded) Linux debugging/profiling/tracing tools - Overview, 1 edit
    /BeagleBoard/GSoC/GPIO-parallel-bi-dir-bus, 2 edits
    /Jetson/L4T/Camera BringUp, 1 edit
    0-day survey response, 4 edits
...

Python 3.10.6 on Linux Mint.

yzqzss commented 1 year ago

I noticed that your traceback shows the old directory structure (from before NyaMisty refactored it):

wikiteam3/dumpgenerator/generator.py
wikiteam3/dumpgenerator/xml_dump.py
wikiteam3/dumpgenerator/util.py

Maybe you forgot to run `poetry build` or `pip install --force-reinstall`?
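
A quick way to check for a stale install is to ask Python where it would import the package from and which version is registered. The sketch below uses only the standard library (`importlib.metadata` needs Python 3.8+). The helper name `describe_install` is made up for illustration, and it assumes the distribution is registered under the same name as the module, `wikiteam3`, which may not hold for every install.

```python
import importlib.util
from importlib import metadata


def describe_install(module_name, dist_name=None):
    """Report where a top-level module would be imported from and its installed version.

    dist_name defaults to module_name; that default is an assumption.
    """
    spec = importlib.util.find_spec(module_name)
    location = spec.origin if spec is not None else None
    try:
        version = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        # No distribution metadata registered under that name.
        version = None
    return {"module": module_name, "location": location, "version": version}


# If the location still points at the old layout, the reinstall
# did not replace the copy that this interpreter actually imports.
print(describe_install("wikiteam3"))
```

Run this with the same interpreter that the `dumpgenerator` script uses; a different `python` on the `PATH` can report a different copy.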

elsiehupp commented 1 year ago

I just re-ran `poetry build` and `pip install --force-reinstall`, and now I'm getting this error:

```bash
% dumpgenerator --failfast --xml https://elinux.org
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/bin/dumpgenerator", line 5, in <module>
    from wikiteam3.dumpgenerator import main
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/dumpgenerator/__init__.py", line 22, in <module>
    from wikiteam3.dumpgenerator.dump import DumpGenerator
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/dumpgenerator/dump/__init__.py", line 1, in <module>
    from .generator import DumpGenerator
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/dumpgenerator/dump/generator.py", line 27, in <module>
    from wikiteam3.dumpgenerator.cli import getParameters, bye, welcome
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/dumpgenerator/cli/__init__.py", line 1, in <module>
    from .cli import getParameters
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/dumpgenerator/cli/cli.py", line 14, in <module>
    from wikiteam3.dumpgenerator.api import checkRetryAPI, mwGetAPIAndIndex
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/dumpgenerator/api/__init__.py", line 1, in <module>
    from .api import checkAPI, checkRetryAPI, mwGetAPIAndIndex
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/dumpgenerator/api/api.py", line 10, in <module>
    from wikiteam3.utils import getUserAgent
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/utils/__init__.py", line 8, in <module>
    from .login import uniLogin, fetchLoginToken, botLogin, clientLogin, indexLogin
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/utils/login/__init__.py", line 6, in <module>
    from wikiteam3.utils.login.api import botLogin, clientLogin, fetchLoginToken
  File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/utils/login/api.py", line 6, in <module>
    def fetchLoginToken(session: requests.Session, api: str) -> str|None:
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
```

I honestly have no idea what's going on. The only thing I changed recently is updating from macOS 12 to macOS 13.2. I imagine maybe the major OS update did something strange to the Python subsystem?

I ran `python --version` and got Python 3.8.3, though, which seems normal enough. To emphasize, I'm running this test in a separate Terminal.app window outside the Poetry virtual environment, which is inside VS Code, so it shouldn't be anything related to the dev environment.

(Yes, I recognize that this is no longer the same issue, but it would be nice to get my copy working again... 😬)

(As an aside, this is an instance where actual version/build numbers would be useful...)

yzqzss commented 1 year ago

> I just re-ran `poetry build` and `pip install --force-reinstall`, and now I'm getting this error:

...
File "/Users/elsiehupp/Library/Python/3.8/lib/python/site-packages/wikiteam3/utils/login/api.py", line 6, in <module>
def fetchLoginToken(session: requests.Session, api: str) -> str|None:
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Fixed in https://github.com/mediawiki-client-tools/mediawiki-scraper/pull/130/commits/9ffba70740f647bbbb8c16a6aa3d9a0b485a28a9. #130
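
For context on why this fails on Python 3.8: the `X | None` union syntax (PEP 604) is only evaluated successfully at runtime on Python 3.10 and later, and a return annotation is evaluated when the `def` statement runs, so the import itself raises. The sketch below shows a spelling that works on older versions; `find_token` and `pep604_supported` are hypothetical helpers for illustration, and this is not necessarily what the linked commit changed.

```python
import sys
from typing import Optional


def pep604_supported() -> bool:
    """Return True if `str | None` can be evaluated at runtime (Python 3.10+)."""
    try:
        str | None  # raises TypeError on Python 3.9 and earlier
    except TypeError:
        return False
    return True


def find_token(text: str) -> Optional[str]:
    """Hypothetical helper: return whatever follows 'token=' in text, or None.

    Optional[str] means the same as `str | None` but also works on Python 3.8.
    """
    _, sep, rest = text.partition("token=")
    return rest if sep else None


print(sys.version_info[:2], pep604_supported())
print(find_token("action=query&token=abc123"))
```

Another option, assuming nothing inspects the annotations at runtime, is to put `from __future__ import annotations` at the top of the module, which defers annotation evaluation so that `str | None` in a signature no longer raises on Python 3.8.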

test status

I also found some problems in uploader.py in this test action (dumpgenerator is fine), so I marked this pull request as draft.

Update: Done.