mikwielgus / forum-dl

Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC
MIT License

vBulletin errors #12

Open oderyn opened 1 year ago

oderyn commented 1 year ago

I would like to use forum-dl to generate a list of links from a given forum that I could then send to SingleFile to generate HTML pages of all posts in the thread.

I am using this command:

forum-dl -g --no-boards --no-files https://forum.com/forums/showthread.php?12345-title-of-the-post/page19

When I run this command for vbulletin, it does not generate a list of all 19 pages in the thread as I would expect to happen -- just the one page that I entered. Like so:

https://forum.com/forums/showthread.php
https://forum.com/forums/showthread.php?12345-title-of-the-post/page19
https://forum.com/

This happens no matter which page in the forum I pass into forum-dl.

When I add -v to the above command, I get the following output:

DEBUG:root:Attempting GET https://forum.com/forums/showthread.php {} {}
https://forum.com/forums/showthread.php
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): forum.com:443
DEBUG:urllib3.connectionpool:https://forum.com:443 "GET /forums/showthread.php HTTP/1.1" 200 None
DEBUG:root:Attempting GET https://forum.com/forums/showthread.php?12345-title-of-the-post/page19 {} {}
https://forum.com/forums/showthread.php?12345-title-of-the-post/page19
DEBUG:urllib3.connectionpool:https://forum.com:443 "GET /forums/showthread.php?12345-title-of-the-post/page19 HTTP/1.1" 200 None
DEBUG:root:Attempting GET https://forum.com/forums/showthread.php {} {}
DEBUG:root:Attempting GET https://forum.com/forums/showthread.php {} {}
DEBUG:root:Attempting GET https://forum.com/forums/showthread.php?12345-title-of-the-post/page19 {} {}
DEBUG:root:Attempting GET https://forum.com/forums/ {} {}
https://forum.com/forums/
DEBUG:urllib3.connectionpool:https://forum.com:443 "GET /forums/ HTTP/1.1" 200 None
DEBUG:root:Attempting GET https://forum.com/forums/showthread.php?12345-title-of-the-post/page19 {} {}
DEBUG:root:Attempting GET https://forum.com/forums/ {} {}

I tried running the command to output the files to a directory:

forum-dl --files-output="test/" https://forum.com/forums/showthread.php?12345-title-of-the-post/page19

I got the following error:

INFO:root:GET https://forum.com/forums/showthread.php {} {}
INFO:root:GET https://forum.com/forums/showthread.php?12345-title-of-the-post/page19 {} {}
INFO:root:GET https://forum.com/ {} {}
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.10.11/bin/forum-dl", line 8, in <module>
    sys.exit(main())
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/__init__.py", line 34, in main
    forumdl.download(
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/forumdl.py", line 24, in download
    self.download_url(
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/forumdl.py", line 48, in download_url
    writer.write(url)
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/writers/common.py", line 78, in write
    self.write_board(base_node)
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/writers/common.py", line 103, in write_board
    self._write_board_object(board)
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/writers/common.py", line 235, in _write_board_object
    sys.stdout.write(f"{self._serialize_entry(entry)}\n")
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/forum_dl/writers/jsonl.py", line 10, in _serialize_entry
    return entry.json(models_as_dict=False)
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/typing_extensions.py", line 2562, in wrapper
    return __arg(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/pydantic/main.py", line 950, in json
    raise TypeError('The `models_as_dict` argument is no longer supported; use a model serializer instead.')
TypeError: The `models_as_dict` argument is no longer supported; use a model serializer instead.
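
The last frames show pydantic itself rejecting `entry.json(models_as_dict=False)`, which suggests the installed pydantic is a 2.x release. Below is a minimal sketch for checking that with only the standard library. The helper names `major_version` and `installed_pydantic_major` are hypothetical and not part of forum-dl.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '2.1.1'."""
    return int(version_string.split(".")[0])


def installed_pydantic_major() -> Optional[int]:
    """Return the major version of the installed pydantic, or None if absent."""
    try:
        return major_version(version("pydantic"))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    major = installed_pydantic_major()
    if major is None:
        print("pydantic is not installed in this environment")
    elif major >= 2:
        # Assumption drawn from the traceback: 2.x rejects models_as_dict.
        print("pydantic 2.x detected: json(models_as_dict=...) raises TypeError")
    else:
        print(f"pydantic {major}.x detected")
```

If this reports 2.x, installing a 1.x release (for example `pip install "pydantic<2"`) may work around the error until the serializer call is updated. This is inferred from the error message and has not been confirmed in this thread.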

--

Result of pip3 --version

pip 23.2.1 from /home/user/.pyenv/versions/3.10.11/lib/python3.10/site-packages/pip (python 3.10)

Result of uname -a

Linux computername 5.19.0-46-generic #47-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 16 13:30:11 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Result of cat /etc/os-release

PRETTY_NAME="Ubuntu 22.10"
NAME="Ubuntu"
VERSION_ID="22.10"
VERSION="22.10 (Kinetic Kudu)"
VERSION_CODENAME=kinetic
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=kinetic
LOGO=ubuntu-logo

mikwielgus commented 1 year ago

You're passing the URL of page 19 (https://forum.com/forums/showthread.php?12345-title-of-the-post/page19). Pass the URL of the first page instead: https://forum.com/forums/showthread.php?12345-title-of-the-post.
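
For scripting this over many links, here is a small sketch that strips a trailing `/pageN` segment so the first-page URL is what gets passed. It assumes the URL shape seen in this thread (`showthread.php?<id>-<slug>/page<N>`); the function name `first_page_url` is hypothetical and not part of forum-dl.

```python
import re

# Assumed URL shape, taken from the examples in this thread:
#   .../showthread.php?<id>-<slug>/page<N>
_PAGE_SUFFIX = re.compile(r"/page\d+/?$")


def first_page_url(url: str) -> str:
    """Drop a trailing '/page<N>' segment; leave other URLs unchanged."""
    return _PAGE_SUFFIX.sub("", url)


if __name__ == "__main__":
    print(first_page_url(
        "https://forum.com/forums/showthread.php?12345-title-of-the-post/page19"
    ))
```

Other vBulletin installations may encode the page number differently (for example as a query parameter), in which case the pattern would need adjusting.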

oderyn commented 1 year ago

Ah! That makes sense.

...

Just tried it. Unfortunately, I got the same result. To be clear, I tried the following:

forum-dl -g --no-boards --no-files https://forum.com/forums/showthread.php?12345-title-of-the-post/

and

forum-dl -g --no-boards --no-files https://forum.com/forums/showthread.php?12345-title-of-the-post/page1

Any other pointers? Otherwise, I can do some more troubleshooting on my own a little later.

mikwielgus commented 1 year ago

Given that it downloads only one thread page, most likely the CSS selector forum-dl uses to find the next-page link is not working on this particular website (perhaps it's using an older vBulletin version, or is heavily themed). I should be able to fix this if you give me the real link to this forum.
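
To illustrate why a themed forum can break pagination, here is a rough sketch of next-page discovery. It is not forum-dl's actual code: the real selector is not shown in this thread, so the sketch swaps in the standard library's `html.parser` and looks for an `<a rel="next">` link as a stand-in for a CSS selector. The markup in `SAMPLE` and the names `NextLinkFinder` and `find_next_page` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> whose rel attribute contains 'next'."""

    def __init__(self) -> None:
        super().__init__()
        self.next_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.next_href is not None:
            return
        attr_map = dict(attrs)
        rel_tokens = (attr_map.get("rel") or "").split()
        if "next" in rel_tokens and attr_map.get("href"):
            self.next_href = attr_map["href"]


def find_next_page(html: str, page_url: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None if no link matches."""
    finder = NextLinkFinder()
    finder.feed(html)
    if finder.next_href is None:
        return None
    return urljoin(page_url, finder.next_href)


# Hypothetical markup; a real theme may name or structure this differently.
SAMPLE = (
    '<div class="pagination">'
    '<a rel="next" href="showthread.php?12345-title-of-the-post/page2">Next</a>'
    "</div>"
)

if __name__ == "__main__":
    base = "https://forum.com/forums/showthread.php?12345-title-of-the-post"
    print(find_next_page(SAMPLE, base))
```

When the page has no element matching the expected pattern, `find_next_page` returns `None` and a crawler built this way stops after one page, which matches the symptom described above.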

oderyn commented 1 year ago

Here are a couple that I am having issues with:

I am pretty sure I came across a few others that forum-dl wasn't grabbing. I'll need to dig back through my notes, as I was testing a variety of forums to see how it performed. I know there was a XenForo forum, too.

If you're grabbing the pagination via CSS, that's definitely something I could do -- in theory, at least. Have you considered adding these selectors to a config file that users could extend? Or, if you point me to where you are adding it in the code, I could add new ones as I come across them and submit pull requests -- if you're at all interested in that sort of help. I hate to pester you with requests every time I come across something that doesn't work.

BTW, this is a great tool. It does most of what I've been looking for in regards to grabbing full forum threads.

Thanks!

mikwielgus commented 1 year ago

> Have you considered adding these selectors to a config file that users could extend?

Yes. Actually, I intend to make it available through a command line switch. But I haven't started working on it yet because I had to focus on another project.

> Or, if you point me to where you are adding it in the code, I could add new ones as I come across them and submit pull requests -- if you're at all interested in that sort of help. I hate to pester you with requests every time I come across something that doesn't work.

Pull requests are very much welcome. vBulletin's next-thread-page CSS selector is here. It should be trivial to patch.

I've reverted some recent experimental and broken code from develop so that you won't have to bother with it.

> BTW, this is a great tool. It does most of what I've been looking for in regards to grabbing full forum threads.
>
> Thanks!

You're welcome.

mikwielgus commented 11 months ago

Do you intend to work on this issue in the near future (as you implied you were interested in that)? I would like to release 0.3.1 in the next few weeks with this bug fixed, so it would be good to know whether I should wait or fix this myself.

oderyn commented 11 months ago

@mikwielgus Apologies for not getting back to you sooner. I thought I could fix the issue, but it was beyond my programming skill level (novice, for sure).

math-ematics commented 7 months ago

I get the same error, but I am trying to save the "view all posts by user" page, not the normal forum page itself. It's still posts, just under another path.

bleomycin commented 2 months ago

Thanks for making this! Just wanted to add that I'm experiencing the same error as the OP with: https://www.rcgroups.com/forums/showthread.php?1074181-Lipo-Storage-Voltage-and-Dead-Battery-Information