mikwielgus / forum-dl

Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC
MIT License
68 stars, 2 forks

Downloading forum: extractor not found #19

Open SanZamoyski opened 1 month ago

SanZamoyski commented 1 month ago

Hi!

forum-dl -f maildir --boards --threads --posts --files https://www.t4-forum.pl/index.php?sid=79f9a8226101647960c17ddccb921b06
INFO:root:GET https://www.t4-forum.pl {} {}
INFO:root:GET https://www.t4-forum.pl/index.php?sid=79f9a8226101647960c17ddccb921b06 {} {}
INFO:root:GET https://www.t4-forum.pl/index.php {} {}
INFO:root:GET https://www.t4-forum.pl/ {} {}
INFO:root:GET https://www.t4-forum.pl/index.php/viewforum.php {} {}
Traceback (most recent call last):
  File "/home/san/.local/bin/forum-dl", line 8, in <module>
    sys.exit(main())
  File "/home/san/.local/lib/python3.10/site-packages/forum_dl/__init__.py", line 34, in main
    forumdl.download(
  File "/home/san/.local/lib/python3.10/site-packages/forum_dl/forumdl.py", line 24, in download
    self.download_url(
  File "/home/san/.local/lib/python3.10/site-packages/forum_dl/forumdl.py", line 40, in download_url
    extractor = extractors.find(url, session_options, extractor_options)
  File "/home/san/.local/lib/python3.10/site-packages/forum_dl/extractors/__init__.py", line 37, in find
    raise ExtractorNotFoundError(url)
forum_dl.exceptions.ExtractorNotFoundError: https://www.t4-forum.pl/index.php?sid=79f9a8226101647960c17ddccb921b06

I want to download the whole forum. Also, is there a way to log in before downloading, or is the sid enough?

Thanks and best regards!

mikwielgus commented 1 month ago

Forum-dl probably fails to detect which extractor to use because of the forum's custom theme. There's currently no way to tweak the detection code from the command line, so this can only be fixed manually by modifying the detector code.

There's no way to log in as I haven't implemented cookies yet. Sadly, I haven't had much time to continue this project as I'm currently very busy with another one.

SanZamoyski commented 1 month ago

I understand. Could you point me to the place where I should tweak it?

mikwielgus commented 1 month ago

The PhpBB detection code starts here. In fact, you can see what the problem is from the first line straight away... But this is probably just the tip of the iceberg. Feel free to open a PR if you can get it working.

But! I have another project, Skrob, which I intend as a future replacement for Forum-dl. Though I haven't finished it or written documentation for it yet, it may be of use to you.

If you install it (git clone https://github.com/mikwielgus/skrob, then pip install -e skrob), you should be able to crawl this forum and scrape the text of its posts with the following command:

skrob -n 1 "title; .forumlink::attr(href) -> {title; .topictitle::attr(href) -> {title; .postbody; td.gensmall[width^='100'][align='right'][nowrap='nowrap'] a::attr(href) ->}}" "https://www.t4-forum.pl/index.php"

Unfortunately, this outputs raw line-by-line-streamed XHTML, which requires further processing. But it may be a good start!
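As a rough idea of what that further processing could look like, here is a minimal Python sketch that strips the markup from XHTML fragments using only the standard library. It is not part of Skrob or Forum-dl; the names `TextExtractor` and `xhtml_to_text` are made up for this example, and the sample markup is invented, since the exact shape of Skrob's output may differ.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "br":
            # Treat <br> and <br /> as a line break in the output.
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def xhtml_to_text(fragment: str) -> str:
    """Return the plain text of one XHTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace on each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    # Invented sample; real Skrob output may be structured differently.
    sample = '<div class="postbody">Hello <b>world</b> &amp; all<br />next</div>'
    print(xhtml_to_text(sample))
```

To use it on real output, you could loop over `sys.stdin` and call `xhtml_to_text` on each line, assuming each streamed line is a self-contained fragment. If posts span several lines, the lines would need to be grouped first.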