Documentation: How to develop new supported site

SpiffyChatterbox commented 5 months ago

Hey all!

I'd like to work on adding a new supported site. However, it's unclear to someone with my skill level how to do that.

I can write a web crawler, so am comfortable with using requests and BeautifulSoup, but don't know how to take that knowledge and integrate with the gallery-dl classes.

If someone would be willing to jot down some notes and/or answer some questions, I'd be happy to write the steps out long form so it could be added to the wiki.

When you find a new site and want to extend, what do you do first? What info do you need from the site to create an extractor?

If this is the wrong place to ask, feel free to let me know a better place!

biggestsonicfan commented 5 months ago

I'm working on one as well, and as far as I can tell it's all about cherry picking from similar extractors. I know that's not a good answer, but it's all I've got so far.

mikf commented 5 months ago

I kind of want to rework (and hopefully improve) most of the current extractor infrastructure in v2.0, so writing a guide on how to develop new extractors in the old style is somewhat of a waste I thought and it therefore hasn't happened till now.

Look at merged PRs and commits that add new extractors / support for a new site and adopt their code.

SpiffyChatterbox commented 4 months ago

Thank you both for your comments and feedback.

Hey mikf, I totally understand your position, and don't want to detract from your efforts to rework the infrastructure. And normally I would totally agree about the time/effort, it's just that some of the sites I'm looking at are removing content. So the longer it takes me, the less quality downloads I can add to my archive.

I know you have priorities, and am not asking to be one of them. I appreciate the time you've put into this project! I'm going to keep learning and working, and am hoping someone will help answer a few questions so I can keep trying to figure out the current extractors. I'll also gladly migrate to the new format as soon as it comes out.

With that in mind, I'm still open to any guidance and suggestions to help steer my learning.

I've gotten a Docker image setup with a fork of the code so I can make a change and run it to see it in action.
I've looked at a lot of PRs for extractors, but don't fully understand what's going on well enough to really "learn."
I'm guessing that it's a Factory design pattern, maybe? So that any classes in the extractor that inherit the Extractor function get called, as long as their regex matches the pattern.
So I'm currently trying to duplicate the directlink extractor into my extractor and tinker with the pattern so that a download for an image uses my extractor instead of the directlink.

Input that could help me:

Any suggestions on cherry picking from similar extractors? How do I figure out if the site I'm looking at is similar to another?
I think I have all of the files figured out, but it doesn't seem to be working. I have it added to scripts/supportedsites.py and extractor/init.py, but when I try to download from my site, it still seems to be defaulting to directlink.
I can't figure out how to use the tests. Is there a way to unittest one extractor?

Thanks for any and all direction!

SpiffyChatterbox commented 4 months ago

Quick edit; I think I answered one of my own questions.

I'm having trouble naming some pages. Is there a standard nomenclature? I've got sections like: a search that returns a gallery of galleries, a filter that returns a gallery of galleries, a gallery of images, a page with a single image.

I went through a good sampling of the existing Extracts and each seems to have their own taxonomy. Some of the common ones are:

Post/Image/Asset
Category/Tag/Genre
User/Creator/Artist
Gallery/Thread/Collection/Project
Query/Search Results

So it seems this isn't something required by the code, just a term you apply to the site. Looks like gallery-dl will support whatever you want to call it.

Hrxn commented 4 months ago

I went through a good sampling of the existing Extracts and each seems to have their own taxonomy. Some of the common ones are:

Post/Image/Asset

Category/Tag/Genre

User/Creator/Artist

Gallery/Thread/Collection/Project

Query/Search Results

So it seems this isn't something required by the code, just a term you apply to the site. Looks like gallery-dl will support whatever you want to call it.

Yes, basically. Usually, the naming in the extractor reflects the nomenclature used by the site the extractor is written for.

I've gotten a Docker image setup with a fork of the code so I can make a change and run it to see it in action.

Not sure for what you would need to use Docker for here. You only need Python, and git (Which is already included if you use something like https://github.com/apps/desktop, which would be the simplest way to do this)

So I'm currently trying to duplicate the directlink extractor into my extractor and tinker with the pattern so that a download for an image uses my extractor instead of the directlink.

Possible, although maybe not the best example to use as a starting point, because the directlink extractor is not really similar to any other extractor.

Any suggestions on cherry picking from similar extractors? How do I figure out if the site I'm looking at is similar to another?

Uh, depends? 😄 To be sure, you would have to show us an example. But basically, is it a booru-like site? Or thread based, like some "chan" site? Does it have an API you plan to use, or do you rather rely on extracting info from HTML etc.

I think I have all of the files figured out, but it doesn't seem to be working. I have it added to scripts/supportedsites.py and extractor/init.py, [..]

Yes, you have to add

your extractor to the modules in gallery_dl/extractor/__init__.py,
your extractor yourextractorfilename.py must match and
yourextractorfilename.py belongs into gallery_dl/extractor/

This is the necessary part, but you should also add your extractor to scripts/supportedsites.py and ideally also add a test/results/yourextractorfilename.py.

Not sure if docs/supportedsites.md is still necessary. But that can easily be fixed later.

[..] but when I try to download from my site, it still seems to be defaulting to directlink.

If it still defaults to the directlink extractor, your test URL still seems to match the directlink pattern. Your test pattern needs to be unique, it should only match your test input.

You can check this with https://pythex.org/ or https://regex101.com/ (don't forget to set the regex flavor to Python first here)

A somewhat simple example to start would maybe be this one: https://github.com/mikf/gallery-dl/blob/master/gallery_dl/extractor/4chan.py

SpiffyChatterbox commented 4 months ago

Thank your for your thoughts and answers, Hrxn!

Not sure for what you would need to use Docker for here. You only need Python, and git

Well, this was the best way I could figure out how to develop. If I installed gallery-dl with pip, then running gallery-dl would use that install, and not my modified/forked version. And if I just cloned it to a new environment, it wasn't "installed" so I couldn't run it.

Docker is how I got a fork that I can modify code and run the modified version. If I'm missing another option that's easier, would love to hear it!

Your test pattern needs to be unique, it should only match your test input.

OK, that is key information that I missed before. Thank you, thank you!

But basically, is it a booru-like site? Or thread based, like some "chan" site? Does it have an API you plan to use, or do you rather rely on extracting info from HTML etc.

Yeah, OK... Just reading these questions is pointing me in a good direction. I'll dig in from here and see where that takes me.

and ideally also add a test/results/yourextractorfilename.py.

Once I add them, how can I trigger the test for that one extractor? Every time I try to use the commands I'm used to, either it tries to run all of the tests or none of them.

mikf commented 4 months ago

And if I just cloned it to a new environment, it wasn't "installed" so I couldn't run it.

You can run Python code from source:

python -m gallery_dl ...

See https://docs.python.org/3/using/cmdline.html#cmdoption-m for details. Setting or modifying PYTHONPATH might also be helpful.

Once I add them, how can I trigger the test for that one extractor? Every time I try to use the commands I'm used to, either it tries to run all of the tests or none of them.

By running python test_results.py YOUR_CATEGORY inside the test directory. make test also includes some extractor-related tests.

docs/supportedsites.md

is just that: documentation. make (or more specifically scripts/supportedsites.py) will update this file automatically.

The minimum requirements for an extractor to be recognized is an entry in the modules list in extractor/init.py and a class with a pattern attribute in that module. It should inherit from Extractor, but it technically doesn't need to. Next thing is defining an items() method and yielding messages from it.

from .common import Extractor, Message
from .. import text

class ExampleTestExtractor(Extractor):
    category = "example"
    subcategory = "test"
    pattern = r"(?:https?://)?example\.org"

    def items(self):
        url = "https://www.iana.org/_img/2022/iana-logo-header.svg"
        data = text.nameext_from_url(url)
        yield Message.Directory, data
        yield Message.Url, url, data

Some simple, albeit older, examples from PRs would be b17e2dcf939e82bb375db0e581daeaf2d3a42b53 and 25297815bcc8609317c4b09378c9c35671259756. The CatboxFileExtractor is also quite minimal.

SpiffyChatterbox commented 4 months ago

Outstanding! Thanks so much for your help, you've gotten me over several hurdles!

I got it working with a single URL. (Here's my working code if that helps.) Now need to figure out how the GalleryExtractor works.

I see from the Catbox example (and others) that we're returning a dictionary with details in metadata(), and a URL. But I can't tell what is required, and what's extraneous for that example. How can I tell what's necessary for the Album/Gallery extractor to identify links and send to the Extractor? And like, what is the page parameter for?

I've started putting my lessons learned into a wiki as a draft. Like I said, if you're going to change the way extractors work, then this is just a learning exercise for me. But if that rewrite is a ways out, maybe this can help other noob/part-time developers add some functionality.

SpiffyChatterbox commented 4 months ago

I've gotten two going so far, but both are just "single image from a image page." I still need a hand figuring out how a Album/Gallery works. Open to suggestions and feedback!

biggestsonicfan commented 4 months ago

I am trying to write an extractor which is similar to the imagechest extractor.

I'm able to call the API and get post data, but it aggregates all posts before moving on to grab the data of each post. Ideally I'd like for it to scrape data from the posts in the given range, then move onto the next set of posts using the params.

mikf commented 4 months ago

Sounds like you should be using generators (functions that yield their results one at a time), but I can't say for certain without seeing the actual code.

biggestsonicfan commented 4 months ago

...but I can't say for certain without seeing the actual code.

See attached

mikf commented 4 months ago

I tried to clean up your code a bit: https://gist.github.com/mikf/999147ca6c381a067c2d450ac3510ae9 (I might have overdone it, but oh well ...)

This currently only prints IDs of accessible and not accessible posts.

What I've noticed:

Your cookies/access_token code doesn't work the way it is. cookies.get returns a plain string which does not provide a .json() method. You either need to util.jsom_loads() it or just text.extr() the access_token like I did. (Providing OAuth tokens as cookie value is very unconventional. I don't think any other site supported by gallery-dl does this.)
Your items() doesn't yield anything
Extractor inheritance and __init__ chaining was all over the place (There's the real Python constructor __init__, and then there's gallery-dl's extra _init() which should be used for delayed init)

SpiffyChatterbox commented 3 months ago

So one thing I'm trying is to incorporate debug logging into my extractors so I can see what they're doing. Example here, line 38 properly outputs the results of data.

But it doesn't seem to be working when I'm trying with the Gallery Extractor. Example here, lines 24 and 35 where I attempt to see what's happening, but just get a Traceback error.

Any ideas how I can do some debugging and see what's going on?

mikf commented 3 months ago

You need to call __init__() with the right arguments (or at least define the root path and capture the rest of the path in the pattern regex)

diff --git a/gwm.py b/gwm2.py
index a30b76b..d8c0890 100644
--- a/gwm.py
+++ b/gwm2.py
@@ -13,11 +13,14 @@ class GirlsWithMuscleGalleryExtractor(GalleryExtractor):
     """Extractor for catbox albums"""
     category = "gwm"
     subcategory = "album"
-    pattern = BASE_PATTERN + r"/images/\?name=[\w\s%]*"
+    pattern = BASE_PATTERN + r"/images/\?name=([^&#]+)"
     filename_fmt = "{filename}.{extension}" # Not sure if this is used?
     directory_fmt = ("{category}", "{album_name} ({album_id})") # Not sure if this is used?
     archive_fmt = "{album_id}_{filename}" # Not sure if this is used?

+    def __init__(self, match):
+        url = "https://www.girlswithmuscle.com/images/?name=" + match.group(1)
+        GalleryExtractor.__init__(self, match, url)

     def metadata(self, page):
         extr = text.extract_from(page)

Instead of log.debug() you could also use regular "debug" print()s or self._dump(...).

SpiffyChatterbox commented 3 months ago

Thanks again mikf!

Quick update; I now have it tentatively working. I certainly need to do more testing before submitting a PR, but did some debug stepping and understand a lot more about what's going on.

I saw in various issues that you welcome documentation submissions, so I did some rewriting of the Wiki. It didn't do like a PR and ask for permission, it just allowed me to change it. I hope I didn't overstep. And feel free to let me know if I'm going in a bad direction, I'll be happy to rework something.

Also, if you're OK with it, I'd like to work on a detailed docstring PR for common.Extractor() and GalleryExtractor(). I think an explanation in there could help new developers understand what's going on and make it easier to spread the load of the extractor work. Something along the lines of the extractor comments in youtube-dl.

SpiffyChatterbox commented 3 months ago

OK, I switched to working on another extractor and it's helping me see what I did wrong with my first pass at the documentation. So things are coming along in that department, though it is slow.

Where I could use help:

I have the Gallery extraction working, but only on the first page. How do I get it to recognize that there's a 2nd page and keep iterating?

SpiffyChatterbox commented 3 months ago

I think I'm good with this now. I see a new PR for the same site I was working on (https://github.com/mikf/gallery-dl/pull/6016) and hunter-gatherer8 got the pages() method that answers my question.

I will continue working through more debugging and documentation and check back later, thanks!

Hrxn commented 3 months ago

Nice, that's good to hear.

biggestsonicfan commented 1 month ago

So where exactly would you dump json data for each item?

mikf commented 1 month ago

Replace print(post["id"]) with self._dump(post) https://gist.github.com/mikf/999147ca6c381a067c2d450ac3510ae9#file-boosty-py-L35 and maybe change the continue on the next line to a return or exit().

mikf / gallery-dl

Documentation: How to develop new supported site #5750