Open SpiffyChatterbox opened 5 months ago
I'm working on one as well, and as far as I can tell it's all about cherry picking from similar extractors. I know that's not a good answer, but it's all I've got so far.
I kind of want to rework (and hopefully improve) most of the current extractor infrastructure in v2.0, so writing a guide on how to develop new extractors in the old style seemed somewhat of a waste, I thought, and it therefore hasn't happened till now.
Look at merged PRs and commits that add new extractors / support for a new site and adopt their code.
Thank you both for your comments and feedback.
Hey mikf, I totally understand your position, and don't want to detract from your efforts to rework the infrastructure. And normally I would totally agree about the time/effort, it's just that some of the sites I'm looking at are removing content. So the longer it takes me, the less quality downloads I can add to my archive.
I know you have priorities, and am not asking to be one of them. I appreciate the time you've put into this project! I'm going to keep learning and working, and am hoping someone will help answer a few questions so I can keep trying to figure out the current extractors. I'll also gladly migrate to the new format as soon as it comes out.
With that in mind, I'm still open to any guidance and suggestions to help steer my learning.
Input that could help me:
Thanks for any and all direction!
Quick edit; I think I answered one of my own questions.
I'm having trouble naming some pages. Is there a standard nomenclature? I've got sections like: a search that returns a gallery of galleries, a filter that returns a gallery of galleries, a gallery of images, a page with a single image.
I went through a good sampling of the existing Extracts and each seems to have their own taxonomy. Some of the common ones are:
- Post/Image/Asset
- Category/Tag/Genre
- User/Creator/Artist
- Gallery/Thread/Collection/Project
- Query/Search Results
So it seems this isn't something required by the code, just a term you apply to the site. Looks like gallery-dl will support whatever you want to call it.
Yes, basically. Usually, the naming in the extractor reflects the nomenclature used by the site the extractor is written for.
I've gotten a Docker image setup with a fork of the code so I can make a change and run it to see it in action.
Not sure what you would need Docker for here. You only need Python and git (which is already included if you use something like https://github.com/apps/desktop, which would be the simplest way to do this)
So I'm currently trying to duplicate the directlink extractor into my extractor and tinker with the pattern so that a download for an image uses my extractor instead of the directlink.
Possible, although maybe not the best example to use as a starting point, because the directlink extractor is not really similar to any other extractor.
Any suggestions on cherry picking from similar extractors? How do I figure out if the site I'm looking at is similar to another?
Uh, depends? 😄 To be sure, you would have to show us an example. But basically, is it a booru-like site? Or thread based, like some "chan" site? Does it have an API you plan to use, or do you rather rely on extracting info from HTML etc.
I think I have all of the files figured out, but it doesn't seem to be working. I have it added to `scripts/supportedsites.py` and `extractor/__init__.py`, [..]
Yes, you have to add an entry to the `modules` list in `gallery_dl/extractor/__init__.py`; `yourextractorfilename.py` must match that entry and belongs into `gallery_dl/extractor/`.
This is the necessary part, but you should also add your extractor to `scripts/supportedsites.py` and ideally also add a `test/results/yourextractorfilename.py`.
Not sure if `docs/supportedsites.md` is still necessary, but that can easily be fixed later.
[..] but when I try to download from my site, it still seems to be defaulting to directlink.
If it still defaults to the directlink extractor, your test URL still seems to match the directlink pattern. Your test pattern needs to be unique, it should only match your test input.
You can check this with https://pythex.org/ or https://regex101.com/ (don't forget to set the regex flavor to Python first here)
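As a standalone illustration of why uniqueness matters (both patterns and URLs below are made up, not gallery-dl's actual ones): a pattern keyed to your site's host won't collide with generic direct image links, while an extension-only pattern matches nearly everything:

```python
import re

# Both patterns are illustrative only, not gallery-dl's real ones.
# An extension-only pattern matches almost any direct image link;
# a host-specific pattern matches only URLs from your site.
too_broad     = re.compile(r"https?://.*\.(?:jpg|png|gif)")
site_specific = re.compile(r"(?:https?://)?(?:www\.)?example-gallery\.net/album/(\d+)")

album = "https://example-gallery.net/album/1234"
other = "https://cdn.example.com/pic.jpg"

assert site_specific.match(album)        # matches our site
assert not site_specific.match(other)    # ignores unrelated image links
assert too_broad.match(other)            # too broad: grabs everything
```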
A somewhat simple example to start would maybe be this one: https://github.com/mikf/gallery-dl/blob/master/gallery_dl/extractor/4chan.py
Thank you for your thoughts and answers, Hrxn!
Not sure what you would need Docker for here. You only need Python and git
Well, this was the best way I could figure out how to develop. If I installed gallery-dl with pip, then running gallery-dl would use that install, and not my modified/forked version. And if I just cloned it to a new environment, it wasn't "installed" so I couldn't run it.
Docker is how I got a fork that I can modify code and run the modified version. If I'm missing another option that's easier, would love to hear it!
Your test pattern needs to be unique, it should only match your test input.
OK, that is key information that I missed before. Thank you, thank you!
But basically, is it a booru-like site? Or thread based, like some "chan" site? Does it have an API you plan to use, or do you rather rely on extracting info from HTML etc.
Yeah, OK... Just reading these questions is pointing me in a good direction. I'll dig in from here and see where that takes me.
and ideally also add a test/results/yourextractorfilename.py.
Once I add them, how can I trigger the test for that one extractor? Every time I try to use the commands I'm used to, either it tries to run all of the tests or none of them.
And if I just cloned it to a new environment, it wasn't "installed" so I couldn't run it.
You can run Python code from source: `python -m gallery_dl ...`
See https://docs.python.org/3/using/cmdline.html#cmdoption-m for details.
Setting or modifying `PYTHONPATH` might also be helpful.
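To see why this works without an install (a self-contained sketch; the package name is made up): `python -m` and `PYTHONPATH` both work by putting a directory at the front of `sys.path`, so Python imports your checkout instead of any pip-installed copy:

```python
import importlib
import os
import sys
import tempfile

# Simulate "running from a source checkout": create a package directory
# and put it at the front of sys.path, which is what PYTHONPATH does.
# (The package name is invented; with gallery-dl you would instead run
# `python -m gallery_dl ...` from the repository root.)
checkout = tempfile.mkdtemp()
pkg = os.path.join(checkout, "mypkg")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("VERSION = 'from-source'\n")

sys.path.insert(0, checkout)      # equivalent to PYTHONPATH=<checkout>
importlib.invalidate_caches()
mod = importlib.import_module("mypkg")
assert mod.VERSION == "from-source"   # the source copy won, not an installed one
```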
Once I add them, how can I trigger the test for that one extractor? Every time I try to use the commands I'm used to, either it tries to run all of the tests or none of them.
By running `python test_results.py YOUR_CATEGORY` inside the `test` directory.
`make test` also includes some extractor-related tests.
`docs/supportedsites.md` is just that: documentation. `make` (or more specifically `scripts/supportedsites.py`) will update this file automatically.
The minimum requirements for an extractor to be recognized are an entry in the `modules` list in `extractor/__init__.py` and a class with a `pattern` attribute in that module. It should inherit from `Extractor`, but it technically doesn't need to. The next thing is defining an `items()` method and yielding messages from it.
```python
from .common import Extractor, Message
from .. import text


class ExampleTestExtractor(Extractor):
    category = "example"
    subcategory = "test"
    pattern = r"(?:https?://)?example\.org"

    def items(self):
        url = "https://www.iana.org/_img/2022/iana-logo-header.svg"
        data = text.nameext_from_url(url)
        yield Message.Directory, data
        yield Message.Url, url, data
```
Some simple, albeit older, examples from PRs would be b17e2dcf939e82bb375db0e581daeaf2d3a42b53 and 25297815bcc8609317c4b09378c9c35671259756.
The `CatboxFileExtractor` is also quite minimal.
Outstanding! Thanks so much for your help, you've gotten me over several hurdles!
I got it working with a single URL. (Here's my working code if that helps.) Now need to figure out how the GalleryExtractor works.
I see from the Catbox example (and others) that we're returning a dictionary with details in `metadata()`, and a URL. But I can't tell what is required, and what's extraneous for that example. How can I tell what's necessary for the Album/Gallery extractor to identify links and send to the Extractor? And what is the `page` parameter for?
I've started putting my lessons learned into a wiki as a draft. Like I said, if you're going to change the way extractors work, then this is just a learning exercise for me. But if that rewrite is a ways out, maybe this can help other noob/part-time developers add some functionality.
I've gotten two going so far, but both are just "single image from a image page." I still need a hand figuring out how a Album/Gallery works. Open to suggestions and feedback!
I am trying to write an extractor which is similar to the imagechest extractor.
I'm able to call the API and get post data, but it aggregates all posts before moving on to grab the data of each post. Ideally I'd like for it to scrape data from the posts in the given range, then move onto the next set of posts using the params.
Sounds like you should be using generators (functions that `yield` their results one at a time), but I can't say for certain without seeing the actual code.
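A minimal, self-contained sketch of the idea (the API function is faked; real code would call the site's endpoint): a generator yields each post as soon as its page arrives, instead of aggregating all pages before processing any of them:

```python
def fake_api(page):
    """Stand-in for a site API call (invented for this sketch);
    returns one page of posts, or an empty list past the last page."""
    pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    return pages.get(page, [])

def posts():
    """Generator: yield each post as soon as its page is fetched,
    then move on to the next page, instead of collecting everything
    up front."""
    page = 1
    while True:
        batch = fake_api(page)
        if not batch:
            return          # no more pages
        yield from batch    # hand posts to the caller one at a time
        page += 1

assert [p["id"] for p in posts()] == [1, 2, 3]
```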
...but I can't say for certain without seeing the actual code.
See attached
I tried to clean up your code a bit: https://gist.github.com/mikf/999147ca6c381a067c2d450ac3510ae9 (I might have overdone it, but oh well ...)
This currently only prints IDs of accessible and not accessible posts.
What I've noticed:
- `cookies.get` returns a plain string which does not provide a `.json()` method. You either need to `util.json_loads()` it or just `text.extr()` the access_token like I did. (Providing OAuth tokens as a cookie value is very unconventional. I don't think any other site supported by gallery-dl does this.)
- `items()` doesn't `yield` anything
- `__init__` chaining was all over the place (There's the real Python constructor `__init__`, and then there's gallery-dl's extra `_init()`, which should be used for delayed init)

So one thing I'm trying is to incorporate debug logging into my extractors so I can see what they're doing. Example here, line 38 properly outputs the results of `data`.
But it doesn't seem to be working when I'm trying with the Gallery Extractor. Example here, lines 24 and 35 where I attempt to see what's happening, but just get a Traceback error.
Any ideas how I can do some debugging and see what's going on?
You need to call `__init__()` with the right arguments (or at least define the `root` path and capture the rest of the path in the `pattern` regex)
```diff
diff --git a/gwm.py b/gwm2.py
index a30b76b..d8c0890 100644
--- a/gwm.py
+++ b/gwm2.py
@@ -13,11 +13,14 @@ class GirlsWithMuscleGalleryExtractor(GalleryExtractor):
     """Extractor for catbox albums"""
     category = "gwm"
     subcategory = "album"
-    pattern = BASE_PATTERN + r"/images/\?name=[\w\s%]*"
+    pattern = BASE_PATTERN + r"/images/\?name=([^&#]+)"
     filename_fmt = "(unknown).{extension}"  # Not sure if this is used?
     directory_fmt = ("{category}", "{album_name} ({album_id})")  # Not sure if this is used?
     archive_fmt = "{album_id}_(unknown)"  # Not sure if this is used?
+    def __init__(self, match):
+        url = "https://www.girlswithmuscle.com/images/?name=" + match.group(1)
+        GalleryExtractor.__init__(self, match, url)
     def metadata(self, page):
         extr = text.extract_from(page)
```
Instead of `log.debug()` you could also use regular "debug" `print()`s or `self._dump(...)`.
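For the logging route, plain stdlib `logging` is enough to see debug output (the logger name and data below are made up; in gallery-dl an extractor already has a logger available as `self.log`, but the mechanics are the same):

```python
import logging

# Ordinary stdlib logging; the logger name and metadata dict are
# invented for this sketch. In gallery-dl you would use self.log
# inside an extractor instead of creating your own logger.
logging.basicConfig(format="%(levelname)s: %(name)s: %(message)s")
log = logging.getLogger("myextractor")
log.setLevel(logging.DEBUG)       # enable debug output for this logger only

data = {"album_id": "123", "album_name": "demo"}
log.debug("metadata: %s", data)   # lazy %-formatting: built only if emitted
```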
Thanks again mikf!
Quick update; I now have it tentatively working. I certainly need to do more testing before submitting a PR, but did some debug stepping and understand a lot more about what's going on.
I saw in various issues that you welcome documentation submissions, so I did some rewriting of the Wiki. It didn't do like a PR and ask for permission, it just allowed me to change it. I hope I didn't overstep. And feel free to let me know if I'm going in a bad direction, I'll be happy to rework something.
Also, if you're OK with it, I'd like to work on a detailed docstring PR for common.Extractor() and GalleryExtractor(). I think an explanation in there could help new developers understand what's going on and make it easier to spread the load of the extractor work. Something along the lines of the extractor comments in youtube-dl.
OK, I switched to working on another extractor and it's helping me see what I did wrong with my first pass at the documentation. So things are coming along in that department, though it is slow.
Where I could use help:
I have the Gallery extraction working, but only on the first page. How do I get it to recognize that there's a 2nd page and keep iterating?
I think I'm good with this now. I see a new PR for the same site I was working on (https://github.com/mikf/gallery-dl/pull/6016) and hunter-gatherer8 got the `pages()` method that answers my question.
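The page-iteration idea can be sketched without any gallery-dl internals (the HTML snippets and fetch helper below are invented; PR #6016 shows the real `pages()` method): keep fetching and yielding pages until there is no "next" link left to follow:

```python
import re

# Fake two-page site (invented HTML); a real extractor would fetch
# these over HTTP. The second page has no "next" link, so iteration stops.
SITE = {
    "/images/?page=1": '<a class="next" href="/images/?page=2">next</a> img1',
    "/images/?page=2": "img2",
}

def fetch(path):
    """Stand-in for an HTTP request."""
    return SITE[path]

def pages(start):
    """Yield each page's HTML, following the 'next' link until it is gone."""
    path = start
    while path:
        page = fetch(path)
        yield page
        m = re.search(r'class="next" href="([^"]+)"', page)
        path = m.group(1) if m else None   # stop when no next link exists

assert len(list(pages("/images/?page=1"))) == 2
```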
I will continue working through more debugging and documentation and check back later, thanks!
Nice, that's good to hear.
So where exactly would you dump json data for each item?
Replace `print(post["id"])` with `self._dump(post)` (https://gist.github.com/mikf/999147ca6c381a067c2d450ac3510ae9#file-boosty-py-L35) and maybe change the `continue` on the next line to a `return` or `exit()`.
Hey all!
I'd like to work on adding a new supported site. However, it's unclear to someone with my skill level how to do that.
I can write a web crawler, so am comfortable with using requests and BeautifulSoup, but don't know how to take that knowledge and integrate with the gallery-dl classes.
If someone would be willing to jot down some notes and/or answer some questions, I'd be happy to write the steps out long form so it could be added to the wiki.
When you find a new site and want to extend, what do you do first? What info do you need from the site to create an extractor?
If this is the wrong place to ask, feel free to let me know a better place!