mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Help in moving from kemono-dl to gallery-dl (implementation of features) #5446

Closed DrQuantum101 closed 6 months ago

DrQuantum101 commented 7 months ago

As the title suggests I'm moving from the antiquated kemono-dl to the better-upkept gallery-dl. I have run into two issues:

  1. Comment Extraction
  2. Link Extraction

The postprocessor I have set up so far is as follows:

            "postprocessors":[
                {
                    "name":"metadata",
                    "filename":"[{id}] Metadata.json"
                },
                {
            "name": "metadata",
            "filename": "[{id}] Links.txt",
            "filter": "embed.get('url') or re.search(r'(?i)(redgifs|atomicloli|gfycat|google|drive|onedrive|1drv|mega|xgf|k00|koofr|gigafile|mediafire|porn3dx|gofile|dropbox)', content)",
            "mode": "custom",
            "format": "{embed[url]:?/\n/}"
                },
                {
                    "name":"metadata",
                    "mode":"custom",
                    "filename":"[{id}] Details.html",
                    "extension":"html",
                    "format":"<h1 style='display: inline'><a href='https://kemono.su/{service}/user/{user}/post/{id}'>{title}</a></h1> by <a href='https://kemono.su/{service}/user/{user}'>{username}</a><div><br></div><div class='content'><b>Text Content:</b><br>{content}</div><br><hr><div class='content'><b>Poll:</b><br>{poll}</div><br><hr><div class='content'><b>Comments:</b><br>{comments}</div><br><div><hr><div class='tags'>[\"{tags:J\", \"}\"]</div><hr><div class='tags'>Number of Files: {count:0>3} | <a href='https://kemono.su/{service}/user/{user}/post/{prev}'>Previous [{prev}]</a> | <a href='https://kemono.su/{service}/user/{user}/post/{next}'>Next [{next}]</a></div><hr></div><div>Posted: {date:%Y-%m-%d} @ {date:%H:%M:%S}</div><div>Saved: {_now:%Y-%m-%d} @ {_now:%H:%M:%S}</div><br>\n\n"
                }
            ]

For comment extraction, using just the {comments} variable dumps the JSON data into the .html file unformatted. The link extractor was taken from #3644, which seems to no longer work. kemono-dl extracted href links and put them in a links.txt file, and formatted comments in HTML as such:

<div class="post__comments">
  <article class="comment" id="{comments[N]['id']}">
    <header class="comment__header">
      <a class="fancy-link fancy-link--local comment__name" href="#{comments[N]['id']}">{comments[N]['user']}</a>
    </header>
    <section class="comment__body">
      <p class="comment__message">{comments[N]['body']}</p>
    </section>
    <footer class="comment__footer">
      <time class="timestamp" datetime="{comments[N]['date']}">{comments[N]['date']}</time>
    </footer>
  </article>
</div>

I'm quite sure it is possible to implement these two features using the post-processor options of gallery-dl, but I am not well-versed enough in HTML or JSON to do so on my own. Any assistance will be greatly appreciated.

Hrxn commented 7 months ago

Where's that HTML template in your example from? It has to come from somewhere, but your config does not indicate anything..

DrQuantum101 commented 7 months ago

Where's that HTML template in your example from? It has to come from somewhere, but your config does not indicate anything..

The HTML template at the bottom is how kemono-dl displays comments in its generated content.html when the --content and --comments options are passed. I took the comment markup from a downloaded post I had saved and replaced the actual comment data with the corresponding variables of the comments array shown by gallery-dl's -K option.

So far I have been unable to implement the comment template in my config's post-processor. As you can see, the Details.html post-processor just puts in the entire {comments} data without formatting. Below is the original, unmodified comment from content.html:

<div class="post__comments">
 <article class="comment" id="126126417">
  <header class="comment__header">
   <a class="fancy-link fancy-link--local comment__name" href="#126126417">
    Anonymous
   </a>
  </header>
  <section class="comment__body">
   <p class="comment__message">
    WELL NOW I NEED THE LINK PERSPECTIVE AAA
   </p>
  </section>
  <footer class="comment__footer">
   <time class="timestamp" datetime="2024-01-04 14:07:18.497000">
    2024-01-04 14:07:18.497000
   </time>
  </footer>
 </article>
 </div>

Hrxn commented 7 months ago

You should set up your postprocessors for comment and link extraction like this, for example:

"postprocessors":[
    {
        "name":"metadata",
        "filename":"[{id}] Metadata.json"
    },
    {
        "name": "metadata",
        "event": "post",
        "filename": "[{id}] Links.txt",
        "filter": "embed.get('url') or re.search(r'(?i)(redgifs|atomicloli|gfycat|google|drive|onedrive|1drv|mega|xgf|k00|koofr|gigafile|mediafire|porn3dx|gofile|dropbox)', content)",
        "mode": "custom",
        "format": "{embed[url]}"
    },
    {
        "name":"metadata",
        "event": "post",
        "mtime": true,
        "mode": "custom",
        "filename":"[{id}] Details.html",
        "content-format": "\fT ~/gallery-dl/format/kemono-comment-template.html"
    }
]

The link extractor you've mentioned earlier should still work (maybe try it with "event": "post", like in the example above), and the important change I've made here is to use "content-format" with a special format string (\fT), so that the actual format used comes from the linked template file here.

The content of said template file can now be something like this (which is taken from the "format" setting in your initial comment, with some modifications):

<div>
<h1 style='display: inline'><a href='https://kemono.su/{service}/user/{user}/post/{id}'>{title}</a> by <a href='https://kemono.su/{service}/user/{user}'>{username}</a> </h1>
   <br>
</div>
<div class='content'><b>Text Content:</b>
    <br>{content}</div>
<br>
<hr>
<div class='content'><b>Poll:</b>
    <br>{poll}</div>
<br>
<hr>
<div class='content'><b>Comments:</b>
    <br>{comments}</div>
<br>
<div>
    <hr>
    <div class='tags'>[\"{tags:J\", \"}\"]</div>
    <hr>
    <div class='tags'>Number of Files: {count:0>3} | <a href='https://kemono.su/{service}/user/{user}/post/{prev}'>Previous [{prev}]</a> | <a href='https://kemono.su/{service}/user/{user}/post/{next}'>Next [{next}]</a></div>
    <hr>
</div>
<div>Posted: {date:%Y-%m-%d} @ {date:%H:%M:%S}</div>
<div>Saved: {_now:%Y-%m-%d} @ {_now:%H:%M:%S}</div>
<br><br>

You've already checked the output/available metadata with -K, so you know the name of the variables etc. But gallery-dl does not create any HTML for you, you have to set it up for yourself, however you like it. But you only have to do it once.

DrQuantum101 commented 7 months ago

This method makes it easier to edit the .html format, but unfortunately the issues remain unresolved. The link extractor is not extracting any links: {embed[url]} only contains 'None'. I will modify the regular expression to fix this and continue testing on other links. The problem mentioned before, where the raw {comments} content is pasted in unformatted, also remains.

I could fix the comments .html if I knew how to use the metadata variables properly. {comments} is the entire block of data containing every comment, so the issue I'm facing is to be expected. I tried {comments[N]['body']} and the other related fields that store each part of a comment, but the syntax is not proper. If I wanted each comment to be:

{comments[N]['user']} - #{comments[N]['id']}
{comments[N]['body']}
{comments[N]['date']}

I would have to assign N manually, which is impossible for this. If this were normal Python code, I would increment N until the N value of the comment array has been exhausted (along the lines of below), but I'm not sure if that is possible.

for N, comment in enumerate(comments, start=0):
    print(f"{comments[N]['user']} - #{comments[N]['id']}")
    print(comments[N]['body'])
    print(comments[N]['date'])
    print("\n")
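As a side note on the syntax problem described above: gallery-dl's documentation describes its format strings as following the general rules of Python's str.format(), so plain Python can illustrate the two obstacles. The sketch below uses only str.format() and made-up sample data; gallery-dl's own formatter adds extras and may treat these cases differently, so take it as an illustration of the standard rules rather than of gallery-dl's behavior. render_comments is a hypothetical helper.

```python
# Sample data shaped like the "comments" list that -K shows (values are made up).
metadata = {
    "comments": [
        {"id": "1", "user": "Anonymous", "body": "first", "date": "2024-01-04"},
        {"id": "2", "user": "Someone", "body": "second", "date": "2024-01-05"},
    ]
}

# Standard format-string indexing: a numeric index, then an UNQUOTED key.
first_body = "{comments[0][body]}".format(**metadata)

# Quoting the key makes str.format look up the literal key "'body'",
# which does not exist and raises KeyError.
try:
    "{comments[0]['body']}".format(**metadata)
    quoted_key_works = True
except KeyError:
    quoted_key_works = False


def render_comments(comments):
    # A format string has no loop construct, so covering every comment
    # needs ordinary code around it.
    line = "{c[user]} - #{c[id]}\n{c[body]}\n{c[date]}"
    return "\n\n".join(line.format(c=c) for c in comments)
```

The index has to be a literal number inside the format string, which is why a fixed template alone cannot cover a comment list of unknown length.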

mikf commented 7 months ago

If this were normal Python code

There is a python post processor, which allows you to call a normal Python function with the metadata dict as argument:

    {
        "name": "python",
        "event": "post",
        "function": "gdl_utils:kemono_comments"
    }

def kemono_comments(metadata):
    comments = []

    for comment in metadata["comments"]:
        comments.append(f"""\
<div class="post__comments">
 <article class="comment" id="{comment["id"]}">
  <header class="comment__header">
   <a class="fancy-link fancy-link--local comment__name" href="#{comment["id"]}">
    {comment["user"]}
   </a>
  </header>
  <section class="comment__body">
   <p class="comment__message">
    {comment["body"]}
   </p>
  </section>
  <footer class="comment__footer">
   <time class="timestamp" datetime="{comment["date"]}">
    {comment["date"]}
   </time>
  </footer>
 </article>
 </div>
""")

    with open(…) as fp:
        fp.write("\n".join(comments))

You could do something similar for extracting links from metadata["content"].
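A minimal sketch of what that could look like, assuming metadata["content"] holds the post body as HTML and that collecting the href values of its <a> tags is enough. It uses only the standard library's html.parser; the names HrefCollector, extract_hrefs, and kemono_links are hypothetical, and where the result gets written is left to the caller.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags in document order, without duplicates."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value not in self.links:
                self.links.append(value)


def extract_hrefs(html):
    # Treat a missing or empty content field as "no links".
    collector = HrefCollector()
    collector.feed(html or "")
    collector.close()
    return collector.links


def kemono_links(metadata):
    # Hypothetical helper: returns the text that would go into a links file,
    # one URL per line.
    return "\n".join(extract_hrefs(metadata.get("content")))
```

This only finds links that appear as anchors; URLs pasted as plain text in the post body would need a separate pass.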

DrQuantum101 commented 7 months ago

Perfect, thank you! This is the solution I've been looking for. I'm glad that I decided to move over entirely to this tool. I have left my python script and kemono config for anyone needing this thread in the future.

Kemono Config:

        "kemonoparty":{
            "cookies":[
                "firefox"
            ],
            "comments": "true",
            "dms": "true",
            "announcements": "true",
            "metadata": "true",
            "archive-format":"{service}_{user}_{id}_{title}_{num}_{hash}",
            "archive":"A:/Miscalaneous/Backup Files/Homework/Lenny Face/Images & Comics/gallery-dl/Archives/kemono-archive.sqlite3",
            "base-directory":"A:/Miscalaneous/Backup Files/Homework/Lenny Face/Images & Comics/gallery-dl/Downloads/kemonoparty",
            "directory":[
                "{subcategory}",
                "{username} [{user}]",
                "[{id}] {title}"
            ],
            "filename":"[{id}] File_{num:0>3} - {filename}.{extension}",
            "retries":-1,
            "retry-codes":[
                429,
                430
            ],
            "discord":{
                "#":"discord-specific settings",
                "archive-format":"{subcategory}_{server}_{channel}_{id}_{num}_{hash}",
                "archive":"A:/Miscalaneous/Backup Files/Homework/Lenny Face/Images & Comics/gallery-dl/Archives/kemono-discord-archive.sqlite3",
                "directory":[
                    "{subcategory}",
                    "Server #{server}",                    
                    "[{channel}] {channel_name[:25]} ",
                    "[{date!s:.10}] [{id}]"
                ],
                "filename":"[{id}] {filename[:10]}_{num:0>3}.{extension}",
                "postprocessors":[
                    {
                        "name":"metadata",
                        "mode":"custom",
                        "filename":"[{id}] Details.html",
                        "extension":"html",
                        "format":"<h1 style='display: inline'><a href='https://kemono.su/discord/server/{server}#{channel}'>Post of type [{type}] in [#{channel_name}]</a></h1> posted by <a href='https://kemono.su/discord/server/{server}#{channel}'>{author[username]}</a><div><br></div><div class='content'><b>Text:</b><br>{content}</div><br><div><hr><div class='tags'>[User #{author['id']} || Server #{server} || Channel #{channel} || Post #{id}]</div><hr></div><div>Posted: {date:%Y-%m-%d} @ {date:%H:%M:%S}</div><div>Saved: {_now:%Y-%m-%d} @ {_now:%H:%M:%S}</div><br>\n\n"
                    },
                    {
                        "name":"metadata",
                        "filename":"[{id}] Metadata.json"
                    }
                ]
            },
            "postprocessors":[
                {
                    "name":"metadata",
                    "filename":"[{id}] Metadata.json"
                },
                {
                    "name": "python",
                    "event": "post-after",
                    "function": "./gdl_utils:kemono_details"
                }
            ]
        }

gdl_utils.py (located next to gallery-dl.exe):

def kemono_details(metadata):
    from datetime import datetime
    import re
    import sys

    sys.path.insert(1, './')

    import html2text
    text_maker = html2text.HTML2Text()
    text_maker.ignore_links = True
    text_maker.protect_links = True

    def extract_links(text_block):
        # Find all URLs in the text block
        urls = re.findall(r'http[s]?://(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,7}(?:\/\S*)*(?:#[^\s]*)?', text_block)
        # Remove any HTML tags from the URLs
        urls = [re.sub(r'<[^>]*>', '', url) for url in urls]
        # Remove leading and trailing quotations
        urls = [url.strip('"') for url in urls]

        # Remove duplicates
        unique_urls = set(urls)

        formatted_links = []
        for link in unique_urls:
            formatted_links.append(f'<a href="{link}" rel="nofollow noopener" target="_blank">{link}</a><br>')

        formatted_links_html = ''.join(formatted_links)
        return formatted_links_html

    formatted_embeds_html = f'<a href="{metadata["embed"].get("url")}" rel="nofollow noopener" target="_blank">{metadata["embed"].get("url")}</a><br>'

    current_time = datetime.now()
    formatted_date = current_time.strftime('%Y-%m-%d') 
    formatted_time = current_time.strftime('%H:%M:%S') 

    def kemono_comments(metadata):
        comments = []

        for comment in metadata["comments"]:
            comments.append(f"""\
    <div class="post__comments">
    <article class="comment" id="{comment["id"]}">
    <header class="comment__header">
    <a class="fancy-link fancy-link--local comment__name" href="#{comment["id"]}">
        {comment["user"]}
    </a>
    &nbsp;&nbsp;<font COLOR="#FF0000">#{comment["id"]}</font> 
    </header>
    <section class="comment__body">
    <p class="comment__message">
        {comment["body"]}
    </p>
    </section>
    <footer class="comment__footer">
    <time class="timestamp" datetime="{comment["date"]}">
        <font COLOR="#01a049">{comment["date"]}</font>
    </time>
    </footer>
    </article>
    </div><br><br>
    """)

        comment_joined = "\n".join(comments)
        return comment_joined

    html_content = f"""\
<h1 style='display: inline'>
  <a href='https://kemono.su/{metadata["service"]}/user/{metadata["user"]}/post/{metadata["id"]}'>{metadata["title"]}</a>
</h1> by <a href='https://kemono.su/{metadata["service"]}/user/{metadata["user"]}'>{metadata["username"]} [{metadata["subcategory"].capitalize()}]</a>
<div>
  <br>
</div>
<div class='content'>
  <b>Text Content:</b>
  <br><br>{metadata["content"]}
</div>
<br>
<hr>
<div class='content'>
  <b>Links (Text):</b>
  <br><br>{extract_links(metadata["content"])}
  <br>
  <b>Links (Embeds):</b>
  <br><br>{formatted_embeds_html}
</div>
<br>
<hr>
<div class='content'>
  <b>Poll:</b>
  <br><br>{metadata["poll"]}
</div>
<br>
<hr>
<div class='content'>
  <b>Comments:</b>
  <br><br>{kemono_comments(metadata)}
</div>
<br>
<div>
  <hr>
  <div class='tags'>Number of Files: {metadata["count"]:0>3} | <a href='https://kemono.su/{metadata["service"]}/user/{metadata["user"]}/post/{metadata["prev"]}'>Previous [{metadata["prev"]}]</a> | <a href='https://kemono.su/{metadata["service"]}/user/{metadata["user"]}/post/{metadata["next"]}'>Next [{metadata["next"]}]</a>
  </div>
  <hr>
</div>
<div>Posted: {metadata["date"]:%Y-%m-%d} @ {metadata["date"]:%H:%M:%S}</div>
<div>Saved: {formatted_date} @ {formatted_time}</div>
<br>\n\n
"""

    with open(f"./Downloads/kemonoparty/{metadata['subcategory']}/{metadata['username']} [{metadata['user']}]/[{metadata['id']}] {metadata['title']}/[{metadata['id']}] Details.html", "w", encoding="utf-8") as fp:
      fp.write(html_content)

    with open(f"./Downloads/kemonoparty/{metadata['subcategory']}/{metadata['username']} [{metadata['user']}]/[{metadata['id']}] {metadata['title']}/[{metadata['id']}] Details.txt", "w", encoding="utf-8") as f:
      f.write(text_maker.handle(html_content))

a84r7a3rga76fg commented 6 months ago

Doesn't work.

pip install html2text
Requirement already satisfied: html2text in c:\users\Administrator\appdata\local\programs\python\python312\lib\site-packages (2024.2.26)
            "postprocessors": [
            {
                "name":"metadata",
                "filename":"[{id}] Metadata.json"
            },
            {
                "name": "python",
                "event": "post-after",
                "function": "C:/Users/Administrator/gallery-dl/gdl_utils.py:kemono_details"
            }

[kemonoparty][error] An unexpected error occurred: ModuleNotFoundError - No module named 'html2text'. Please run gallery-dl again with the --verbose flag, copy its output and report this issue on https://github.com/mikf/gallery-dl/issues .
Traceback (most recent call last):
  File "__main__.py", line 20, in <module>
  File "gallery_dl\__init__.pyc", line 277, in main
  File "gallery_dl\job.pyc", line 158, in run
  File "gallery_dl\job.pyc", line 433, in handle_finalize
  File "gallery_dl\postprocessor\python.pyc", line 36, in run
  File "C:\Users/Administrator/gallery-dl\gdl_utils.py", line 8, in kemono_details
    import html2text

This also didn't work:

            "postprocessors": [
            {
                "name":"metadata",
                "filename":"[{id}] Metadata.json"
            },
            {
                "name": "python",
                "event": "post-after",
                "function": "./gdl_utils:kemono_details"
            }

[postprocessor][error] 'python' initialization failed:  ModuleNotFoundError: No module named 'gdl_utils'
[postprocessor][debug]
Traceback (most recent call last):
  File "gallery_dl\job.pyc", line 583, in initialize
  File "gallery_dl\postprocessor\python.pyc", line 22, in __init__
  File "gallery_dl\util.pyc", line 599, in import_file
ModuleNotFoundError: No module named 'gdl_utils'
[kemonoparty][debug] Active postprocessor modules: [MetadataPP, MetadataPP, ExecPP]

DrQuantum101 commented 6 months ago

Hey, I am working on a better version of the code to extract links using the urlextract library instead of regex, but if you want to do tinkering yourself, I'll explain how to solve each issue.

  1. For some reason, the Python post-processor has an issue accessing the PATH to figure out where pip packages are. To solve this, add sys.path.insert(1, './') to the top of the function, before the imports.

This is fine if you're only going to use html2text, because it doesn't have dependencies, but with other packages it can get messy fast, so I'm in the process of testing and switching to the method below.

  2. To solve the missing gdl_utils module error, make sure gdl_utils.py is in the same directory as gallery-dl.exe.

Below are the so far UNTESTED new post-processor code and calls I am using:

gdl_utils.py

def kemono_details(metadata):
    from datetime import datetime
    import re
    import sys
    import os

    # Replace username with proper directory path
    sys.path.insert(1, 'C:/Users/{username}/AppData/Local/Programs/Python/Python311/Lib/site-packages')
    sys.path.insert(2, 'C:/Users/{username}/AppData/Local/Programs/Python/Python311/Lib')

    import html2text
    from urlextract import URLExtract

    extractor = URLExtract()

    text_maker = html2text.HTML2Text()
    text_maker.ignore_links = True
    text_maker.protect_links = True
    text_maker.wrap_links = False

    link_finder = html2text.HTML2Text()
    link_finder.protect_links = True
    link_finder.wrap_links = False

    def extract_links(text_block):

        content_links = link_finder.handle(text_block)
        content_text = text_maker.handle(text_block)
        # print (content_text)

        # Find all URLs in both renderings of the text block (unwanted hosts are filtered out below)
        urls_1 = extractor.find_urls(content_links)
        urls_2 = extractor.find_urls(content_text)
        urls = urls_1 + urls_2

        # Remove duplicates
        unique_urls = set(urls)

        # Alpha Sort
        unique_urls = sorted(unique_urls)

        # Define patterns to filter out
        filters = ['https://downloads.fanbox.cc/', 
                   'https://cdn.discordapp.com/']

        unique_urls = [url for url in unique_urls if not any(filter in url for filter in filters)]

        if not unique_urls:
            unique_urls = ["None"]
        else:
            unique_urls = list(unique_urls)

        return unique_urls

    formatted_links = []
    text_links = extract_links(metadata["content"])
    for link in text_links:
        formatted_links.append(f'<a href="{link}" rel="nofollow noopener" target="_blank">{link}</a><br>')
    formatted_links_html = ''.join(formatted_links)

    embed_links = metadata["embed"].get("url")
    formatted_embed_html = f'<a href="{embed_links}" rel="nofollow noopener" target="_blank">{embed_links}</a><br>'

    current_time = datetime.now()
    formatted_date = current_time.strftime('%Y-%m-%d') 
    formatted_time = current_time.strftime('%H:%M:%S') 

    def kemono_comments(metadata):
        comments = []

        for comment in metadata["comments"]:
            comments.append(f"""\
              <div class="post__comments">
              <article class="comment" id="{comment["id"]}">
              <header class="comment__header">
              <a class="fancy-link fancy-link--local comment__name" href="#{comment["id"]}">
                  {comment["user"]}
              </a>
              &nbsp;&nbsp;<font COLOR="#FF0000">#{comment["id"]}</font> 
              </header>
              <section class="comment__body">
              <p class="comment__message">
                  {comment["body"]}
              </p>
              </section>
              <footer class="comment__footer">
              <time class="timestamp" datetime="{comment["date"]}">
                  <font COLOR="#01a049">{comment["date"]}</font>
              </time>
              </footer>
              </article>
              </div><br><br>
              """)

        comment_joined = "\n".join(comments)
        return comment_joined

    html_textcontent = metadata["content"].replace("\n", "<br>")
    html_content = (f"""\
      <h1 style='display: inline'>
        <a href='https://kemono.su/{metadata["service"]}/user/{metadata["user"]}/post/{metadata["id"]}'>{metadata["title"]}</a>
      </h1> by <a href='https://kemono.su/{metadata["service"]}/user/{metadata["user"]}'>{metadata["username"]} [{metadata["subcategory"].capitalize()}]</a>
      <div>
        <br>
      </div>
      <div class='content'>
        <b>Text Content:</b>
        <br><br>{html_textcontent}
      </div>
      <br>
      <hr>
      <div class='content'>
        <b>Links (Text):</b>
        <br><br>{formatted_links_html}
        <br>
        <b>Links (Embeds):</b>
        <br><br>{formatted_embed_html}
      </div>
      <br>
      <hr>
      <div class='content'>
        <b>Comments:</b>
        <br><br>{kemono_comments(metadata)}
      </div>
      <br>
      <div>
        <hr>
        <div class='tags'>Number of Files: {metadata["count"]:0>3}
        </div>
        <hr>
      </div>
      <div>Posted: {metadata["date"]:%Y-%m-%d} @ {metadata["date"]:%H:%M:%S}</div>
      <div>Saved: {formatted_date} @ {formatted_time}</div>
      <br>\n\n
      """)

    username = metadata['username'].replace("/", "_")

    directory = (f"./Downloads/kemonoparty/{metadata['subcategory']}/{username} [{metadata['user']}]/[{metadata['id']}]")

    with open(f"{directory}/[{metadata['id']}] Details.html", "w", encoding="utf-8") as fp:
      fp.write(html_content)

    with open(f"{directory}/[{metadata['id']}] Details.txt", "w", encoding="utf-8") as f:
      f.write(text_maker.handle(html_content))

    if text_links != ["None"] or embed_links is not None:
        links_html_content = (f"""\
            <hr>
            <h1 style='display: inline'>
                <a href='https://kemono.su/{metadata["service"]}/user/{metadata["user"]}/post/{metadata["id"]}'>Links in Post #{metadata["id"]}</a>
            </h1>
            <div>
            </div>
            <div class='content'>
                <b>Links (Text):</b>
                <br><br>{formatted_links_html}
                <br>
                <b>Links (Embeds):</b>
                <br><br>{formatted_embed_html}
            </div>
            <br>
            <hr>
        """)

        with open(f"{directory}/[{metadata['id']}] Links.txt", "w", encoding="utf-8") as f:
            f.write(text_maker.handle(links_html_content))

def kemono_DMs(metadata):
    from datetime import datetime
    import sys

    sys.path.insert(1, './')

    import html2text
    text_maker = html2text.HTML2Text()
    text_maker.ignore_links = True
    text_maker.protect_links = True

    current_time = datetime.now()
    formatted_date = current_time.strftime('%Y-%m-%d') 
    formatted_time = current_time.strftime('%H:%M:%S')

    def kemono_dm_extractor(metadata):
        dms = []
        count = 1
        for dm in metadata["dms"]:
            dm_body_html = dm["body"].replace("\n", "<br>")
            dms.append(f"""\
              <div class="user__dms">
              <article class="dm" id="DM #{count}">
              <header class="dm__header">
              <a class="fancy-link fancy-link--local comment__name" href="#{count}">
                  DM #{count}
              </a>
              </header>
              <section class="dm__body">
              <p class="dm__message">
                  {dm_body_html}
              </p>
              </section>
              <footer class="dm__footer">
              <time class="timestamp" datetime="{dm["date"]}">
                  <font COLOR="#01a049">Published: {dm["date"]}</font>
              </time>
              </footer>
              </article>
              </div><br><br>
              """)
            count += 1

        dms_joined = "\n".join(dms)
        return dms_joined

    html_content = (f"""\
      <h1 style='display: inline'>
        <a href='https://kemono.su/{metadata["service"]}/user/{metadata["user"]}'>{metadata["username"].capitalize()}'s DMs</a>
      </h1>
      <div>
        <br>
      </div>
      <div class='content'>
        <b>DMs:</b>
        <br><br>{kemono_dm_extractor(metadata)}
      </div>
      <br>
      <hr>
      <div>Saved: {formatted_date} @ {formatted_time}</div>
      <br>\n\n
      """)

    username = metadata['username'].replace("/", "_")

    directory = (f"./Downloads/kemonoparty/{metadata['subcategory']}/{username} [{metadata['user']}]")

    with open(f"{directory}/[{metadata['user']}] DMs.html", "w", encoding="utf-8") as fp:
      fp.write(html_content)

    with open(f"{directory}/[{metadata['user']}] DMs.txt", "w", encoding="utf-8") as f:
      f.write(text_maker.handle(html_content))

gallery-dl.conf

            "postprocessors":[
                {
                    "name":"metadata",
                    "event": "post",
                    "filename":"[{id}] Metadata.json"
                },
                {
                    "name":"metadata",
                    "event": "post",
                    "filename":"[{id}] {title}.title.txt"
                },
                {
                    "name": "python",
                    "event": "post",
                    "function": "./gdl_utils:kemono_details"
                },
                {
                    "name": "python",
                    "event": "finalize",
                    "function": "./gdl_utils:kemono_DMs"
                }
            ]

a84r7a3rga76fg commented 6 months ago

Getting this error with your update: [postprocessor][error] 'python' initialization failed: ModuleNotFoundError: No module named 'gdl_utils'.

I think it's because I'm using a different directory.

DrQuantum101 commented 6 months ago

Please tell me the directory where gdl_utils.py is placed and where your gallery-dl.exe and gallery-dl.conf are located. All three files are in the same directory for me. If your configuration file is still in the default location, the code will not work; all of the aforementioned files must be in the same directory.

a84r7a3rga76fg commented 6 months ago

No need for that. I found out that I have to use separate configs for DMs, files, and comments so that the same information isn't downloaded on every run.

DrQuantum101 commented 6 months ago

I will reopen this issue because this pip package finding issue is causing a lot of problems for me.

TO RECAP

For some reason, the Python post-processor has an issue accessing the PATH to find where pip packages are. To solve this:

I added sys.path.insert(1, './') to the top of the function before imports. This adds the gallery-dl.exe directory to the Python PATH to search for library packages.
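One detail worth checking here: a relative entry such as './' is resolved against the current working directory at import time, which only coincides with the gallery-dl.exe directory when gallery-dl is started from there. The sketch below is one possible way to anchor the entry to the script's own location instead; add_dir_of is a hypothetical helper, and it does not address the interpreter mismatch that comes up later in this thread.

```python
import os
import sys


def add_dir_of(file_path):
    """Put the directory containing file_path on sys.path and return it."""
    directory = os.path.dirname(os.path.abspath(file_path))
    # Avoid stacking duplicate entries when the function runs for every post.
    if directory not in sys.path:
        sys.path.insert(1, directory)
    return directory


# Inside gdl_utils.py this would be called as add_dir_of(__file__),
# before importing anything that lives next to the script.
```

Because the path is absolute, it keeps pointing at the same folder regardless of where the command is launched from.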

This was fine when I was only using html2text, because it doesn't have dependencies, but with other packages it got messy fast, so I'm in the process of testing and switching to the method below.

I then added

    sys.path.insert(1, 'C:/Users/ADMIN/AppData/Local/Programs/Python/Python311/Lib/site-packages')
    sys.path.insert(2, 'C:/Users/ADMIN/AppData/Local/Programs/Python/Python311/Lib')

to the top of the function. This should have added the pip package directory and the default library directory to the list of directories searched on import, so that I can use the urlextract library with its dependencies.

This is the --verbose output:

[gallery-dl][debug] Version 1.26.8 - Executable
[gallery-dl][debug] Python 3.8.10 - Windows-10-10.0.22631
[gallery-dl][debug] requests 2.31.0 - urllib3 2.1.0
[gallery-dl][debug] Configuration Files ['FILEPATH'] # Filepath Obscured for Sensitivity
[gallery-dl][debug] Starting DownloadJob for 'https://kemono.su/patreon/user/3295915/post/101552668'
[kemonoparty][debug] Using KemonopartyPostExtractor for 'https://kemono.su/patreon/user/3295915/post/101552668'
[cookies][debug] Extracting cookies from C:\Users\ADMIN\AppData\Roaming\Mozilla\Firefox\Profiles\env16i3q.default-release-1713214542371\cookies.sqlite
[cookies][info] Extracted 1806 cookies from Firefox
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): kemono.su:443
[urllib3.connectionpool][debug] https://kemono.su:443 "GET /patreon/user/3295915 HTTP/1.1" 200 12828
[urllib3.connectionpool][debug] https://kemono.su:443 "GET /api/v1/patreon/user/3295915/post/101552668 HTTP/1.1" 200 1257
[urllib3.connectionpool][debug] https://kemono.su:443 "GET /patreon/user/3295915/post/101552668 HTTP/1.1" 200 24901
[urllib3.connectionpool][debug] https://kemono.su:443 "GET /patreon/user/3295915/dms HTTP/1.1" 200 None
[kemonoparty][debug] Skipping /d1/f9/d1f986bbd6e78df6a958d23efc88259b3c44c44fa03d3335911257a3c2a5cc6c.png (duplicate)
[kemonoparty][debug] Using download archive 'A:/Miscalaneous/Backup Files/Homework/Lenny Face/Images & Comics/gallery-dl/Archives/kemono-archive.sqlite3'
[kemonoparty][debug] Active postprocessor modules: [MetadataPP, MetadataPP, PythonPP, PythonPP]
[kemonoparty][error] An unexpected error occurred: ModuleNotFoundError - No module named 'platformdirs'. Please run gallery-dl again with the --verbose flag, copy its output and report this issue on https://github.com/mikf/gallery-dl/issues .
[kemonoparty][debug]
Traceback (most recent call last):
  File "gallery_dl\job.pyc", line 128, in run
  File "gallery_dl\job.pyc", line 175, in dispatch
  File "gallery_dl\job.pyc", line 353, in handle_directory
  File "gallery_dl\postprocessor\python.pyc", line 36, in run
  File "A:\Miscalaneous\Backup Files\Homework\Lenny Face\Images & Comics\gallery-dl\.\gdl_utils.py", line 11, in kemono_details
    from urlextract import URLExtract
  File "C:\Users/ADMIN/AppData/Local/Programs/Python/Python311/Lib/site-packages\urlextract\__init__.py", line 1, in <module>
    from .urlextract_core import URLExtract, _urlextract_cli, __version__
  File "C:\Users/ADMIN/AppData/Local/Programs/Python/Python311/Lib/site-packages\urlextract\urlextract_core.py", line 25, in <module>
    from urlextract.cachefile import CacheFile, CacheFileError
  File "C:\Users/ADMIN/AppData/Local/Programs/Python/Python311/Lib/site-packages\urlextract\cachefile.py", line 22, in <module>
    from platformdirs import user_cache_dir
ModuleNotFoundError: No module named 'platformdirs'

I noticed that the Python version is Python 3.8.10, which is a mismatch to the Python 3.11 I have on my Windows 11 system. The post-processor also has issues accessing default libraries such as dataclasses.py if its directory is not passed through using sys.path.insert.

So now the issue is that the Python used by the gallery-dl post-processor is seemingly unable to pass the library location info to the urlextract module, which in turn makes urlextract unable to find the platformdirs dependency, which does exist in the pip packages directory.

mikf commented 6 months ago

I noticed that the Python version is Python 3.8.10, which is a mismatch to the Python 3.11 I have

… and that's the entire reason why. Importing packages from a different Python environment, let alone from a different Python version, obviously does not work properly.

You need to install gallery-dl with pip into the same environment as the packages you wish to import. You shouldn't be using the .exe version if you want to access Python modules outside the bundled ones.
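To check which interpreter a post-processor function actually runs under, a small diagnostic along these lines may help. It is a sketch using only the standard library; describe_environment and report_environment are hypothetical names, and the "frozen" flag relies on the sys.frozen attribute that common executable bundlers set, which is an assumption about how the .exe was built.

```python
import importlib.util
import sys


def describe_environment(module_names):
    """Report which interpreter is running and which modules it can find."""
    return {
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "executable": sys.executable,
        # Bundled executables usually set sys.frozen; plain installs do not.
        "frozen": bool(getattr(sys, "frozen", False)),
        "modules": {
            name: importlib.util.find_spec(name) is not None
            for name in module_names
        },
    }


def report_environment(metadata):
    # Hypothetical post-processor entry point: it ignores the metadata
    # and only prints the report.
    print(describe_environment(["html2text", "urlextract", "platformdirs"]))
```

If the reported version differs from the Python that pip installed the packages into, the imports are coming from a mismatched environment, which is the situation described above.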

DrQuantum101 commented 6 months ago

Solved: I should not have held onto the .exe version for as long as I did.
