jklmli / manga_downloader

Cross-platform, multi-site, multi-threaded manga downloader with over 5000 distinct mangas. Includes support for automated downloading via external .xml file and conversion for viewing on the Kindle.
MIT License
270 stars, 53 forks

Fixes #87: Update the re_getImage pattern to fetch the image URL for MangaFox #88

Closed: CharlieCorner closed this 7 years ago

CharlieCorner commented 7 years ago

This fixes #87

The current pattern we have for MangaFox is:

re_getImage = re.compile('"><img src="([^"]*)"')

But on the actual page, this is what the tag for the page image looks like. Notice the newline between the closing > of the a tag and the < of the img tag, which the current pattern does not allow for:

<div class="read_img"><a href="7.html" onclick="return enlarge()">
    <img src="http://h.mfcdn.net/store/manga/9/73-670.0/compressed/s001.jpg?token=372bb2d203787196b834b3c04d819077&ttl=1482973200" width="728" id="image" alt="Bleach 670: The Perfect Crimson at MangaFox.me"/>
            </a></div>              <div id="MarketGid9463" class="news-block-magick"><center><a href="http://mgid.com/" target="_blank">Loading...</a>
    </center></div>
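
To make the mismatch concrete, here is a minimal sketch that runs the current pattern against a trimmed copy of the markup above. The variable names `re_getImage_old`, `page_html`, and `old_match` are only for illustration, and the URL query string and the trailing ad block are omitted for brevity.

```python
import re

# The current pattern: requires '">' to sit directly before '<img'.
re_getImage_old = re.compile('"><img src="([^"]*)"')

# Trimmed copy of the quoted markup (query string and ad block dropped).
page_html = (
    '<div class="read_img"><a href="7.html" onclick="return enlarge()">\n'
    '    <img src="http://h.mfcdn.net/store/manga/9/73-670.0/compressed/s001.jpg"'
    ' width="728" id="image" alt="Bleach 670: The Perfect Crimson at MangaFox.me"/>\n'
    '            </a></div>'
)

# The newline and indentation after '">' break the literal '"><img' sequence,
# so search() finds nothing.
old_match = re_getImage_old.search(page_html)
print(old_match)
```

Since the pattern has no allowance for whitespace between the two tags, any reformatting of the page on MangaFox's side is enough to break it.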

We're now searching for img tags that have id="image", which is what MangaFox uses to identify the page image on its site.
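
A rough sketch of what an id-based pattern could look like follows. The exact pattern in the commit is not reproduced in this description, so the regex below is an assumption for illustration only. It also assumes that `src` comes before `id` inside the tag, as it does in the markup quoted above.

```python
import re

# Hypothetical pattern (not necessarily the one in the commit):
# capture src from an <img> tag that also carries id="image".
# Assumes src precedes id within the tag.
re_getImage = re.compile(r'<img\s+src="([^"]*)"[^>]*\bid="image"')

# Trimmed copy of the quoted markup (query string and ad block dropped).
page_html = (
    '<div class="read_img"><a href="7.html" onclick="return enlarge()">\n'
    '    <img src="http://h.mfcdn.net/store/manga/9/73-670.0/compressed/s001.jpg"'
    ' width="728" id="image" alt="Bleach 670: The Perfect Crimson at MangaFox.me"/>\n'
    '            </a></div>'
)

match = re_getImage.search(page_html)
image_url = match.group(1) if match else None
print(image_url)
```

Anchoring on the id rather than on the characters preceding the tag means whitespace changes around the img element no longer matter, although a change in attribute order would still need a different pattern.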

jklmli commented 7 years ago

Nice! I've long suspected that some of the regexes have grown stale. This is something good (and working >.>) CI would catch - kicking off a run twice a week can easily prevent this.

CharlieCorner commented 7 years ago

By the way, commit 76f7ad4, which is also included in this pull request, fixes #89. I forgot to mention this in the original post.