kemayo / leech

Turn a story on certain websites into an ebook for convenient reading
MIT License

Can't filter for only chapter with "None" #66

Closed by Victor239 3 years ago

Victor239 commented 3 years ago

I'm trying to extract Pale, but it fails: it first tries to extract the Twitter and Facebook share links as chapters, then requests a "None" page:

[sites.arbitrary] Extracting chapter @ https://palewebserial.wordpress.com/2021/08/31/summer-break-13-9/
[sites.arbitrary] Extracting chapter @ https://palewebserial.wordpress.com/table-of-contents/?share=twitter
[sites.arbitrary] Extracting chapter @ https://palewebserial.wordpress.com/table-of-contents/?share=facebook
[sites.arbitrary] Extracting chapter @ https://palewebserial.wordpress.com/table-of-contents/None
[sites] Load failed: waiting 10 to retry (404: https://palewebserial.wordpress.com/table-of-contents/None)
[sites] Load failed: waiting 10 to retry (404: https://palewebserial.wordpress.com/table-of-contents/None)
[sites] Load failed: waiting 10 to retry (404: https://palewebserial.wordpress.com/table-of-contents/None)
[__main__] ("Couldn't fetch", 'https://palewebserial.wordpress.com/table-of-contents/None')
[__main__] No ebook created

I followed the example here to try to exclude "None" with "chapter_selector": "#main .entry-content > p > a:not([href*=None])", but that skipped 95% of the existing chapters.

kemayo commented 3 years ago

Try this:

{
    "url": "https://palewebserial.wordpress.com/table-of-contents/",
    "title": "Pale",
    "author": "Wildbow",
    "chapter_selector": "article .entry-content > p a",
    "content_selector": "article .entry-content",
    "filter_selector": ".sharedaddy, style, a[href*='palewebserial.wordpress.com']"
}

Victor239 commented 3 years ago

That works great, thank you!
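
For anyone hitting the same log output, here is a minimal sketch of one way a URL ending in "/None" can come about. It assumes the scraper turns a missing href into the string "None" before resolving it against the table-of-contents URL; that is a guess, not confirmed leech behavior. The sample markup, `AnchorCollector`, and `resolve_links` are hypothetical names made up for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

TOC_URL = "https://palewebserial.wordpress.com/table-of-contents/"

# Hypothetical markup: one real chapter link, one share widget link,
# and one anchor that has no href attribute at all.
SAMPLE_HTML = """
<div class="entry-content">
  <p><a href="https://palewebserial.wordpress.com/2021/08/31/summer-break-13-9/">13.9</a></p>
  <div class="sharedaddy"><a href="?share=twitter">Twitter</a></div>
  <p><a name="arc-14">Arc 14</a></p>
</div>
"""


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag (None when the attribute is absent)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def resolve_links(html, base_url, skip_missing=False):
    """Resolve every anchor against base_url.

    With skip_missing=False a missing href is stringified (assumed behavior),
    so it resolves to base_url + "None".
    """
    collector = AnchorCollector()
    collector.feed(html)
    urls = []
    for href in collector.hrefs:
        if href is None and skip_missing:
            continue
        urls.append(urljoin(base_url, str(href)))
    return urls


if __name__ == "__main__":
    for url in resolve_links(SAMPLE_HTML, TOC_URL):
        print(url)
```

If that assumption holds, it would also explain why `a:not([href*=None])` cannot exclude the bad entry: `[href*=None]` tests the text of the attribute, so an anchor with no href at all never matches it and still passes the `:not(...)`. A presence test such as `a[href]` only matches anchors that actually carry the attribute.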