mmagnus / Pocket-Plus-Calibre-Plugin

📚 Modified version of the Calibre plugin for Pocket. Now, you get your articles organized by your Pocket tags, and more!

v2.x.x feedback needed #18

Closed mmagnus closed 3 years ago

mmagnus commented 4 years ago

@alvaroreig @Monirzadeh @TheKiteRunning the plugin had a few problems, but they are fixed now. I hope you will all enjoy it more.

Monirzadeh commented 4 years ago

Sometimes I get many errors like the one below:

Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 410, in process_images
    data = self.fetch_url(iurl)
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 293, in fetch_url
    raise err
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 274, in fetch_url
    with closing(open_func(url, timeout=self.timeout)) as f:
  File "/usr/lib/python3/dist-packages/mechanize/_mechanize.py", line 241, in open_novisit
    return self._mech_open(
  File "/usr/lib/python3/dist-packages/mechanize/_mechanize.py", line 287, in _mech_open
    response = UserAgentBase.open(self, request, data)
  File "/usr/lib/python3/dist-packages/mechanize/_opener.py", line 193, in open
    response = urlopen(self, req, data)
  File "/usr/lib/python3/dist-packages/mechanize/_urllib2_fork.py", line 425, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3/dist-packages/mechanize/_urllib2_fork.py", line 414, in _call_chain
    result = func(*args)
  File "/usr/lib/calibre/calibre/utils/browser.py", line 29, in https_open
    return self.do_open(conn_factory, req)
  File "/usr/lib/python3/dist-packages/mechanize/_urllib2_fork.py", line 1233, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 0] Error>

or a timeout when fetching an image (even though that image opens easily in a browser). During the e-book generation step I also get many messages like:

Failed to find image url/of/image

Side note: I can't send a PR; I get a permission error (403). I tried to move the oldest-article setting to the top of the script as a config variable.

UPDATE 1: The cleaner removes some parts of articles. For example, most of this article is missing.

UPDATE 2: If you have many items on your list (more than 1000) and auto_cleanup = False, it takes a very long time to generate index.htm. It took 5 hours just to generate the EPUB. Is that normal?

mmagnus commented 4 years ago

Hmm... it seems that your workflow is different than mine.

I'm not generating these huge books with so many articles, so I'm not sure how the plugin will behave (most likely there will be problems like the ones you describe).

I usually generate short e-books (~50 articles), a few MB in size, I guess. This is the environment in which I test the plugin, and I'm not getting these errors. For now, I have no time to test different environments, so I will focus on keeping the plugin alive in this minimalistic form.

You can also try to rethink your workflow and split it into more books of smaller size, as I also commented in #19.

Yeah, I also noticed that not all articles are formatted correctly, but I'm not sure where the problem is: in Calibre or in the plugin itself. For now, I accept that the plugin works for most articles.

Of course, I will be more than happy to accept any PR with improvements for all mentioned issues :-)

mmagnus commented 4 years ago

What I think would be cool with the current tag system is to fetch all of your tags and automatically process your Pocket My List (Inbox) without explicitly providing any tags.

Scenario:

You go around on the Internet, you collect articles and tag them, and then you fetch them. Pocket will detect all these tags and automatically process a new book with these tags.

I'm just not sure how to get all the tags, but maybe this is not that hard.

mmagnus commented 4 years ago

What is more annoying is that the titles of articles are often missing.

mmagnus commented 4 years ago

Eh, ok, I see that there are still some errors with sort_id. I think I have a fix for now.

However, when I play around wildly with the settings, the result is often not what I expected. Something is missing in the code, or we have to limit the number of options to something that is reliable.

alvaroreig commented 4 years ago

Hey @mmagnus ,

I gave v2.3.1 a go and found a couple of things:

Regards

mmagnus commented 4 years ago

https://github.com/mmagnus/Pocket-Plus-Calibre-Plugin/releases/tag/v2.3.2

mmagnus commented 4 years ago

Ad 1. Yeah, I kind of knew what this option does, but now I realize that this is what was missing in my last post. My problem was that I could only get articles added to my Pocket very recently, and this is exactly where this setting can be used. Now when I set:

TAGS = ['investing', 'pseudoscience', 'covid19',
        'politics',
        'car transport safety',
        'python'] 
INCLUDE_UNTAGGED = True
ARCHIVE_DOWNLOADED = False
MAX_ARTICLES_PER_FEED = 3000000000
OLDEST_ARTICLE = 100000
SORT_METHOD  = 'newest'
TO_PULL = 'all'

I get exactly what I want, all articles on, for example, pseudoscience.

So I take back my previous post; I think the plugin now works as configured. I moved this option to the top of the file and documented it.
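
For readers wondering how such settings could reach Pocket, here is a minimal sketch that maps them onto query parameters of Pocket's `/v3/get` endpoint (`state`, `sort`, `count`, `since`, `tag`, `detailType`). The function name `build_pocket_params` and the exact mapping are assumptions for illustration, not the plugin's actual code. The clamp to zero shows why a very large OLDEST_ARTICLE effectively removes the time limit.

```python
import time


def build_pocket_params(tag, to_pull="all", sort_method="newest",
                        max_articles=30, oldest_article_days=7, now=None):
    """Hypothetical helper: turn recipe-style settings into /v3/get parameters."""
    now = time.time() if now is None else now
    # 'since' is a Unix timestamp; clamp at 0 so huge day counts mean "no limit".
    since = max(0, int(now - oldest_article_days * 86400))
    params = {
        "state": to_pull,          # 'unread', 'archive' or 'all'
        "sort": sort_method,       # 'newest' or 'oldest'
        "count": max_articles,
        "detailType": "complete",  # ask Pocket to include tags in the response
        "since": since,
    }
    if tag is not None:
        params["tag"] = tag        # restrict the request to a single tag
    return params


params = build_pocket_params("pseudoscience", to_pull="all",
                             max_articles=3000000000,
                             oldest_article_days=100000,
                             now=1_589_500_000)
print(params)
```

With OLDEST_ARTICLE = 7 the same helper would only ask for items changed during the last week, which matches the "only very recent articles" behaviour described above.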

Ad 2. Yeah, this works as intended. Maybe the name is confusing, but now 'Untagged' means all articles that are NOT covered by the TAGS option. To some extent this is a good name, but it is easily confused with Pocket's own tags. This is why I wrote that it would be interesting to pull all tags from Pocket into TAGS, so that Untagged articles would be really untagged. I changed the name because 'The latest' was also confusing; maybe 'Misc' or 'The rest' would be better, I'm not sure. So, for now, 'untagged' means tagged or untagged with respect to the TAGS option.

Or do you guys have a better name? What do you think about the automatic extraction of all tags from Pocket?

BTW, I think I have now fixed the sort_id issue. I believe that, for whatever reason, sort_id is missing for some downloaded articles. So at the moment the plugin just skips these articles and therefore does not crash.
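
The skip-instead-of-crash idea can be sketched as below. The function `sortable_items` and the sample data are made up for illustration; the sketch only assumes that the Pocket response is a dict of items keyed by item id, some of which lack `sort_id`.

```python
def sortable_items(items):
    """Hypothetical helper: order items by sort_id, skipping those without one."""
    kept = [item for item in items.values() if "sort_id" in item]
    skipped = len(items) - len(kept)
    return sorted(kept, key=lambda item: item["sort_id"]), skipped


# Made-up sample shaped like the "list" part of a Pocket response.
response_list = {
    "101": {"item_id": "101", "resolved_title": "A", "sort_id": 1},
    "102": {"item_id": "102", "resolved_title": "B"},  # sort_id missing
    "103": {"item_id": "103", "resolved_title": "C", "sort_id": 0},
}

ordered, skipped = sortable_items(response_list)
print([item["item_id"] for item in ordered], "skipped:", skipped)
```

Reporting the number of skipped items, rather than dropping them silently, makes it easier to notice when many articles are affected.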

mmagnus commented 4 years ago

WOW, I was able to do it! The implementation is ugly, but it works (for me at least).

https://github.com/mmagnus/Pocket-Plus-Calibre-Plugin/releases/tag/v2.4.0

No need for extra configuration of tags.

If TAGS = [] (empty list), the plugin will connect to Pocket and fetch articles based on the plugin's configuration. Next, the plugin will read the tags of these articles and group them into sections in the final e-book. If TAGS has elements, e.g., TAGS = ['tag1', 'tag2'], then only these tags will be fetched from Pocket.
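
A rough sketch of the tag discovery described above, not the plugin's real implementation: the helper names `discover_tags` and `effective_tags` are hypothetical, and the sketch assumes each fetched item carries a `tags` dict keyed by tag name (items without tags simply lack that key).

```python
def discover_tags(items):
    """Collect every tag seen on the fetched items, sorted for a stable section order."""
    return sorted({tag for item in items for tag in item.get("tags", {})})


def effective_tags(configured_tags, items):
    """Use the configured TAGS if present, otherwise fall back to discovered tags."""
    return list(configured_tags) if configured_tags else discover_tags(items)


# Made-up sample items for illustration.
items = [
    {"resolved_title": "Vaccines", "tags": {"covid19": {}}},
    {"resolved_title": "Elections", "tags": {"politics": {}, "covid19": {}}},
    {"resolved_title": "No tags at all"},
]

print(effective_tags([], items))
print(effective_tags(["tag1", "tag2"], items))
```

The first call returns the tags found on the articles themselves; the second returns the configured list unchanged.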

mmagnus commented 4 years ago

screenshot_2020_05_15T11_47_57+0200 screenshot_2020_05_15T11_48_12+0200

mmagnus commented 4 years ago

https://github.com/mmagnus/Pocket-Plus-Calibre-Plugin/releases/tag/v2.4.1

Monirzadeh commented 4 years ago

Is it possible to get a specific tag and untag item? Something like this: TAGS = ['tag1', 'tag2', '']

mmagnus commented 4 years ago

I'm not sure what you mean, but this will do the job:

TAGS = ['tag1', 'tag2'] # fetch these two tags
INCLUDE_UNTAGGED = True # fetch the rest as untagged (untagged in this plugin context)

UPDATE:

I got exactly that:

TAGS = ['covid19', 'politics'] # [] or ['tag1', 'tag2']
INCLUDE_UNTAGGED = True
ARCHIVE_DOWNLOADED = True
MAX_ARTICLES_PER_FEED = 30 
OLDEST_ARTICLE = 7
SORT_METHOD = 'newest'
TO_PULL = 'unread'
TITLE_WITH_TAGS = False
TITLE_WITH_DATE = True

[screenshot]

mmagnus commented 4 years ago

Or do you want to get an article and untag it (i.e., remove a tag from the article, so that it has no tag)?

Monirzadeh commented 4 years ago

Thanks, that is exactly what I want. About the missing titles: you have all of them in the table of contents, but some of them are missing when you open the article.

mmagnus commented 4 years ago

Yeah, I know, I know, they are also at the top of the page in the book.

politics > <title>

However, I still don't know why they are sometimes missing from the articles.

Monirzadeh commented 4 years ago

I will try to test that and report here if I find anything. Is it possible to use it without the GUI, just from the command line? It could be useful to set it up as a cron job.

mmagnus commented 4 years ago

I'm not sure about it. It would be great, but I've never researched it.

@akaped also wanted something like this, but I'm not sure if he got it running from crontab.

Monirzadeh commented 4 years ago

This project has a CLI option for the Calibre plugin; maybe it can be helpful. Some other options could also go to the top of the script, like:

    auto_cleanup = False
    no_stylesheets = True
    use_embedded_content = False
    ignore_duplicate_articles = {'url'}

What exactly does this option do?

use_embedded_content = False

I don't know why I can't send a PR (403 error).

mmagnus commented 4 years ago

I'm not sure why you can't PR. I added you as a collaborator.

mmagnus commented 4 years ago

I'm not sure what these options do, so I can't document them, and that is why I didn't move them to the top. But sure, we can do it. Can you write some documentation on what they do?

Monirzadeh commented 4 years ago

OK, I will document them if I find out.

alvaroreig commented 4 years ago

Hi there,

@mmagnus that was fast! I've tested it a little bit. Some comments:

INCLUDE_UNTAGGED => I would name it INCLUDE_EVERYTHING_ELSE, with a comment explaining that if you are using autotag (TAGS = []), EVERYTHING_ELSE means UNTAGGED, but if you are specifying a set of tags (TAGS = ['tag1', 'tag2']), then EVERYTHING_ELSE means UNTAGGED plus articles tagged with any tag other than tag1 and tag2.

ARCHIVE_DOWNLOADED => works pretty well, but for some articles it failed the first time and not the second. I don't think it is significant; I will keep an eye on it over the following weeks.

SORT_METHOD => in some of my feeds it behaves oddly: the article order is not symmetrical when I try it with oldest and then with newest (with ARCHIVE_DOWNLOADED = False). As the date of the article is Pocket's date added, I guess it is a problem with certain articles and/or Pocket's API. I will keep an eye on it as well.
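
The "everything else" semantics from the first point can be illustrated with a small sketch. The function `split_into_sections`, its parameters and the sample items are hypothetical and only model the behaviour described in this thread; the plugin's own code may differ.

```python
def split_into_sections(items, tags, include_everything_else=True,
                        rest_label="Untagged"):
    """Hypothetical helper: one section per configured tag, plus a catch-all."""
    sections = {tag: [] for tag in tags}
    rest = []
    for item in items:
        matching = [tag for tag in tags if tag in item.get("tags", {})]
        if matching:
            # The first configured tag wins; the thread notes that articles
            # with several tags are not handled well yet.
            sections[matching[0]].append(item["title"])
        else:
            # No tags at all, or only tags that are not listed in TAGS.
            rest.append(item["title"])
    if include_everything_else and rest:
        sections[rest_label] = rest
    return sections


# Made-up sample items for illustration.
items = [
    {"title": "a", "tags": {"politics": {}}},
    {"title": "b", "tags": {"cooking": {}}},
    {"title": "c"},
]

print(split_into_sections(items, ["politics"]))
print(split_into_sections(items, ["politics"], include_everything_else=False))
```

Article "b" has a Pocket tag but still lands in the catch-all section, which is exactly the naming confusion discussed above.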

@mmagnus if you are OK with my first point I can PR with the comments rewritten. I would clarify a little the explanation of AUTOTAG if it is ok with you. Feel free to accept part of my comments or none at all.

Thanks a lot, regards

mmagnus commented 4 years ago

@alvaroreig sure, I added you as a collaborator, feel free to edit it.

mmagnus commented 4 years ago

BTW, for now only one tag per article works well. Handling of articles with two or more tags is to be improved in the future :-)

mmagnus commented 4 years ago

@Monirzadeh @akaped

I have been trying to find a way to fix the missing titles. I found a page with some extra info on plugins [1], including how to debug the plugin and how to run it from the terminal. And then EUREKA! This is what we need to be able to run this script in the background. I also found it annoying to have to open Calibre to send me some news whenever I want to read it. I also hacked together a Python script [2] that sends a binary file to Amazon using Gmail; it works perfectly! (For Gmail you have to change the settings to "Less secure connection", or find another way to send this file to Amazon.)

(py37) [mx] Pocket-Plus-Calibre-Plugin$ git:(master) ✗ rm Pocket.mobi; ebook-convert Pocket.recipe .mobi && ./Pocket-send-to-amazon.py
Conversion options changed from defaults:
  test: None
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
Using custom recipe
Using user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36
1% Fetching feeds...
1% Got feeds from index page
1% Trying to download cover...
1% Generating masthead...
Synthesizing mastheadImage
1% Starting download [5 threads]...
17% Article downloaded: Sending an email via the Python email library throws error "expected string or bytes-like object"
34% Article downloaded: How to Get Free Magazines on Your Kindle with Calibre
Failed to generate default cover
34% Feeds downloaded to /private/var/folders/yc/ssr9692s5fzf7k165grnhpk80000gp/C/calibre_4.16.0_tmp_fXeFHz/HvTO4__plumber/index.html
34% Download finished
Parsing all content...
Forcing feed_0/article_0/index.html into XHTML namespace
Forcing feed_1/article_0/index.html into XHTML namespace
Forcing index.html into XHTML namespace
Referenced file u'feed_2/index.html' not found
34% Running transforms on e-book...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Cleaning up manifest...
Trimming unused files from manifest...
Creating MOBI Output...
67% Running MOBI Output plugin
Serializing resources...
Converting TOC for MOBI periodical indexing...
Creating MOBI 6 output
Generating in-line TOC...
Applying case-transforming CSS...
Rasterizing SVG images...
Converting XHTML to Mobipocket markup...
Serializing markup content...
  Compressing markup content...
Generating MOBI index for a periodical
MOBI output written to /Users/magnus/workspace/Pocket-Plus-Calibre-Plugin/Pocket.mobi
Output saved to   /Users/magnus/workspace/Pocket-Plus-Calibre-Plugin/Pocket.mobi
OK

[1] https://manual.calibre-ebook.com/news.html

[2] https://github.com/mmagnus/Pocket-Plus-Calibre-Plugin/blob/master/Pocket-send-to-amazon.py
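
For the cron use case raised earlier in the thread, the two shell steps above could be wrapped in a small Python script. This is only a sketch: the function names are made up, and it assumes `ebook-convert` is on the PATH and that the script runs from the repository directory. The `run` parameter exists only so the command sequence can be checked without actually invoking Calibre.

```python
import subprocess


def convert_command(recipe="Pocket.recipe", fmt="mobi"):
    """Build the ebook-convert call; an output of '.mobi' lets Calibre
    derive the output file name from the recipe name."""
    return ["ebook-convert", recipe, "." + fmt]


def fetch_and_send(recipe="Pocket.recipe",
                   sender="./Pocket-send-to-amazon.py",
                   run=subprocess.run):
    """Convert the recipe, then send the result; stop if conversion fails."""
    # check=True raises on a non-zero exit code, mirroring '&&' in the shell.
    run(convert_command(recipe), check=True)
    run([sender], check=True)
```

Calling `fetch_and_send()` from a script that cron starts in the repository directory would reproduce the manual command shown in the log, minus the initial `rm Pocket.mobi`.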

mmagnus commented 4 years ago

@alvaroreig yeah, thanks for the suggestion, this is so great!

 $ calibre-smtp -a Pocket.mobi -u mag_dex --password XXXXXX FROM AMAZON_MAIL BODY --encryption-method SSL -v -r poczta.o2.pl

works!

mmagnus commented 4 years ago

BTW I added a nice touch to the book :D [screenshot]

mmagnus commented 4 years ago

OK, I asked Kovid Goyal kovid@kovidgoyal.net to take a look at the code, maybe he can help with the missing titles #22

mmagnus commented 4 years ago

BTW I developed a hacky way to trigger the plugin from your phone by talking to Siri:

https://github.com/mmagnus/Pocket-Plus-Calibre-Plugin/tree/master/push

I guess for Android phones something similar can be hacked :-)

mmagnus commented 4 years ago

https://github.com/mmagnus/Pocket-Plus-Calibre-Plugin/tree/master/pocketX

I think this might be interesting for some of you. We all have different workflows, but so far this version of Pocket and the recipes works amazingly for me when dealing with paywalled journals and some more complex ways of processing web pages.

mmagnus commented 4 years ago

I think I fixed the missing titles with postprocess_html(). Works for me for now.

https://github.com/mmagnus/Pocket-Plus-Calibre-Plugin/commit/4909dba9d16a79ae0388f00b7821440f7fd49a5c

mmagnus commented 4 years ago

I coded adding the URL in pocketx.recipe. This is pretty cool: you can use the Kindle browser to go to the page and see, for example, the full page with comments. Pretty useful.

screenshot_2020_06_03T22_44_43+0200 screenshot_2020_06_03T23_00_14+0200

mmagnus commented 4 years ago

I also coded automatic tag assignment based on URLs.

URLS_TO_TAGS = {'investing': ['fool.com',
                              'finance',
                              'marketwatch.com'],
                'rna': ['rnajournal'],
                }  # or nothing: {} to switch off

This will assign the given tags to articles with the given URLs.
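
One way such a mapping could be applied is plain substring matching on the article URL, which the 'finance' entry seems to suggest. That matching rule and the helper name `tags_for_url` are assumptions for this sketch, not a description of the actual pocketx.recipe code.

```python
URLS_TO_TAGS = {
    "investing": ["fool.com", "finance", "marketwatch.com"],
    "rna": ["rnajournal"],
}


def tags_for_url(url, urls_to_tags=URLS_TO_TAGS):
    """Hypothetical helper: return every tag with a pattern contained in the URL."""
    url = url.lower()
    return [tag for tag, patterns in urls_to_tags.items()
            if any(pattern in url for pattern in patterns)]


print(tags_for_url("https://www.fool.com/investing/some-article"))
print(tags_for_url("https://rnajournal.cshlp.org/content/1"))
print(tags_for_url("https://example.org/unrelated"))
```

Passing an empty dict as `urls_to_tags` yields no tags for any URL, which corresponds to switching the feature off.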

mmagnus commented 4 years ago

My plan now is to add a QR code for this, so that I can view any content on my phone, for example a YouTube video.

mmagnus commented 4 years ago

WORKS!!!!!!!!!!!!!!

screenshot_2020_06_04T20_58_08+0200

mmagnus commented 4 years ago

Calibre uses its own Python, so you have to add something like this to import qrcode:

"""
Install qrcode with pip (pip install qrcode) and add the path to its
site-packages here, so that Calibre's Python can find it.
If you don't know what the path is, just run your own Python and then

>>> import qrcode
>>> help(qrcode)
"""
import sys

sys.path.append("/usr/local/lib/python2.7/site-packages")
import qrcode

mmagnus commented 4 years ago

OK, I see that adding titles and URLs does not always work. I'll try to fix it.

mmagnus commented 4 years ago

OK, I pushed my changes regarding URL and QR.

alvaroreig commented 4 years ago

Hi there @mmagnus

I just updated to 2.6.2 and I can't see the URL or the QR code. Are you sure?

By the way, the recipe fails to download images from one of my favorite websites:

Failed to find image: https://www.jotdown.es/2020/07/a-las-que-callan/

Any tips on how to debug/fix it?

Thanks, regards