Request for comments: add media/page archiving capabilities to the Python Shaarli client

nodiscc commented 7 years ago

Hi, this is not intended to be merged.

I attached my current quick & dirty script to archive music from an export of my Shaarli instance. It's just a bash script, as I needed it quickly, and for now it only downloads music, which is what I needed. I'd like to rewrite it in Python, with well-thought-out integration with the official client. Consider this a proof of concept for a rewrite of https://github.com/nodiscc/shaarchiver

I'd like some input on how this would be best achieved:

How much code separation from the main client? How to properly implement it?
- Add a separate entry_point to setuptools?
- Add a --archive-media flag to shaarli?
- Add an actions = option in config file? Add extractor configuration there?
- Write a totally separate client and import shaarli-client as a library?

Some notes:

- The original Shaarli feature request for archiving shaares contents is https://github.com/shaarli/Shaarli/issues/318
- There's a brief discussion about content extraction for the Python client at https://github.com/shaarli/Shaarli/issues/745
- In https://github.com/shaarli/Shaarli/issues/106#issuecomment-74980648 it was suggested that multimedia/page content archiving/mirroring could be added directly as a Shaarli plugin. I think both a CLI archiving tool and a Shaarli plugin have their place (e.g. I want to run the archive on my laptop, I don't want my webserver/PHP stack to exec() call youtube-dl, I have a shared host without youtube-dl/wget/... support...)
- There will inevitably be some feature creep, as there are many use cases for web scraping and web content download in general - which is why I'm dubious about direct integration in the official API client. At first I intend to focus on 1. downloading multimedia content, as it frequently disappears without notice, and 2. generating a friendly offline export of my shaares.
- --format text is broken for me (invalid option --format). I'll investigate that.
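
For the "download multimedia from my laptop" use case, a Python rewrite could shell out to youtube-dl rather than reimplement extraction. The following is only a sketch under assumptions: the function names are hypothetical, and it assumes the `youtube-dl` executable is on the PATH. The attached bash script may work differently.

```python
import shutil
import subprocess


def build_audio_command(url, outdir, audio_format="mp3"):
    """Build a youtube-dl command line that extracts audio from one URL.

    Hypothetical helper: only the youtube-dl flags themselves are real.
    """
    return [
        "youtube-dl",
        "--extract-audio",
        "--audio-format", audio_format,
        # youtube-dl output template: <outdir>/<title>.<extension>
        "--output", f"{outdir}/%(title)s.%(ext)s",
        url,
    ]


def download_audio(url, outdir, audio_format="mp3"):
    """Run youtube-dl for one URL and return its exit code."""
    if shutil.which("youtube-dl") is None:
        raise RuntimeError("youtube-dl is not installed or not on the PATH")
    command = build_audio_command(url, outdir, audio_format)
    return subprocess.run(command, check=False).returncode
```

Keeping command construction separate from execution makes the first part testable without network access.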

To get a clearer picture, I added a list of current shaarchiver features, as well as features that might reasonably be requested, to the script header. Have a look.

With that in mind, what is the best way to start implementing an archiving tool around the API? (@virtualtam this is for you :) I'd rather not add bloat to the shiny new API client - I think it should stay a clean, reference client. On the other hand, well-integrated actions/modules would be interesting.)

Once I have a clearer picture I will start working on a basic implementation, and might as well ping people who were interested in a Shaarli archiving tool.

Again there is no rush :) ETA year 2018. I'd like to work on polishing the API client first, add some tests, etc.

Edits:

This project could be useful as an inspiration: https://github.com/pirate/bookmark-archiver/issues/3

virtualtam commented 7 years ago

Hi!

Here are some first thoughts :)

> How much code separation from the main client? How to properly implement it?

Let's start simple:

- keep a single codebase
- leverage setuptools dependency management to specify optional features tied to 3rd-party dependencies
- add a subcommand parser dedicated to data archival
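
As a rough illustration of the "optional features" point, setuptools lets a package declare extras that pull in 3rd-party dependencies only on request (e.g. `pip install shaarli-client[archive]`). This is a sketch under assumptions: the package name, the `archive` extra and the `shaarli.main:main` entry point are hypothetical placeholders, not the client's actual metadata.

```python
import importlib.util


def build_setup_kwargs():
    """Return keyword arguments that a setup.py could pass to setuptools.setup().

    All names below are illustrative assumptions.
    """
    return {
        "name": "shaarli-client",
        "packages": ["shaarli"],
        # Optional feature: only installed with the "archive" extra.
        "extras_require": {"archive": ["youtube-dl"]},
        "entry_points": {
            "console_scripts": ["shaarli = shaarli.main:main"],
        },
    }


def feature_available(module_name):
    """Check at runtime whether an optional dependency can be imported."""
    return importlib.util.find_spec(module_name) is not None
```

At runtime, the archival code could call `feature_available("youtube_dl")` and print a helpful message instead of failing on import when the extra was not installed.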

IMO these operations should be performed separately:

- query a Shaarli instance to get a list of links
- parse a list of links and retrieve/archive corresponding media
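
One way to keep the two operations separate is to use a JSON file as the hand-off point between them. The sketch below assumes each link is a dict with a `url` key; the function names and the `archive_one` callable are hypothetical, and the query step itself is left out.

```python
import json
from pathlib import Path


def dump_links(links, outfile):
    """End of step 1: write the retrieved list of links to a JSON file."""
    Path(outfile).write_text(json.dumps(links, indent=2), encoding="utf-8")


def archive_from_file(infile, archive_one):
    """Step 2: read links back and pass each URL to an archiving callable.

    Returns a dict mapping each URL to the callable's result, or to the
    exception it raised, so one failed download does not stop the run.
    """
    links = json.loads(Path(infile).read_text(encoding="utf-8"))
    results = {}
    for link in links:
        url = link["url"]
        try:
            results[url] = archive_one(url)
        except Exception as exc:
            results[url] = exc
    return results
```

Because step 2 only needs a file, it can run later, offline from the Shaarli instance, or on a different machine than step 1.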

In the long run, we'll see whether more granularity is needed to keep sources and CLI usage consistent.

> Add extractor configuration there [in a config file]?

Archival preferences could be specified in a config file:

- local archive directories
- multimedia preferences, e.g. audio & video formats
- ...
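
Such preferences could live in an INI-style section read with `configparser`. This is a sketch only: the `[archive]` section and its option names are assumptions made for illustration, not an existing part of the client's configuration.

```python
import configparser

# Hypothetical config fragment; section and option names are assumptions.
SAMPLE_CONFIG = """
[archive]
outdir = ~/shaarli-archive
audio_format = mp3
video_format = mp4
"""


def load_archive_prefs(text):
    """Parse archival preferences, falling back to defaults when absent."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "outdir": parser.get("archive", "outdir", fallback="archive"),
        "audio_format": parser.get("archive", "audio_format", fallback="mp3"),
        "video_format": parser.get("archive", "video_format", fallback="mp4"),
    }
```

Using `fallback=` means a config file without an `[archive]` section still yields usable defaults.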

> There will inevitably be some feature creep, as there are many use cases for web scraping and web content download in general

As with the current REST client, 3rd-party integrations should be implemented in library form, with a console entrypoint that may serve as a Minimal Working Example in case someone wants to customize data retrieval and/or processing.

> multimedia/page content archiving/mirroring could be added directly as a Shaarli plugin [...] I don't want my webserver/PHP stack to exec() call youtube-dl, I have a shared host without youtube-dl/wget/... support...)

The archival tool could be wrapped in a web (micro)service providing a REST API, that would be called by the corresponding Shaarli plugin.

nodiscc commented 6 years ago

I've been thinking about this lately. I can't figure out how to add a subcommand parser that would run a function that 1. runs get-link with the specified parameters, 2. writes the output to a (JSON) file, and 3. parses the file and runs archival methods on the link list. The command line would be something like:

shaarli archive-links --limit=200 --tags=something --outdir=archive/.

I can't simply add archive-links to endpoints, since those specifically correspond to Shaarli API endpoints.

All in all I'm thinking about starting a separate project that would depend on python-shaarli-client, but maybe you could point me to the right way of adding that subcommand parser?

virtualtam commented 6 years ago

Suggestions:

1. rename the current script to shaarli-api and add new scripts, e.g. shaarli-archive
2. move API commands to an api subparser, and declare other subparsers for specific actions:
   - $ shaarli api <params>
   - $ shaarli archive <params>
   - $ shaarli <action> <params>

Option 2 seems more consistent: it provides a single entrypoint with action-specific subparsers, while keeping a single project/package to gather Shaarli archival tools.
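
A minimal `argparse` sketch of option 2, with one subparser per action, each dispatching to its own function. The handler functions and the default values are hypothetical; the `--limit`, `--tags` and `--outdir` options are borrowed from the example command line earlier in this thread.

```python
import argparse


def run_api(args):
    # Placeholder: a real handler would call the requested REST endpoint.
    return ("api", args.endpoint)


def run_archive(args):
    # Placeholder: a real handler would fetch links, then archive them.
    return ("archive", args.limit, args.tags, args.outdir)


def build_parser():
    parser = argparse.ArgumentParser(prog="shaarli")
    actions = parser.add_subparsers(dest="action", required=True)

    api = actions.add_parser("api", help="call a Shaarli REST API endpoint")
    api.add_argument("endpoint", help="name of the API endpoint to call")
    api.set_defaults(func=run_api)

    archive = actions.add_parser("archive", help="archive media for links")
    archive.add_argument("--limit", type=int, default=20)
    archive.add_argument("--tags")
    archive.add_argument("--outdir", default="archive")
    archive.set_defaults(func=run_archive)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Each subparser stored its handler in args.func via set_defaults().
    return args.func(args)
```

With this layout, `shaarli archive --limit=200 --tags=something --outdir=archive/` never touches the list of API endpoints, which addresses the concern that archive-links is not an endpoint.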

virtualtam commented 6 years ago

@nodiscc there's also the possibility of providing an interactive CLI entrypoint using the click library (possibly overkill but potentially quite fun to write :) )

nodiscc commented 6 years ago

Hi, I wrote a small patch to implement an --outfile command line parameter, it got me up to speed and I have a clearer picture of how to implement basic shaarli api/shaarli archive... command line logic now (and thanks for your comment, that put me on the right track).

I'll run the final tests (Python SSL warnings also led me to finally ditch my server's self-signed certs and set up Let's Encrypt) and send a PR soon. It took me a while to pass the CI tests :)

Edit: re interactive interface: I'm more interested in the scripted/automated aspect of this tool right now, but I always wanted to look into python-click. Maybe someday :)

nodiscc commented 6 years ago

Moved to https://github.com/shaarli/python-shaarli-client/issues/24

shaarli / python-shaarli-client

Request for comments: add media/page archiving capabilities to the Python Shaarli client #22