passiomatic / coldsweat

Web RSS aggregator and reader compatible with the Fever API
MIT License

Hooks for entry processing #53

Closed tewe closed 10 years ago

tewe commented 10 years ago

I'd like to turn summary feeds into full feeds. This does not need to happen per-user. Some feeds are so broken that it cannot be done in a generic way.

How about a directory from which all .py files are imported, and at the end of fetch_feed() you call a method on all PluginAPI.__subclasses__()?
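
For concreteness, a minimal sketch of that discovery scheme (using modern importlib rather than the imp module the project used at the time; PluginAPI, on_feed_fetched, and load_plugins are illustrative names, not Coldsweat code):

```python
import glob
import importlib.util
import os

class PluginAPI:
    """Base class; drop-in plugin modules subclass this."""
    def on_feed_fetched(self, feed, entries):
        pass

def load_plugins(plugin_dir):
    # Import every .py file so its PluginAPI subclasses get defined.
    for path in glob.glob(os.path.join(plugin_dir, '*.py')):
        name = os.path.splitext(os.path.basename(path))[0]
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

def run_fetch_hooks(feed, entries):
    # Called at the end of fetch_feed(): every imported subclass
    # gets a chance to process the freshly fetched entries.
    for cls in PluginAPI.__subclasses__():
        cls().on_feed_fetched(feed, entries)
```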

passiomatic commented 10 years ago

I guess you are suggesting a way to implement #34 here. I'm open to that. However, I would like to delegate the process to external code as much as possible.

If you can describe all the steps required, I believe I can implement it.

tewe commented 10 years ago

See https://github.com/tewe/coldsweat/commit/4b30b4ebdcbd86fbdd22feb0c6f9b6cee8d998e6

I didn't use classes after all, to look more like the rest of the code. I also made importing explicit instead of implicit, in case you want to ship plugins that users first have to enable, and to work around the call-order problem.

passiomatic commented 10 years ago

That's great. Thank you for the code. So, if I understand correctly, you have a plugins folder where one places the plugin modules. A plugin module can have one or more handlers.

Each module is loaded via imp's find_module/load_module pair according to the imports listed in the config file, and a hook for a given event is registered via the decorator. Each event can be hooked by one or more handlers; handling order depends on the import list found in the config file.
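
A sketch of that mechanism, for reference (the names on, event_handlers, and load_plugins are illustrative; only the imp find_module/load_module pair comes from the description above):

```python
import imp
from collections import defaultdict

event_handlers = defaultdict(list)  # event name -> ordered handlers

def on(event):
    # Decorator used inside plugin modules to register a handler.
    def decorator(handler):
        event_handlers[event].append(handler)
        return handler
    return decorator

def load_plugins(names, plugin_dir):
    # 'names' comes from the config file; import order determines
    # handler order, as noted above.
    for name in names:
        file, path, description = imp.find_module(name, [plugin_dir])
        try:
            imp.load_module(name, file, path, description)
        finally:
            if file:
                file.close()
```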

If possible I think the plugins folder should stay outside the coldsweat folder (that is, at the same level as static and data), so that when one updates the package only that folder gets overwritten and the plugins are left intact.

To start, it makes sense to plan a couple of events, something like entry_parsed/entry_saved, so one can make changes before or after an entry has been saved into the DB.
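
For instance, a hypothetical plugin module hooking both events might look like this (the @on decorator is from the sketch above; looks_truncated and fetch_full_article are made-up helpers, and the entry fields are illustrative):

```python
@on('entry_parsed')
def expand_summary(entry, parsed_entry):
    # Runs before the entry is saved: rewrite the stored content.
    if looks_truncated(entry.content):
        entry.content = fetch_full_article(entry.link)

@on('entry_saved')
def log_entry(entry):
    # Runs after the save, when the database row (and its id) exists.
    print('saved entry %r as #%d' % (entry.title, entry.id))
```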

Overall I really like the approach. I'm thinking that using this architecture even entry scrubbing could become a plugin.

tewe commented 10 years ago

If you add this, I can get started on #34.

passiomatic commented 10 years ago

Will do. Probably during this weekend.

passiomatic commented 10 years ago

@tewe, I've added an initial implementation to the issue53 branch. The plugin interface is likely to change in the near future, but in the meantime the current events are:

entry_parsed
entry_saved
fetcher_started

Notes

I'm not 100% sure about the event names, yet.

The entry_parsed event passes the original Feedparser parsed entity to the plugin handler too, so the plugin can extract more information from it (think of category, tags, etc.). Does it make sense to pass the same parsed_entry object to entry_saved?

The fetcher_started event is useful to set up things at the plugin level, like the scrubber does. It probably makes sense to have a "fetcher done" event too. Actually, the fetcher_started event is currently only triggered when a bunch of feeds are about to be fetched, not when, for example, a single feed is added and fetched into the db. I need to fix that.

I'm wondering why you used that specialized "hooks" tuple instead of a regular function; the trigger_event(name, *args) form looks clearer to me.

Let me know if this interface is enough for your plugins. At this stage nothing is set in stone.

tewe commented 10 years ago

I see no point in differentiating entry_parsed and entry_saved, unless you allow plugins to prevent entries from being saved. That requires a return value, which is impossible if you trigger events through a helper function. I'd make iteration and manual calling the default, hence the tuple.
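
A sketch of the difference (names are illustrative): with a tuple of hooks the caller iterates itself and can honor a veto, while a fire-and-forget trigger_event discards return values:

```python
hooks = (spam_filter, expand_summary)  # the plugin-supplied handlers

def process_entry(entry, parsed_entry):
    for hook in hooks:
        if hook(entry, parsed_entry) is False:
            return None        # a plugin vetoed the save
    return save_entry(entry)   # save_entry stands in for the DB write
```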

passiomatic commented 10 years ago

I see no point in differentiating entry_parsed and entry_saved...

In your full-feeds case you need to replace the content of an entry and then Coldsweat can save it, right? So you just need entry_parsed. Is that correct?

...unless you allow plugins to prevent entries from being saved.

Definitely not a goal.

tewe commented 10 years ago

To complete feeds, parsed would suffice. But so would saved, as I can just save again. If I can't, there is no need for the Feedparser instance (plugins that use external storage are too complex for this API anyway).

My next plugin would be a filter. If I can't keep entries out of the global database I'll just mark them read for every user >:-)
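
In other words, without a veto mechanism a filter plugin can only approximate the effect after the fact, something like this (is_unwanted and mark_read_for_all_users are hypothetical helpers):

```python
@on('entry_saved')
def suppress(entry):
    # Can't keep the entry out of the database, so hide it instead.
    if is_unwanted(entry):
        mark_read_for_all_users(entry)
```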

passiomatic commented 10 years ago

To complete feeds, parsed would suffice. But so would saved, as I can just save again.

Yep, that was the idea: expose the pre- and post-save operations to avoid saving the entry twice, or, in the case of entry_saved, to grab the unique ID value from the database row. But we can always expose the entry_saved event later.

If I can't, there is no need for the Feedparser instance (plugins that use external storage are too complex for this API anyway).

In theory a plugin could use fetch_started as an opportunity to change the db schema with additional tables, maybe even add fields to existing tables. But that would feel really hackish.

Anyway, I think that to start, an absolute minimum would be:

entry_parsed
fetch_done

That fetch_done could receive a list of the feeds just refreshed. Uhm.

passiomatic commented 10 years ago

Added some wiki documentation for the plugin interface: https://github.com/passiomatic/coldsweat/wiki/Fetcher-Plugin-Interface

passiomatic commented 10 years ago

The branch has been merged into 0.9.2 and everything seems to work as expected. If you find any bugs, please open a specific issue.

tewe commented 10 years ago

Sorry for putting this here, but I'm still not comfortable with rebasing.

The trigger_event('fetch_done', [feeds]) line in commands.py needs to lose the brackets.
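
To see why, assuming the trigger_event(name, *args) form mentioned earlier, which calls each handler as handler(*args):

```python
trigger_event('fetch_done', [feeds])  # handler gets [feeds], a one-element
                                      # list wrapping the feeds list
trigger_event('fetch_done', feeds)    # handler gets the feeds list itself
```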

passiomatic commented 10 years ago

Good catch. Too bad I've released 0.9.2 today. I'll fix that on the new 0.9.3 branch. Thank you.

passiomatic commented 10 years ago

The trigger_event('fetch_done', [feeds]) line in commands.py needs to lose the brackets.

@tewe: FYI: I've released version 0.9.3

tewe commented 10 years ago

When running the fetcher with multiprocessing, entry_parsed is called in the workers, but fetch_done in the parent. So you cannot easily aggregate data over all articles. Please document this.

passiomatic commented 10 years ago

I see what you are thinking here. I believe it needs to be clarified that fetch_start|done is triggered on fetcher start and end, not on a per-feed basis. Is that what you mean?

In theory, plugins shouldn't need to care whether multiprocessing is being used or not.

tewe commented 10 years ago

When your plugin has state, e.g. a module-level variable that gets modified in entry_parsed, those modifications will not be visible to fetch_done when it runs under multiprocessing.
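
A minimal illustration of the pitfall (handler names are made up; @on as in the earlier sketch):

```python
seen_titles = []  # module-level plugin state

@on('entry_parsed')
def collect(entry, parsed_entry):
    # Under multiprocessing this runs in a worker process and appends
    # to the worker's own copy of seen_titles.
    seen_titles.append(entry.title)

@on('fetch_done')
def report(feeds):
    # Runs in the parent process, whose seen_titles was never touched.
    print('%d entries seen' % len(seen_titles))  # prints 0
```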

passiomatic commented 10 years ago

When running the fetcher with multiprocessing, entry_parsed is called in the workers, but fetch_done in the parent. So you cannot easily aggregate data over all articles. Please document this.

I've updated the wiki page with your remarks.