rubys / venus

Planet Venus is an awesome ‘river of news’ feed reader. It downloads news feeds published by web sites and aggregates their content together into a single combined feed, latest news first.
http://intertwingly.net/code/venus/docs/index.html

entry date / mtime handling is problematic #15

Closed hossman closed 12 years ago

hossman commented 12 years ago

(note: this is a somewhat long-winded stream of consciousness rant, because i'm not really sure what the best solution is ... i'm not sure i even understand all the problems)

"ignore_in_feed: updated" seems really handy, but it's an incredibly painful double-edged sword.

For my purposes, what i really (truly) want is for entries to never be reordered in my planet, even if they are updated in the source feed.

If you use "ignore_in_feed: updated" you get something close to this behavior, except that:

...that last point sounds like a positive, except that any time you add a new feed without "published" dates, every item in that feed suddenly floods the "top" of the planet. Likewise, if you need to purge your cache for some reason (e.g. you've added a new filter and want all the existing items to be updated), suddenly all sorts of old items appear at the top of the planet.

What i'd like is to have a "sort_date" which is defined along the lines of:

if entry has never been seen before and is not in the cache:
    set entry.sort_date = min(now_if_null(entry.published_date), now_if_null(entry.updated_date))
else:
    set entry.sort_date = cache.get(entry.id).sort_date

...and then have venus use that sort_date for the sorted list (i'm happy to still display the published & updated dates in the templates, although it would be nice to have the sort_date in the template as well)
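A minimal Python sketch of the "sort_date" rule described above, for illustration only. The function name `compute_sort_date` and the plain dict standing in for the cache are assumptions; Venus itself stores cached entries as files and has no such function.

```python
from datetime import datetime, timezone


def compute_sort_date(entry_id, published, updated, cache, now=None):
    """Return a sort date that is fixed the first time an entry is seen.

    `cache` is a hypothetical dict mapping entry id -> sort_date; it is a
    stand-in for the real on-disk cache, not Venus's actual API.
    `published` and `updated` are datetimes, or None when the feed omits them.
    """
    # Entry already known: keep the date it was first given.
    if entry_id in cache:
        return cache[entry_id]

    if now is None:
        now = datetime.now(timezone.utc)

    # New entry: earliest of published/updated, with "now" replacing a missing date.
    sort_date = min(published if published is not None else now,
                    updated if updated is not None else now)
    cache[entry_id] = sort_date
    return sort_date
```

With this rule a later change to "updated" in the source feed never moves an entry that is already cached, and an entry with no dates at all is sorted at the time it was first seen.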

Having already realized that venus sets the "mtime" of the cache files based on the "updated" date of each entry, I set out to write a plugin filter that would implement that logic, but using the "updated" date of each entry as the sort so i wouldn't have to modify any of the main venus code for dealing with the cache file mtimes.

To start with, i just made my filter replace the "updated" date with the "published" date if it existed and was earlier -- but then i realized that even though my filter was setting the "updated" just fine, it wasn't being reflected in the "mtime" of the cache file. This was perplexing, but I figured it wasn't a big deal: i was going to have my filter read from the cache anyway, so i could also make it touch the cache file with the appropriate mtime value. But when i looked closer at the venus code i realized that when spider.py's writeCache method is looping over the feed entries, it seems to go out of its way to save out the "updated" date of the entry before running any filters -- and then it uses that saved "updated" value to set the mtime on the final cache file after the filters run. So even if my plugin did reach into the cache to touch the file, spider.py would undo that.

hence this writeup.

It seems like the flow of control in spider.py is kind of backwards: filters should be able to modify just about anything in the entry, including the "updated" date, and spider.py should respect that final computed value and use it as the mtime of the file. But the flow of that method almost seems intentional, like there is a deliberate reason why that's not allowed, and if there is i'm definitely curious as to what it is.
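To make the two orderings concrete, here is a rough Python sketch. It is not the actual spider.py code: `write_cache_entry`, `prefer_published`, the dict-shaped entry, and the `respect_filters` flag are all hypothetical names used only to contrast "mtime from the pre-filter updated date" with "mtime from the post-filter updated date".

```python
import os
import tempfile


def prefer_published(entry):
    """Example filter: use 'published' as 'updated' when it is earlier."""
    published = entry.get("published")
    if published is not None and published < entry["updated"]:
        entry = dict(entry, updated=published)
    return entry


def write_cache_entry(path, entry, filters, respect_filters=True):
    """Write an entry to `path` and set the file mtime from its 'updated' date.

    Dates are epoch seconds. With respect_filters=False the mtime comes from
    the value saved before the filters ran (the behaviour described in this
    issue); with respect_filters=True it comes from the filtered entry.
    """
    saved_updated = entry["updated"]      # remembered before any filter runs
    for run_filter in filters:
        entry = run_filter(entry)

    with open(path, "w", encoding="utf-8") as fh:
        fh.write(entry["content"])

    mtime = entry["updated"] if respect_filters else saved_updated
    os.utime(path, (mtime, mtime))        # (atime, mtime)
    return mtime
```

In the `respect_filters=False` case a filter can rewrite "updated" in the stored entry, but the file mtime (and therefore the sort order) still reflects the original feed value.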

Alternately: if anyone has any suggestions on how to achieve the goal stated above (to have the date used for sorting entries come from the feed the first time the entry is encountered, but never change once it's in the cache) I'd love to hear them as well. So far the best i've come up with is to use "ignore_in_feed: updated" when i let venus run from cron, but comment it out any time I add a new feed or after flushing the cache. (but that's error prone, and a pain in the ass)

hossman commented 12 years ago

pull request #16 has code that addresses most of the issues mentioned above. The only thing it doesn't address is adding a completely new "sort_date" property that can be independent from the updated & published dates -- updated is still used as the mtime, but now filters can modify it.