skeeto / elfeed

An Emacs web feeds client

File index is large (9.9M), really open? a.k.a Tips about db maintenance. #11

Closed kmicu closed 11 years ago

kmicu commented 11 years ago

I currently have ~550 feeds (a lot of YouTube and Vimeo feeds). Elfeed's db is ~526 MiB.

Can you give me some tips about db maintenance?

Any feedback is appreciated before I run out of disk space :)

skeeto commented 11 years ago

kmicu notifications@github.com writes:

I currently have ~550 feeds (a lot of YouTube and Vimeo feeds). Elfeed's db is ~526 MiB.

Holy moley! I thought my ~140 was a lot. But I definitely want Elfeed to be able to handle 550 reasonably.

One of the slight annoyances about YouTube feeds is that the description contains the view count. This means each fetch creates a new, mostly-redundant entry in the content database for every video in every feed. That's probably responsible for most of that bloat.

Fortunately, the Elfeed garbage collector should take care of this. However, until just now, due to a typo it wasn't actually running when Emacs exits. I fixed this in 942c123. Your issue made me take a look at it to discover the problem. Sorry about that. My own database had gotten up to 320MB when it should only be about 20MB.

For now, try running the garbage collector manually. The function is `elfeed-db-gc' (non-interactive). The "safe" version is the one that was broken. It may take a few minutes to complete in your case (lots of disk IO). That will likely dramatically reduce the size of your database.
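For reference, since the function is non-interactive, the quickest way to run it by hand is to evaluate the call directly, e.g. with M-: or from the *scratch* buffer:

;; Garbage-collect unused content by hand; may take a few minutes.
(elfeed-db-gc)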

Can you give me some tips about db maintenance?

  • How can I physically remove entries older than X weeks?
  • How can I keep only favorite entries?
  • Maybe we can split elfeed db in two parts - operations and archive?

These are interesting ideas. If the above garbage collection run isn't satisfactory I can see about supporting features like this. It's actually a small amount of code. It should be as simple as this (untested):

(defun elfeed-db-purge (match-p)
  "Delete entry content for entries selected by MATCH-P."
  (with-elfeed-db-visit (entry _)
    (let ((ref (elfeed-entry-content entry)))
      (when (and (elfeed-ref-p ref) (funcall match-p entry))
        (elfeed-ref-delete ref)))))

To purge anything that's been read:

(elfeed-db-purge
 (lambda (entry) (not (memq 'unread (elfeed-entry-tags entry)))))
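And for the "older than X weeks" question above, the same function works with a date-based predicate. A rough sketch (untested; four weeks is chosen arbitrarily, and `elfeed-entry-date' is a float time in seconds):

;; Purge the content of entries older than four weeks.
(elfeed-db-purge
 (lambda (entry)
   (< (elfeed-entry-date entry)
      (- (float-time) (* 4 7 24 60 60)))))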

Note that the entries will all remain listed; you just won't be able to visit them with elfeed-show since the content is gone. Also, any content Elfeed sees again will be rewritten, so it will really only purge old content no longer listed by any feeds.

You could also write a hook for elfeed-new-entry-hook that throws out the content for YouTube entries and the like, since the content doesn't actually matter for these -- you only care about the link.
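A rough sketch of such a hook (untested; `my/elfeed-drop-youtube-content' is just an illustrative name, and it matches YouTube entries by their link):

(defun my/elfeed-drop-youtube-content (entry)
  "Throw away the content of YouTube entries; only the link matters."
  (when (string-match-p "youtube\\.com" (or (elfeed-entry-link entry) ""))
    (setf (elfeed-entry-content entry) nil)))

(add-hook 'elfeed-new-entry-hook #'my/elfeed-drop-youtube-content)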

I've experimented with compressing the content database -- specifically the individual files -- but there are no real gains. Unless you're using something like ReiserFS, compressing a 3kB file to a 1kB file doesn't actually save any storage space when your filesystem is using 4kB blocks. I could potentially invent some kind of packfile format (continuing to channel Git) to archive a large amount of content into a single file, so that compression can actually provide some real savings.

Thanks for being my massive-feed-list guinea pig! I'd really like to see how well Elfeed works for you in the long run.

kmicu commented 11 years ago

The latest patch solved the problem. elfeed-db-gc works like a charm.

For ~500 feeds the db dropped in size from ~520 MiB to ~90 MiB. Great success! For ~690 feeds the db size is ~135 MiB. The index size is still ~10 MiB.

Regarding YouTube feeds: the image preview and description are really useful to me, but I will try to filter out the view count and stars/rating.

Thanks for the fast patch and great work!

PS: I currently still double-check my feeds in Feedly/TheOldReader, and sometimes I see YouTube clips missing in Elfeed. The ratio is about one in a hundred entries from YouTube feeds. But I need more time to investigate. Maybe this is related to my weak internet connection. I will try to test it from a remote server.

skeeto commented 11 years ago

Good to hear it's working. Yeah, garbage collection doesn't reduce the index file size, just the content database. I did end up implementing a "packfile" format that reduces the content database size by another order of magnitude as long as auto-compression-mode works properly on your system (i.e. not Windows), but I haven't yet merged it into master (currently as branch "pack").

For the first two weeks I was using Elfeed I was also following along in The Old Reader to sort out missing entry bugs, and Elfeed eventually stopped missing anything from my feeds. I'd definitely like to see any cases that Elfeed misses. Could you share the problematic YouTube feed URLs?

kmicu commented 11 years ago

I will be logging these issues. They are rare, so it will take some time, maybe 2 weeks. For now I will be checking slow/mobile/unstable vs. fast connections, different max-connections settings, and so on.

I will keep you informed.

RafalBabinicz commented 10 years ago

Time for a summary.

I post this as information, not as a request for improvements. I plan to write an independent backend for elfeed which can be started as a daemon on a remote server, and to use the current elfeed only as a great client tool for reading and interacting with elfeed-db, but not for updating the feeds db. As far as I can see, Emacs/Elisp is not very good at updating feeds: I still need to open a separate Emacs instance to run an update (which takes over 15 minutes for 1400 feeds), and for me the viewer/updater separation is a natural direction for elfeed's evolution, like mu (the mail indexer) and mu4e (the MUA). This is work for the future with low priority, though.

Now back to business.

YouTube

If you get an error like (:error (error "gdata.youtube.com/443 Name or service not known")), it is because YouTube marks you as a spammer.

You can read more about this at http://apiblog.youtube.com/2010/02/best-practices-for-avoiding-quota.html

"In the unlikely event that you do get flagged for excessive API requests, you'll receive an HTTP response with a code of 403"

As I have over 900 YT feeds, this is a problem for me; such problems, IIRC, start to show up at around ~400 YT feeds with a network download speed of around 200 KiB/s.

If you have an SSD and a very fast connection, you will be marked as a spammer with fewer than 400 YT feeds.

"As a best practice, we recommend stopping all API calls from your application for 10 minutes after receiving such an error in order to "reset" your quota."

"As long as your application includes a developer key along with your YouTube API requests, your requests will be less likely to be flagged for quota violations."

Now I simply expand YouTube links in feed-patterns to:

(defvar feed-patterns
  '(;; (youtube   "http://gdata.youtube.com/feeds/api/users/%s/uploads?v=2")
    (youtube   "http://gdata.youtube.com/feeds/api/users/%s/uploads?v=2&key=MY_VERY_LONG_API_KEY_HERE")
    ;; (youtube   "http://gdata.youtube.com/feeds/base/users/%s/uploads?v=2")
    ;; (playlist  "https://gdata.youtube.com/feeds/api/playlists/%s")
    (playlist  "https://gdata.youtube.com/feeds/api/playlists/%s?v=2&key=MY_VERY_LONG_API_KEY_HERE")
    ;; (gmane     "http://rss.gmane.org/topics/complete/gmane.%s")
    (vimeooo   "http://vimeo.com/%s/videos/rss")
    (subreddit "http://www.reddit.com/r/%s/.rss"))
  "How certain types of feeds automatically expand.")

(Probably the same thing affects feedburner links, but I cannot confirm this for sure.)

Now, with the API key added, I have fewer issues with YouTube feeds.
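(The `feed-expand' helper used in the partial-update snippet further down isn't shown in this thread; here is a minimal sketch of what it might look like, assuming `feed-patterns' as above. The real helper may differ.)

(require 'cl-lib)

(defun feed-expand (tags url)
  "Expand URL through `feed-patterns' using the first tag in TAGS that
names a pattern; return URL unchanged if no pattern matches."
  (let ((pattern (cl-some (lambda (tag) (cadr (assoc tag feed-patterns)))
                          tags)))
    (if pattern (format pattern url) url)))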

Google is in the process of migrating from the v2 YT API to v3, and I post this only as information for the other crazy ones with over 400 YT feeds.

Feed IDs

Here is an interesting problem: if you add an API key or change the API version in the URL, like this:

  • original URL: https://www.youtube.com/user/ClojureTV
  • API version 2: http://gdata.youtube.com/feeds/api/users/ClojureTV/uploads?v=2
  • API key added: http://gdata.youtube.com/feeds/api/users/ClojureTV/uploads?v=2&key=MY_VERY_LONG_API_KEY_HERE

then in each case you create a new feed.

This is a "problem", because if your API key change in the future (or any URL parameter), then your feed entries will be fetched one more time during update, of course all entries will be marked as unread and also your db will grow unnecessary which each change to URL parameter. (Feed entries will be duplicated if I understand correctly.)

It can be solved in many ways, but for now I only mention it for future discussion.

Partial update

If you have too many feeds you can split the update process. For example:

(require 'cl-lib) ; for `cl-loop'
(require 'dash)   ; for `-separate' and `-partition-all'

(setq elfeed-feeds-all
      (cl-loop with specials = (mapcar #'car feed-patterns) ; (unused below)
               for (url . tags) in elfeed-feeds-alist
               for real-url = (feed-expand tags url)
               do (setf (gethash real-url elfeed-tagger-db) tags)
               collect real-url)) ; (length elfeed-feeds-all) => 714 feeds

(defun separate-youtube (feeds)
  "Split FEEDS into a list (YOUTUBE-FEEDS OTHER-FEEDS)."
  (-separate (lambda (feed) (string-match-p "youtube" feed)) feeds))

(setq elfeed-feeds
      (let* ((separated-feeds (separate-youtube elfeed-feeds-all))
             (youtube-feeds (car separated-feeds))   ; 952 feeds
             (regular-feeds (cadr separated-feeds))  ; 458 feeds
             (partitioned-youtube (-partition-all 320 youtube-feeds))
             (youtube-0 (nth 0 partitioned-youtube)) ; 320 feeds
             (youtube-1 (nth 1 partitioned-youtube))
             (youtube-2 (nth 2 partitioned-youtube))
             (youtube-3 (nth 3 partitioned-youtube)))
        ;; sanity checks:
        ;; (+ (length regular-feeds) (length youtube-feeds))
        ;; (length (-distinct elfeed-feeds-all))
        ;; (-difference elfeed-feeds-all (-distinct elfeed-feeds-all))
        ;; swap in youtube-0, youtube-1, youtube-2, youtube-3, or
        ;; regular-feeds here to update only that batch:
        elfeed-feeds-all))

This is not a whole solution, only an example. You can create separate functions which update only a specific kind of feed in one batch: 200 YouTube feeds, 100 Vimeo feeds, etc.

It is helpful if you have many feeds in the same domain, or if you want to split feeds into groups for daily, weekly, or monthly updates.
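A minimal sketch of such a batch updater (untested; `my/elfeed-update-batch' is an illustrative name, and the lists come from the snippet above):

(defun my/elfeed-update-batch (batch)
  "Fetch only the feed URLs in BATCH, one `elfeed-update-feed' call each."
  (mapc #'elfeed-update-feed batch))

;; e.g. update only the first YouTube batch, or only the regular feeds:
;; (my/elfeed-update-batch youtube-0)
;; (my/elfeed-update-batch regular-feeds)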

Minor random issues

I still have some minor issues. On one day I get errors like:

Elfeed update failed for http://tmorris.net/atom.xml: (:error (error "tmorris.net/80 Name or service not known"))
Elfeed update failed for http://chrisdone.com/rss.xml: (:error (error "chrisdone.com/80 Name or service not known"))

or

Elfeed update failed for http://nullprogram.com/feed/: (wrong-type-argument number-or-marker-p nil)
Elfeed update failed for http://mostlylazy.com/feed/: (wrong-type-argument number-or-marker-p nil)
Elfeed update failed for http://lambdalounge.org/feed/: (wrong-type-argument number-or-marker-p nil)

Of course, in a browser they work perfectly.

The next time, I do not get any errors. I totally understand that this can be related to random network lag, a cheap router, or the network libraries, and that it is not directly related to elfeed, but the errors reported by elfeed are not very helpful here.

Instead of a connection problem, I see wrong-type-argument number-or-marker-p on one day, but not on another.

But these are random errors for only 1-9 feeds out of 1400, and I can always run elfeed-update-feed manually one by one anyway. Not such a big deal after all.

Summary

In closing, I want to say that I am extremely happy with elfeed as my web feeds client. I use it daily. I am aware that my case is not usual, with over 1400 feeds (a lot of them belonging to the same domain). The purpose of this write-up is purely informative.

And I really want to thank @skeeto for his excellent software and great work.

Cheers, kmicu

skeeto commented 10 years ago

I plan to write an independent backend for elfeed

I've learned a whole lot about databases in the past year, and looking back I can see all sorts of mistakes I made with Elfeed's database. If I were to start over, a lot of things would be different.

There's an "emacsql" branch I started a few months ago where I got halfway to porting the backend to EmacSQL. Having a proper ACID database like SQLite would fix a lot of problems, and it would make it really simply to update using separate Emacs subprocesses. However, I'm still nervous about pushing EmacSQL on Elfeed users because it's a pretty hefty package, having an external binary dependency.

Starting down another route, I wrote an Emacs patch to add native SQLite bindings to Elisp. It's complete and working, but I haven't taken the time to finish polishing it. If I can get the patch accepted into core Emacs, I would definitely make use of it for Elfeed.

I guess my long term goal is to get the database running in SQLite one way or another. You're still right about Emacs being poor at updates, even with a new database, but maybe SQLite could help resolve it by making it easy for multiple subprocesses to safely stuff data into the database at once.

As I have over 900 YT feeds, this is a problem for me; such problems, IIRC, start to show up at around ~400 YT feeds with a network download speed of around 200 KiB/s.

Just to add more information to this: the underlying url-retrieve makes significant attempts to reuse connections, so you really are running up against request limits. It's not an issue of too many TCP connections.

Google is in the process of migrating from the v2 YT API to v3

This migration means YouTube is scheduled to stop providing all Atom and RSS feeds at the end of April 2015. Crazy, right?! Personally I'm super bummed out about this. I don't know if YouTube will continue to be usable after that point. It's such a stupid thing. So don't get too wedded to whatever YouTube solution you figure out in the near future.

then in each case you create a new feed.

This was a design decision I struggled with, and I believe I ultimately made the wrong choice. My concern was supporting poorly-made RSS feeds that had insufficiently unique "guid" tags. To cover for these situations, I decided to make the ID a tuple of the provided "GUID" with the feed URL. In practice this has turned out not to matter much. It would have been much better to risk not supporting crappy RSS feeds but keep all the advantages of Atom's "id" tag. If there's ever a backwards-incompatible database update for Elfeed, I would take the opportunity to fix this.
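To illustrate the consequence (the GUID value here is made up, and this is only a conceptual sketch of the "tuple of GUID and feed URL" idea, not Elfeed's actual internals):

;; The same video reached through two different feed URLs produces two
;; distinct entry identities, because the feed URL is part of the ID:
(let ((guid "yt:video:XYZ")
      (feed-a "http://gdata.youtube.com/feeds/api/users/ClojureTV/uploads?v=2")
      (feed-b "http://gdata.youtube.com/feeds/api/users/ClojureTV/uploads?v=2&key=MY_VERY_LONG_API_KEY_HERE"))
  (equal (cons guid feed-a) (cons guid feed-b))) ; => nil, so two separate entries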

The next time, I do not get any errors. I totally understand that this can be related to random network lag, a cheap router, or the network libraries, and that it is not directly related to elfeed, but the errors reported by elfeed are not very helpful here.

This is primarily a bug in url-retrieve, unfortunately. It's calling the callback function telling it the page fetch was a success, but the buffer it delivers is entirely empty, including no HTTP headers. Since it's infrequent and unrepeatable, I haven't been able to debug this. Elfeed could be reporting something slightly different than "wrong-type-argument" but there's no information to go on other than "url-retrieve delivered an empty buffer for unknown reasons."

I've noticed it's consistent for a certain subset of websites, especially my own (nullprogram.com). So it might be some specific server response or DNS configuration that's triggering this.

alphapapa commented 6 years ago

@skeeto Sorry if I missed it, but I can't find anything with a quick googling around. How did that Emacs patch turn out? Did it get rejected? Seems like adding support for SQLite to Emacs directly would be ideal.

skeeto commented 6 years ago

I didn't go anywhere with that patch. I got stuck on how to represent opaque SQLite handles in a garbage collection friendly way (process handles?). The right approach these days is to implement SQLite bindings as a module (Emacs 25) rather than a patch, which a few people have already done. Module "user pointers" are just about perfect for representing SQLite handles, and it's the thing I needed when writing my patch.

That being said, I've basically given up on the idea of using SQLite for Elfeed. Too many obstacles, and I value simple portability. The custom database is Good Enough right now, and I've since made significant efforts to improve Elfeed's elisp performance.

This allows Elfeed databases to scale further, and the march of technological progress, bringing us faster and faster computers, will gradually allow for larger and larger databases in the future. Emacs' own elisp performance occasionally improves, too, with compiler, language, and VM enhancements.

alphapapa commented 6 years ago

Cool, thanks for explaining, I'll check out those links.

The only concern I have is about robustness. I know SQLite is very good about recovery from crashes, power loss, etc. Emacs handles this well for regular files that are open in buffers by saving them to temp files regularly and offering to recover them. Does Elfeed handle this as well, or is there risk of db corruption in these cases? I know it's not an easy problem to solve from scratch.

skeeto commented 6 years ago

The database is written out the same way Emacs normally writes out regular files. As long as write-region-inhibit-fsync is nil (the default), it will do something like:

1) write the database index to a temp file
2) fsync
3) rename the temp file over the target file
4) fsync

This means it's not possible to observe a partially-written database index even if there was a power failure in the middle of saving. It's actually really easy in this case because the entire index is written in full each time it's saved. It's a lot more difficult if a file is updated in place, especially if under contention with other threads and processes (e.g. the problem that SQLite solves). The loose "content" files are written with fsync disabled since this data is unimportant, and disabling fsync has dramatic performance improvements for that part of the database.
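A minimal Elisp sketch of that pattern (not Elfeed's actual code; `my/write-atomically' and the temp-file name are illustrative, and the final directory fsync is left out):

(defun my/write-atomically (string file)
  "Write STRING to FILE via a temp file plus rename, so a reader can never
observe a partially-written FILE."
  (let ((temp (concat file ".tmp"))
        (write-region-inhibit-fsync nil))      ; keep fsync enabled (the default)
    (write-region string nil temp nil :silent) ; 1) write the temp file, 2) fsync
    (rename-file temp file t)))                ; 3) rename over the target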

alphapapa commented 6 years ago

Thanks.