BSD 3-Clause "New" or "Revised" License

gofeed

Gofeed was inspired by feed43.com. It is designed to extract full-text feeds from websites that provide only partial feeds or no feeds at all.

I wrote this simple program while I was starting to learn Go, so I tried to reinvent everything I needed, including a simple crawler that makes good use of a cache and a very simple RSS 2.0 feed generator.

Releases

See http://dl.atime.me/gofeed.

Features

Things that need to be improved

More functions on the todo list

  1. Cache old requests: use SQLite to cache downloaded web pages and save their lastmod time.
  2. Download the HTML files for each feed target defined in the configuration in separate goroutines.
  3. Add a debug mode that prints more debug information.
  4. Add alternative methods to extract the feed title, link and description from HTML:
    1. xpath

Install

First, make sure the GOPATH environment variable is set properly. Then install the SQLite driver, go-sqlite3.

go get github.com/mattn/go-sqlite3

Now install gofeed.

go get github.com/mawenbao/gofeed

Configuration example

JSON configuration

See example_config.json and example_config2.json.

Note that:

  1. The Feed.URL and Feed.IndexPattern arrays should have the same length. If they do not, one of them must contain exactly one element: either all Feed.URL entries share the single Feed.IndexPattern, or all Feed.IndexPattern entries share the single Feed.URL. Otherwise, a configuration parse error is returned.

  2. The Feed.URL and Feed.ContentPattern arrays should also have the same length. If they do not, there must be exactly one Feed.ContentPattern, which all Feed.URL entries then share. The same goes for Feed.PubDateFormat.

  3. Either Feed.IndexPattern or Feed.ContentPattern can contain the {pubdate} pattern, but not both.
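
The array-length rule for Feed.URL and Feed.IndexPattern can be illustrated with a small sketch. This is not gofeed's own parsing code: `pairURLsWithPatterns` is a hypothetical helper, and the URL and pattern strings are placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// pairURLsWithPatterns is a hypothetical helper, not part of gofeed.
// Equal-length slices are paired one-to-one; a single-element slice is
// shared by every element of the other slice; anything else is an error.
func pairURLsWithPatterns(urls, patterns []string) ([][2]string, error) {
	if len(urls) == 0 || len(patterns) == 0 {
		return nil, errors.New("need at least one URL and one pattern")
	}
	if len(urls) != len(patterns) && len(urls) != 1 && len(patterns) != 1 {
		return nil, fmt.Errorf("cannot pair %d URLs with %d patterns", len(urls), len(patterns))
	}

	n := len(urls)
	if len(patterns) > n {
		n = len(patterns)
	}
	pairs := make([][2]string, 0, n)
	for i := 0; i < n; i++ {
		// The modulo maps every index to 0 for a single-element slice.
		pairs = append(pairs, [2]string{urls[i%len(urls)], patterns[i%len(patterns)]})
	}
	return pairs, nil
}

func main() {
	// Two URLs share one index pattern.
	pairs, err := pairURLsWithPatterns(
		[]string{"http://example.com/a", "http://example.com/b"},
		[]string{"index-pattern-1"},
	)
	fmt.Println(pairs, err)

	// Three URLs with two patterns cannot be paired.
	_, err = pairURLsWithPatterns(
		[]string{"u1", "u2", "u3"},
		[]string{"p1", "p2"},
	)
	fmt.Println(err)
}
```

For Feed.ContentPattern and Feed.PubDateFormat the rule is stricter, since only the pattern side may be a single shared element.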

Pre-defined patterns

You can use the following predefined patterns in Feed.IndexPattern and Feed.ContentPattern in the JSON configuration. All of these patterns are lazy and match leftmost, which means they match as few characters as possible.

Date-time format patterns, currently used to parse the publish date string extracted by the {pubdate} pattern. Unlike the other predefined patterns, all of these date-related patterns are greedy.
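
To make the lazy versus greedy distinction concrete, here is a minimal sketch using Go's regexp package. The two expressions are illustrative only and are not the ones gofeed expands its predefined patterns into; `firstGroup` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstGroup returns the first capture group of the leftmost match,
// or an empty string if the expression does not match.
func firstGroup(expr, input string) string {
	m := regexp.MustCompile(expr).FindStringSubmatch(input)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	html := `<a>first</a><a>second</a>`

	// Lazy: stops at the first closing tag.
	fmt.Println(firstGroup(`<a>(.*?)</a>`, html)) // first

	// Greedy: runs on to the last closing tag.
	fmt.Println(firstGroup(`<a>(.*)</a>`, html)) // first</a><a>second
}
```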

Custom regular expressions

You can also write custom regular expressions in Feed.IndexPattern and Feed.ContentPattern. Make sure your custom regular expressions do not contain any predefined patterns. The regex syntax documentation can be found here.

Custom regular expressions have not been tested properly, so I suggest using only the predefined patterns.

Command line options

Usage ./gofeed [-version][-v][-d][-c cpu_number][-l log_file][-k][-z compression_level] json_config_file

Flags:
-a=false: use cache if failed to download web page
-c=2: number of cpus to run simultaneously
-v=false: be verbose
-d=false: debug mode
-l="": path of the log file
-k=false: keep feed entries which do not have any description
-z=9: compression level when saving html cache with gzip in the cache database.
    0-9 acceptable where 0 means no compression
-version=false: print gofeed version

License

BSD license, see LICENSE.txt for more details.