mmcdole / gofeed

Parse RSS, Atom and JSON feeds in Go
MIT License
2.56k stars 208 forks

Limit the maximum number of feed items when parsing #199

Open imirzadeh opened 1 year ago

imirzadeh commented 1 year ago

Motivations

1- Some feeds are not well-maintained, especially podcasts like NYTimes Daily or this one. In those feeds, all or most of the history (hundreds of items) is usually included, which leads to very long parsing times.

2- Usually, people use RSS parser libraries to check the latest news and don't necessarily need every item to be parsed. Given that gofeed parses XML with a pull-style decoder, it could stop processing at a certain point and save time and compute (a minimal sketch of that idea follows below).
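For illustration, here is a minimal sketch of pull-style decoding with the standard encoding/xml package that stops once a limit is reached; the package and function names are made up and this is not gofeed's actual code:

package feedsketch

import (
    "encoding/xml"
    "io"
)

// countItemsUpTo decodes XML tokens one at a time and stops as soon as
// `limit` <item> start elements have been seen, rather than reading the
// whole document. Illustrative only; not gofeed's implementation.
func countItemsUpTo(r io.Reader, limit int) (int, error) {
    dec := xml.NewDecoder(r)
    count := 0
    for count < limit {
        tok, err := dec.Token()
        if err == io.EOF {
            break
        }
        if err != nil {
            return count, err
        }
        if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "item" {
            count++
        }
    }
    return count, nil
}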

Suggestion

For the reasons explained above, I think we can add an optional ParseConfig object to the Parser so feeds can be parsed faster.
To make things backward compatible, we can add a SetConfig method that users can explicitly call to change options.
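As a rough sketch of what that might look like from the caller's side (SetConfig, ParseConfig, and MaxItems are only the names proposed here, not an existing gofeed API):

package main

import (
    "fmt"
    "log"

    "github.com/mmcdole/gofeed"
)

func main() {
    fp := gofeed.NewParser()

    // Hypothetical call: SetConfig, ParseConfig, and MaxItems are only the
    // names proposed in this issue, not part of the current gofeed API.
    fp.SetConfig(gofeed.ParseConfig{MaxItems: 10})

    feed, err := fp.ParseURL("https://example.com/feed.xml")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(len(feed.Items)) // at most 10 items with the proposed limit
}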

imirzadeh commented 1 year ago

Update

I did a little bit of digging and found out that implementing this is actually not as easy as it seems. For the NYTimes feed above, with some hacks, I managed to improve performance significantly when I stopped parsing after seeing 10 RSS items:


BenchmarkGoFeed-All          8    139441776 ns/op    141147331 B/op    819102 allocs/op
BenchmarkGoFeed-10+Skip     10    107165758 ns/op    101558910 B/op    526856 allocs/op
BenchmarkGoFeed-10+Stop    273      4106181 ns/op     64155609 B/op      5919 allocs/op

So in the results above we compare parsing the whole feed (All), keeping only the first 10 items while still scanning the rest of the document (10+Skip), and stopping the parser entirely after the first 10 items (10+Stop).
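For reference, a benchmark along these lines can be written with the standard testing package; the file path is an assumption and only the "All" case is sketched here:

package gofeed_test

import (
    "bytes"
    "os"
    "testing"

    "github.com/mmcdole/gofeed"
)

// BenchmarkGoFeedAll parses the full feed on every iteration.
// "testdata/nytimes.xml" is an assumed local copy of the feed.
func BenchmarkGoFeedAll(b *testing.B) {
    data, err := os.ReadFile("testdata/nytimes.xml")
    if err != nil {
        b.Fatal(err)
    }
    fp := gofeed.NewParser()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := fp.Parse(bytes.NewReader(data)); err != nil {
            b.Fatal(err)
        }
    }
}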

mmcdole commented 1 year ago

This is a great idea.

I'm also keen on introducing a config object and putting some other things in there as well, like user agent, proxy, and strictness settings.

We could put it behind a setter, as you suggested, or cut a v2 breaking change and clean up the interface a bit.

mmcdole commented 1 year ago

I've been mulling over what options we might want in this config object, either ones that have been requested in the past or ones that might be beneficial to give control over.

These are what I've come up with. Let me know your thoughts on these if you get a chance.

type ParseOptions struct {
    // MaxItems specifies the maximum number of feed items to parse.
    // The default value is 0, which means no limit.
    MaxItems int

    // ParseDates determines if the feed parser will attempt to parse dates into `time.Time` objects.
    // The default value is true.
    ParseDates bool

    // KeepOriginalFeed specifies if the parser should retain the raw feed in the `Feed` struct's `RawFeed` field.
    // The default value is false.
    KeepOriginalFeed bool

    // StrictMode determines if strict parsing rules should be applied.
    // If set to true, all strictness settings are enabled.
    // The default value is false, and all StrictnessOptions default to their least strict settings.
    StrictMode bool

    // StrictnessOptions holds the options for controlling the strictness of the parsing.
    StrictnessOptions StrictnessOptions

    // RequestOptions holds the options for the HTTP request.
    RequestOptions RequestOptions
}

type StrictnessOptions struct {
    // StripInvalidCharacters specifies if invalid feed characters should be stripped out.
    // The default value is true.
    StripInvalidCharacters bool

    // AutoCloseTags specifies if the parser should automatically close unclosed tags.
    // The default value is true.
    AutoCloseTags bool

    // AllowUndisclosedXMLNamespaces specifies if the parser should allow undisclosed XML namespaces.
    // The default value is true.
    AllowUndisclosedXMLNamespaces bool

    // AllowIncorrectDateFormats specifies if the parser should allow incorrect date formats.
    // The default value is true.
    AllowIncorrectDateFormats bool

    // AllowUnescapedMarkup specifies if the parser should allow unescaped / naked markup in feed elements.
    // The default value is true.
    AllowUnescapedMarkup bool
}

type RequestOptions struct {
    // Timeout is the maximum amount of time to wait for the request to complete.
    // The default value is 0, which means no timeout.
    Timeout time.Duration

    // UserAgent is the value of the `User-Agent` header to be sent in the request.
    // This header is used to identify the client software and version to the server.
    // The default value is "gofeed/2.0.0" (the current version of gofeed).
    UserAgent string

    // IfNoneMatch is the value of the `If-None-Match` header to be sent in the request.
    // This header is used to make a conditional request for a resource, and the server will return the resource only if it does not match the provided entity tag.
    // If the server has a matching `Etag`, it will not respond with the full feed, and instead return a `304 Not Modified` response.
    // The default value is an empty string, which means no `If-None-Match` header will be sent.
    IfNoneMatch string

    // IfModifiedSince is the value of the `If-Modified-Since` header to be sent in the request.
    // This header is used to make a conditional request for a resource, and the server will return the resource only if it has been modified since the provided time.
    // If the server determines that the resource has not been modified since the provided `If-Modified-Since` time, it will not respond with the full feed, and instead return a `304 Not Modified` response.
    // The default value is an empty string, which means no `If-Modified-Since` header will be sent.
    IfModifiedSince string
}
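For context on how the two conditional-request fields behave, here is a hedged sketch of a plain net/http conditional GET; the header names are standard HTTP, while the URL, variable names, and caching choices are illustrative:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/mmcdole/gofeed"
)

func main() {
    // Values remembered from a previous successful fetch (illustrative).
    lastETag := `"abc123"`
    lastModified := "Mon, 02 Jan 2006 15:04:05 GMT"

    req, err := http.NewRequest("GET", "https://example.com/feed.xml", nil)
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("User-Agent", "gofeed/2.0.0")
    req.Header.Set("If-None-Match", lastETag)
    req.Header.Set("If-Modified-Since", lastModified)

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode == http.StatusNotModified {
        // 304: the feed is unchanged since the last fetch, skip re-parsing.
        fmt.Println("feed unchanged")
        return
    }

    feed, err := gofeed.NewParser().Parse(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(feed.Title, len(feed.Items))
}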

I also realized that the universal feed parser's Detector is reading in the entire feed. This is a bug/mistake; I want to read in only a small buffered amount and let the feed be parsed through a buffered reader. So I'll need to clean that up. That might help in the scenario you describe above, with a very large feed where you only intend to read a subset of it? If you were using the universal parser, at least.
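As a rough illustration of that buffered approach (not the actual Detector code), bufio.Reader.Peek can expose the first bytes for type detection without consuming them, so the same reader can still be handed to the parser afterwards; the package name and heuristics below are made up:

package feeddetect

import (
    "bufio"
    "bytes"
    "io"
)

// DetectFeedType peeks at the first bytes of the stream to guess the feed
// type without consuming them; the returned *bufio.Reader still yields the
// full stream. The heuristics are deliberately crude and only illustrate
// the buffered-reader idea, not gofeed's Detector.
func DetectFeedType(r io.Reader) (string, *bufio.Reader) {
    br := bufio.NewReaderSize(r, 4096)
    head, _ := br.Peek(1024) // Peek does not advance the reader
    head = bytes.TrimSpace(head)
    switch {
    case bytes.HasPrefix(head, []byte("{")):
        return "json", br
    case bytes.Contains(head, []byte("<rss")):
        return "rss", br
    case bytes.Contains(head, []byte("<feed")):
        return "atom", br
    default:
        return "unknown", br
    }
}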

imirzadeh commented 1 year ago

I thought about the options. Overall, I like your proposal!

I have been working on extracting feeds for over a year now (for a personal project). Given the challenges I faced, I thought I'd share my suggestions:

1- This is very subjective, but I think request fetching is out of scope for gofeed, which is concerned with parsing. The fact that it can parse from URLs is fantastic, but if someone needs a very advanced fetching mechanism, they can implement it themselves. For instance, some RSS feeds have huge body sizes and require special headers, auth, etc. Overall, I think parsing differs from fetching, and it may not be worth it if it makes the code/API more complex. But this is just my opinion, obviously :)

2- Regarding the ParseOptions and StrictnessOptions, I pretty much like the proposal. Another highly optional strictness field could be HTML sanitization for security reasons, but I think users can do this as a post-processing step, and I don't think it is critical at all. One other suggestion could be a skip list, where some tags are skipped if it improves performance. I'm not sure it does, since, based on my benchmark, most of gofeed's time is spent decoding elements.

3- Related to the previous suggestion, I think gofeed could have separate utility functions that are helpful when working with feeds. For instance, we could detect RSS links based on some heuristics. Other utilities could be URL resolution and HTML sanitization (see the sketch below). I may be able to help with these.
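As one example of the kind of helper such a utility package could offer, here is a small sketch of relative-link resolution with net/url; the package and function names are made up for illustration:

package feedutil

import "net/url"

// ResolveLink resolves a possibly relative item link against the feed's
// base URL, e.g. ResolveLink("https://example.com/blog/", "post/1") returns
// "https://example.com/blog/post/1". Illustrative only.
func ResolveLink(base, href string) (string, error) {
    b, err := url.Parse(base)
    if err != nil {
        return "", err
    }
    h, err := url.Parse(href)
    if err != nil {
        return "", err
    }
    return b.ResolveReference(h).String(), nil
}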

All in all, I think the proposal is very nice! Thank you again for sharing gofeed with us! :)

mmcdole commented 1 year ago

Thanks @imirzadeh.

I think you are right. I'll leave the ParseURL functions, but will encourage people who want to have custom behavior to use their own client implementation.