muchdogesec / history4feed

Creates a complete full text historical archive for an RSS or ATOM feed.
https://www.dogesec.com/
Apache License 2.0

Add parameter to post feed endpoint #17

Closed. himynamesdave closed this issue 3 months ago

himynamesdave commented 3 months ago

I've noticed that some RSS feeds link to posts on other blogs in the <link> element of an item.

e.g.

https://grahamcluley.com/category/security-threats/ransomware-malware/feed/

        <item>
            <title>Better resilience sees more extorted companies refuse to pay their ransomware attackers</title>
            <link>https://www.tripwire.com/state-of-security/better-resilience-sees-more-extorted-companies-refuse-pay-their-ransomware</link>
            <comments>https://www.tripwire.com/state-of-security/better-resilience-sees-more-extorted-companies-refuse-pay-their-ransomware#respond</comments>

            <dc:creator><![CDATA[Graham Cluley]]></dc:creator>
            <pubDate>Fri, 28 Jun 2024 09:05:22 +0000</pubDate>
            <category><![CDATA[Data loss]]></category>
            <category><![CDATA[Guest blog]]></category>
            <category><![CDATA[Ransomware]]></category>
            <category><![CDATA[data breach]]></category>
            <category><![CDATA[ransomware]]></category>
            <guid isPermaLink="false">https://grahamcluley.com/?p=16344885</guid>

            <description><![CDATA[There's some possibly good news on the ransomware front.

Companies are becoming more resilient to attacks, and the ransom payments extorted from businesses by hackers are on a downward trend.

Read more in my article on the Tripwire State of Security blog.]]></description>

            <wfw:commentRss>https://www.tripwire.com/state-of-security/better-resilience-sees-more-extorted-companies-refuse-pay-their-ransomware/feed/</wfw:commentRss>
            <slash:comments>0</slash:comments>

        </item>

This is usually done when companies pay the blog to promote their own posts.

That said, this is problematic for most of our use cases.

To avoid this, we should add an option that lets the user download only those posts whose <link> domain matches the domain of the RSS feed.

e.g. if the feed is

https://grahamcluley.com/category/security-threats/ransomware-malware/feed/

then

<link>https://grahamcluley.com/smashing-security-podcast-378/</link> would be downloaded, but

<link>https://www.tripwire.com/state-of-security/better-resilience-sees-more-extorted-companies-refuse-pay-their-ransomware</link> would not.
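The domain check in this example could be sketched roughly as below. This is only an illustration: the helper names normalised_host and is_same_domain are made up here, and treating "www.example.com" and "example.com" as the same domain is an assumption the issue does not spell out.

```python
from urllib.parse import urlsplit


def normalised_host(url: str) -> str:
    """Return the lower-cased hostname of a URL without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: a leading "www." is not significant when comparing domains.
    return host[4:] if host.startswith("www.") else host


def is_same_domain(feed_url: str, link_url: str) -> bool:
    """True when the item link is hosted on the same domain as the feed."""
    feed_host = normalised_host(feed_url)
    return bool(feed_host) and feed_host == normalised_host(link_url)


FEED_URL = "https://grahamcluley.com/category/security-threats/ransomware-malware/feed/"
LOCAL_LINK = "https://grahamcluley.com/smashing-security-podcast-378/"
REMOTE_LINK = (
    "https://www.tripwire.com/state-of-security/"
    "better-resilience-sees-more-extorted-companies-refuse-pay-their-ransomware"
)

print(is_same_domain(FEED_URL, LOCAL_LINK))
print(is_same_domain(FEED_URL, REMOTE_LINK))
```

An exact hostname comparison like this would treat subdomains (for example a hypothetical blog.grahamcluley.com) as remote; whether that is desirable is an open design question.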

Add this as a setting on the POST feed endpoint:

include_remote_blogs, a boolean (default is false). If set to true, history4feed will also try to download posts hosted on remote sites.
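A minimal sketch of how such a flag might gate which item links get downloaded. Only the name include_remote_blogs and its default come from this issue; the function links_to_download, the helper _host, the "www." normalisation, and the trimmed sample feed are assumptions for illustration, not history4feed's actual code.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def _host(url: str) -> str:
    """Lower-cased hostname without a leading 'www.' (an assumed normalisation)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def links_to_download(feed_url: str, rss_xml: str, include_remote_blogs: bool = False) -> list:
    """Return the <link> of each <item> that should be fetched.

    With include_remote_blogs=False (the proposed default), links whose
    domain differs from the feed's domain are skipped.
    """
    feed_host = _host(feed_url)
    selected = []
    for item in ET.fromstring(rss_xml).iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue
        if include_remote_blogs or _host(link) == feed_host:
            selected.append(link)
    return selected


FEED_URL = "https://grahamcluley.com/category/security-threats/ransomware-malware/feed/"
# Heavily trimmed stand-in for the real feed, keeping only the two links discussed above.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item><link>https://grahamcluley.com/smashing-security-podcast-378/</link></item>
<item><link>https://www.tripwire.com/state-of-security/better-resilience-sees-more-extorted-companies-refuse-pay-their-ransomware</link></item>
</channel></rss>"""

print(links_to_download(FEED_URL, SAMPLE_RSS))
print(links_to_download(FEED_URL, SAMPLE_RSS, include_remote_blogs=True))
```

With the default, only the grahamcluley.com link is returned; with include_remote_blogs=True, both links are returned.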

Also include this setting in the data for every job run for this feed, so that the user can see whether it was set when checking a job.
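One way to picture this is to copy the value onto the job at the moment the job is created. The FeedSettings and JobRecord classes and the start_job function below are hypothetical and are not history4feed's real models; only the include_remote_blogs field name comes from this issue.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FeedSettings:
    """Hypothetical stand-in for the options supplied when a feed is posted."""
    url: str
    include_remote_blogs: bool = False


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical job record that snapshots the setting used for the run."""
    job_id: str
    feed_url: str
    include_remote_blogs: bool


def start_job(job_id: str, feed: FeedSettings) -> JobRecord:
    # Copy the value at job creation time so the job still shows what was
    # used, even if the feed's setting is changed afterwards.
    return JobRecord(job_id, feed.url, feed.include_remote_blogs)


feed = FeedSettings("https://grahamcluley.com/category/security-threats/ransomware-malware/feed/")
job = start_job("example-job-1", feed)
print(asdict(job))
```

Storing a snapshot on the job, rather than reading the feed's current value, is a design choice assumed here so that older jobs remain accurate.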