shouya / rss-funnel

Self-hosted RSS multi-tool
https://rss-funnel-demo.fly.dev
GNU General Public License v3.0

error fetching full text: HTTP status error 403 Forbidden #145

Closed MrCaringi closed 1 month ago

MrCaringi commented 1 month ago

Hi, for some feeds, I am getting this kind of error on every item in the feed:

error fetching full text: HTTP status error 403 Forbidden (url: https://lat.motorsport.com/f1/news/hamilton-interes-motogp-equipo-liberty/10653282/?utm_source=RSS&utm_medium=referral&utm_campaign=RSS-MOTOGP&utm_term=News&utm_content=lat)

This is my current config file, funnel.yml:

endpoints:
  - path: /full-text.xml
    note: Full text of any Source
    filters:
      - full_text: {}
      - simplify_html: {}

feed: https://lat.motorsport.com/rss/motogp/news/

This instance is installed on Ubuntu 22.04.4 LTS (aarch64) with Docker Compose.

[Screenshot: 2024-09-11 19 52 50]

Thanks in advance for your support!

shouya commented 1 month ago

Apparently the webpage blocks non-browser requests. Compare the output of:

$ curl -I -H 'user-agent: random' 'https://lat.motorsport.com/f1/news/hamilton-interes-motogp-equipo-liberty/10653282/?utm_source=RSS&utm_medium=referral&utm_campaign=RSS-MOTOGP&utm_term=News&utm_content=lat'
HTTP/2 403
...

and

$ curl -I -H 'user-agent: Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0' 'https://lat.motorsport.com/f1/news/hamilton-interes-motogp-equipo-liberty/10653282/?utm_source=RSS&utm_medium=referral&utm_campaign=RSS-MOTOGP&utm_term=News&utm_content=lat'
HTTP/2 200
...

So it should be possible to get around the block by setting full_text.client.user_agent to a browser user-agent. The resulting config should look like this:

endpoints:
  - path: /full-text.xml
    note: Full text of any Source
    filters:
      - full_text:
          client:
            user_agent: "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0"
      - simplify_html: {}
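
The same comparison can be reproduced offline. The following is only a minimal sketch: `PickyHandler` is a hypothetical local stand-in for a site that rejects non-browser user agents (its "starts with `Mozilla/`" rule is an assumption, not motorsport.com's real logic), and `head_status` and `demo` are illustrative helper names. It sends a HEAD request, as `curl -I` does, once per user agent and reports the status codes.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BROWSER_UA = "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0"


class PickyHandler(BaseHTTPRequestHandler):
    """Hypothetical server that answers 403 unless the UA looks like a browser."""

    def do_HEAD(self):
        ua = self.headers.get("User-Agent", "")
        self.send_response(200 if ua.startswith("Mozilla/") else 403)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def head_status(url: str, user_agent: str) -> int:
    """Send a HEAD request (like `curl -I`) and return the HTTP status code."""
    req = urllib.request.Request(url, method="HEAD", headers={"User-Agent": user_agent})
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(req, timeout=20) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError


def demo() -> dict:
    """Start the local server on a free port and probe it with both user agents."""
    server = HTTPServer(("127.0.0.1", 0), PickyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/article"
    try:
        return {ua: head_status(url, ua) for ua in ("random", BROWSER_UA)}
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for ua, status in demo().items():
        print(status, ua)
```

To probe a real site instead, call `head_status` directly with the article URL and each user agent.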

MrCaringi commented 1 month ago

Thanks a lot! Let me try it.

MrCaringi commented 1 month ago

I updated my funnel.yml config file, now it looks like this:

endpoints:
  - path: /full-text.xml
    note: Full text of any Source
    filters:
      - full_text: {}
      - simplify_html: {}

  - path: /agent.xml
    note: Full with user-agent
    filters:
      - full_text:
          client:
            user_agent: "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0"
      - simplify_html: {}

When I applied this new config, I got this error:

rss-funnel-1 exited with code 1
rss-funnel-1  | 2024-09-13T01:47:13.616608Z  INFO rss_funnel::server: loading config from "/funnel.yaml"
rss-funnel-1  | Error: Config(Yaml(Error("endpoints[1]: missing field `timeout`", line: 8, column: 5)))

So, I was looking for this parameter in your repository, and found this: https://github.com/shouya/rss-funnel/blob/29141c5c351f21031fb12c0b0704840076f6f3cd/src/client.rs#L53

So I tried this config (I added the timeout parameter at the endpoint level):

endpoints:
  - path: /full-text.xml
    note: Full text of any Source
    filters:
      - full_text: {}
      - simplify_html: {}

  - path: /agent.xml
    note: Full with user-agent
    timeout: "2m"
    filters:
      - full_text:
          client:
            user_agent: "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0"
      - simplify_html: {}

But this didn't do the trick; I get the same error:

[+] Running 2/2
 ✔ Network dc-rss-funnel_default  Created  0.1s
 ✔ Container rss-funnel           Started  0.3s
rss-funnel exited with code 1
rss-funnel  | 2024-09-13T01:53:14.248547Z  INFO rss_funnel::server: loading config from "/funnel.yaml"
rss-funnel  | Error: Config(Yaml(Error("endpoints[1]: missing field `timeout`", line: 8, column: 5)))

Any idea what I am doing wrong?

shouya commented 1 month ago

Sorry, it's a bug. The timeout field is supposed to be optional. I will fix it in the next release.

For the time being you can manually specify the timeout as follows:

  - path: /agent.xml
    note: Full with user-agent
    filters:
      - full_text:
          client:
            user_agent: "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0"
            timeout: "20s"
      - simplify_html: {}
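
Until the fix lands, a config can be checked for this pattern before restarting the container. This is a rough sketch only: it assumes the config has already been parsed into a plain dict (for example with a YAML library), and that the error is triggered whenever a filter's `client` block omits `timeout`, which is inferred from this thread rather than verified against the source. `clients_missing_timeout` is a hypothetical helper, not part of rss-funnel.

```python
from typing import Any


def clients_missing_timeout(config: dict[str, Any]) -> list[str]:
    """Return the paths of endpoints with a filter `client` block lacking `timeout`."""
    flagged = []
    for endpoint in config.get("endpoints", []):
        for flt in endpoint.get("filters", []):
            # Each filter is a single-key mapping, e.g. {"full_text": {...}}.
            for options in flt.values():
                if not isinstance(options, dict):
                    continue  # filters configured with a scalar have no client block
                client = options.get("client")
                if client is not None and "timeout" not in client:
                    flagged.append(endpoint.get("path", "<no path>"))
    return flagged


# The /agent.xml endpoint as first posted in this thread (no client timeout).
config = {
    "endpoints": [
        {"path": "/full-text.xml", "filters": [{"full_text": {}}, {"simplify_html": {}}]},
        {
            "path": "/agent.xml",
            "filters": [
                {"full_text": {"client": {"user_agent": "Mozilla/5.0"}}},
                {"simplify_html": {}},
            ],
        },
    ]
}

if __name__ == "__main__":
    print(clients_missing_timeout(config))
```

An empty result means every `client` block already carries an explicit `timeout`.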

MrCaringi commented 1 month ago

Thanks! It works!!!!

Thanks a lot!

[+] Running 2/2
 ✔ Network dc-rss-funnel_default  Created  0.1s
 ✔ Container rss-funnel           Started  0.4s
rss-funnel  | 2024-09-14T00:00:36.199292Z  INFO rss_funnel::server: loading config from "/funnel.yaml"
rss-funnel  | 2024-09-14T00:00:36.211102Z  INFO rss_funnel::server::feed_service: loaded endpoint: /full-text.xml
rss-funnel  | 2024-09-14T00:00:36.211517Z  INFO rss_funnel::server::feed_service: loaded endpoint: /agent.xml
rss-funnel  | 2024-09-14T00:00:36.211690Z  INFO rss_funnel::server: listening on 0.0.0.0:4080
rss-funnel  | 2024-09-14T00:00:36.215572Z  INFO rss_funnel::server::image_proxy: handling image proxy: /_image
rss-funnel  | 2024-09-14T00:00:36.216381Z  INFO rss_funnel::server: starting server

Before, without user-agent (path: /full-text.xml)

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Motorsport.com - MotoGP - Historias</title>
    <link>https://lat.motorsport.com/motogp/news/?utm_source=RSS&amp;utm_medium=referral&amp;utm_campaign=RSS-MOTOGP&amp;utm_term=News&amp;utm_content=lat</link>
    <description></description>
    <pubDate>Sat, 14 Sep 2024 00:02:27 +0000</pubDate>
    <generator>Zend_Feed</generator>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <ttl>100</ttl>
    <atom:link href="http://lat.motorsport.com/rss/motogp/news/" rel="self" type="application/rss+xml">
    </atom:link>
    <item>
      <title>Valentino Rossi, muy duro contra Márquez: &quot;Nadie fue tan sucio como él&quot;</title>
      <link>https://lat.motorsport.com/motogp/news/valentino-rossi-marquez-nadie-tan-sucio/10653438/?utm_source=RSS&amp;utm_medium=referral&amp;utm_campaign=RSS-MOTOGP&amp;utm_term=News&amp;utm_content=lat</link>
      <description><![CDATA[Andrea Mingo, miembro de la VR46 Riders Academy, se quedó sin moto en el Mundial y el equipo de Valentino Rossi lo ha mantenido como 'coach', además de animador y conductor del podcast 'Mig Babol', que el #46 está aprovechando para rememorar viejas luchas, ya que recientemente también explicó su enfrentamiento con Max Biaggi.Esta vez se centró en su desencuentro con Marc Márquez y lo ...<a href="https://lat.motorsport.com/motogp/news/valentino-rossi-marquez-nadie-tan-sucio/10653438/?utm_source=RSS&amp;utm_medium=referral&amp;utm_campaign=RSS-MOTOGP&amp;utm_term=News&amp;utm_content=lat">Sigue leyendo</a><br><br><p>
error fetching full text: HTTP status error 403 Forbidden (url: https://lat.motorsport.com/motogp/news/valentino-rossi-marquez-nadie-tan-sucio/10653438/?utm_source=RSS&amp;utm_medium=referral&amp;utm_campaign=RSS-MOTOGP&amp;utm_term=News&amp;utm_content=lat)</p>]]></description>
      <category>MotoGP</category>
      <enclosure url="https://cdn-5.motorsport.com/images/amp/2QzBkPNY/s6/valentino-rossi-yamaha-factory.jpg" length="246945" type="image/jpeg"/>
      <guid isPermaLink="false">10653438</guid>
      <pubDate>Thu, 12 Sep 2024 12:36:18 +0000</pubDate>
    </item>

After, with user-agent (path: /agent.xml)

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Motorsport.com - MotoGP - Historias</title>
    <link>https://lat.motorsport.com/motogp/news/?utm_source=RSS&amp;utm_medium=referral&amp;utm_campaign=RSS-MOTOGP&amp;utm_term=News&amp;utm_content=lat</link>
    <description></description>
    <pubDate>Sat, 14 Sep 2024 00:02:27 +0000</pubDate>
    <generator>Zend_Feed</generator>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <ttl>100</ttl>
    <atom:link href="http://lat.motorsport.com/rss/motogp/news/" rel="self" type="application/rss+xml">
    </atom:link>
    <item>
      <title>Valentino Rossi, muy duro contra Márquez: &quot;Nadie fue tan sucio como él&quot;</title>
      <link>https://lat.motorsport.com/motogp/news/valentino-rossi-marquez-nadie-tan-sucio/10653438/?utm_source=RSS&amp;utm_medium=referral&amp;utm_campaign=RSS-MOTOGP&amp;utm_term=News&amp;utm_content=lat</link>
      <description><![CDATA[<p>Andrea Mingo, miembro de la VR46 Riders Academy, se quedó sin moto en el Mundial y el equipo de <a rel="noopener" href="https://lat.motorsport.com/driver/valentino-rossi/464484/" target="_blank">Valentino Rossi</a> lo ha mantenido como 'coach', además de animador y conductor del podcast 'Mig Babol', que el #46 está aprovechando para rememorar viejas luchas, ya que recientemente también explicó su enfrentamiento con Max Biaggi.</p><p>Esta vez se centró en su desencuentro con <a rel="noopener" href="https://lat.motorsport.com/driver/marc-marquez/463649/" target="_blank">Marc Márquez</a> y lo sucedido en 2015 y a partir de ahí. Paso a paso, Rossi fue reconstruyendo su versión de los hechos, vomitando todo su odio contra el corredor de Cervera, que en 2019 sumó su octava corona mundial y le acecha en el palmarés.</p><p>"Es lo más feo que me ha pasado a nivel deportivo absolutamente", recuerda lo sucedido en Argentina, a principios de aquel año. "La disputa con Márquez había empezado en Argentina. Él había elegido el neumático medio trasero, yo el duro. Se había escapado, pero me recuperé y lo alcancé. Llegué a él e iba mucho más rápido, así que para mí fue fácil adelantarlo. Le tomé el rebufo en la recta después de la curva 3 y frené bien para adelantarlo. Llegué, entré en la curva de derechas y hasta ahí siempre nos habíamos llevado bien, pero se me echó encima a fondo", recuerda Rossi con detalle en el podcast.</p><p>"Lo pasé y él pensó que la única oportunidad que tenía era chocar conmigo. Intentó derribarme enseguida, vino deliberadamente a por mí para intentar tirarme. No quería perder. Volví a mi línea, desafortunadamente nos tocamos. Tú me la das, yo te la devuelvo. Entonces (Marc) se cayó. A partir de ahí nuestra relación se vino abajo. 
A pesar de ese episodio, siguió pretendiendo llevarse bien conmigo y besarme el culo", se mofa el italiano.</p><section data-custom="false" data-author="" data-title="" draggable="true" data-id="40770010" data-show-author="true" data-show-title="true" data-link="" data-widget="image" data-src="//cdn.motorsport.com/images/mgl/Y99z13AY/s8/valentino-rossi-yamaha-factor-1.jpg" data-height="534" contenteditable="false" data-width="800"><picture><source srcset="https://cdn.motorsport.com/images/mgl/Y99z13AY/s200/valentino-rossi-yamaha-factor-1.webp%20200w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s300/valentino-rossi-yamaha-factor-1.webp%20300w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s400/valentino-rossi-yamaha-factor-1.webp%20400w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s500/valentino-rossi-yamaha-factor-1.webp%20500w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s600/valentino-rossi-yamaha-factor-1.webp%20600w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s700/valentino-rossi-yamaha-factor-1.webp%20700w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s800/valentino-rossi-yamaha-factor-1.webp%20800w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s900/valentino-rossi-yamaha-factor-1.webp%20900w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s1000/valentino-rossi-yamaha-factor-1.webp%201000w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s1100/valentino-rossi-yamaha-factor-1.webp%201100w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s1200/valentino-rossi-yamaha-factor-1.webp%201200w" type="image/webp" sizes="(min-width: 650px) 700px"><source sizes="(min-width: 650px) 700px" 
srcset="https://cdn.motorsport.com/images/mgl/Y99z13AY/s200/valentino-rossi-yamaha-factor-1.jpg%20200w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s300/valentino-rossi-yamaha-factor-1.jpg%20300w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s400/valentino-rossi-yamaha-factor-1.jpg%20400w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s500/valentino-rossi-yamaha-factor-1.jpg%20500w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s600/valentino-rossi-yamaha-factor-1.jpg%20600w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s700/valentino-rossi-yamaha-factor-1.jpg%20700w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s800/valentino-rossi-yamaha-factor-1.jpg%20800w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s900/valentino-rossi-yamaha-factor-1.jpg%20900w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s1000/valentino-rossi-yamaha-factor-1.jpg%201000w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s1100/valentino-rossi-yamaha-factor-1.jpg%201100w,%20//cdn.motorsport.com/images/mgl/Y99z13AY/s1200/valentino-rossi-yamaha-factor-1.jpg%201200w" type="image/jpeg"><img height="800" loading="lazy" draggable="false" width="1200" alt="" src="https://cdn.motorsport.com/images/mgl/Y99z13AY/s1000/valentino-rossi-yamaha-factor-1.jpg"></picture> ...

@shouya Feel free to close this issue! Thanks a lot!