robertoszek / pleroma-bot

Bot for mirroring one or multiple Twitter accounts in Pleroma/Mastodon/Misskey.
https://robertoszek.github.io/pleroma-bot
MIT License

Nitter RSS: Handle over-processed links and metadata #131

Open nemobis opened 1 year ago

nemobis commented 1 year ago

Using the RSS import option with Nitter works quite well, but the resulting posts are hard to read because every hashtag is replaced by a link to the Nitter search page for it.

Also, nitter_base_url isn't applied: the links to the original post go to the URL provided by the RSS feed rather than to the original.
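For illustration, a minimal sketch of the link rewriting this seems to ask for, assuming the goal is to keep each post's path but swap the feed's host for a configured base URL. `rebase_link` is a hypothetical helper, not part of pleroma-bot, and `https://twitter.com` is only an example target.

```python
from urllib.parse import urlsplit, urlunsplit


def rebase_link(link, base_url):
    """Return `link` with its scheme and host replaced by those of `base_url`.

    Hypothetical helper for illustration; path, query and fragment are kept.
    """
    base = urlsplit(base_url)
    parts = urlsplit(link)
    return urlunsplit(
        (base.scheme, base.netloc, parts.path, parts.query, parts.fragment)
    )


feed_link = "https://nitter.lacontrevoie.fr/CopernicusEU/status/1672877701554659329#m"
print(rebase_link(feed_link, "https://twitter.com"))
```

Only the scheme and host change, so the same function would work for pointing links at a different Nitter instance.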

An example RSS feed from an instance running 2023.05.30-38985af is:

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <atom:link href="https://nitter.lacontrevoie.fr/CopernicusEU/rss" rel="self" type="application/rss+xml" />
    <title>Copernicus EU / @CopernicusEU</title>
    <link>https://nitter.lacontrevoie.fr/CopernicusEU</link>
    <description>Twitter feed for: @CopernicusEU. Generated by nitter.lacontrevoie.fr
</description>
    <language>en-us</language>
    <ttl>40</ttl>
    <image>
      <title>Copernicus EU / @CopernicusEU</title>
      <link>https://nitter.lacontrevoie.fr/CopernicusEU</link>
      <url>https://nitter.lacontrevoie.fr/pic/pbs.twimg.com%2Fprofile_images%2F1629950827925315587%2F3gnlK62Y_400x400.jpg</url>
      <width>128</width>
      <height>128</height>
    </image>
    <item>
      <title>#Copernicus for #wildfire monitoring

Our #OpenData provides high-resolution imagery to monitor fires🔥 around the world, especially in sensitive &amp; protected ecosystems

On 23 June, our #Sentinel2🇪🇺🛰️ satellite captured this fire in Mexico&apos;s Pantanos de Centla Biosphere Reserve🇲🇽</title>
      <dc:creator>@CopernicusEU</dc:creator>
      <description><![CDATA[<p><a href="https://nitter.lacontrevoie.fr/search?q=%23Copernicus">#Copernicus</a> for <a href="https://nitter.lacontrevoie.fr/search?q=%23wildfire">#wildfire</a> monitoring<br>
<br>
Our <a href="https://nitter.lacontrevoie.fr/search?q=%23OpenData">#OpenData</a> provides high-resolution imagery to monitor fires🔥 around the world, especially in sensitive &amp; protected ecosystems<br>
<br>
On 23 June, our <a href="https://nitter.lacontrevoie.fr/search?q=%23Sentinel2">#Sentinel2</a>🇪🇺🛰️ satellite captured this fire in Mexico's Pantanos de Centla Biosphere Reserve🇲🇽</p>
<img src="https://nitter.lacontrevoie.fr/pic/media%2FFzdBZyGWYAIYkDm.jpg" style="max-width:250px;" />]]></description>
      <pubDate>Sun, 25 Jun 2023 08:01:39 GMT</pubDate>
      <guid>https://nitter.lacontrevoie.fr/CopernicusEU/status/1672877701554659329#m</guid>
      <link>https://nitter.lacontrevoie.fr/CopernicusEU/status/1672877701554659329#m</link>
    </item>  </channel>
</rss>

nemobis commented 1 year ago

Example output: https://respublicae.eu/@EURLex/110603986431571007

[nitter.lacontrevoie.fr/search?q=%23OnThisDay](https://nitter.lacontrevoie.fr/search?q=%23OnThisDay)
in 1990, signature of the
https://nitter.lacontrevoie.fr/search?q=%23Schengen
Convention between 🇫🇷,🇧🇪,🇩🇪, 🇱🇺 &amp;  🇳🇱
Together with the agreement of 1985 &amp; accession agreements, it forms the
https://nitter.lacontrevoie.fr/search?q=%23SchengenAcquis
allowing over 400 million people to travel freely without border controls  ➡️
https://europa.eu/!whNGXQ

(Also shows an &amp;.)

nemobis commented 1 year ago

And for usernames, https://respublicae.eu/@EURLex/110603985985909102 :

.
https://nitter.lacontrevoie.fr/EUCouncil
has adopted a resolution on
https://nitter.lacontrevoie.fr/search?q=%23customs
cooperation in the area of
https://nitter.lacontrevoie.fr/search?q=%23lawenforcement
and its contribution to the
https://nitter.lacontrevoie.fr/search?q=%23internalsecurity
of the EU
👉
https://europa.eu/!4KHvWh
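
A possible way to undo this, sketched with only the standard library: flatten the item's description HTML back to text, keeping the visible text for hashtag and mention links and the target URL for everything else. `description_to_text` is a hypothetical helper, not pleroma-bot's code, and it assumes Nitter renders mentions as anchors whose text starts with `@` (the sample feed above only shows hashtags).

```python
from html.parser import HTMLParser


class DescriptionToText(HTMLParser):
    """Flatten an RSS item description into plain text (illustrative only)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.href = None  # set while inside an <a> element
        self.link_text = []

    def handle_starttag(self, tag, attrs):
        # <br>, <p> and <img> are ignored: the sample feed already puts a
        # literal newline after each <br>, so line breaks survive as text.
        if tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.link_text = []

    def handle_data(self, data):
        if self.href is not None:
            self.link_text.append(data)
        else:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            text = "".join(self.link_text)
            # Assumption: hashtags and mentions are recognisable by their
            # visible text; any other link keeps its full URL.
            self.parts.append(text if text.startswith(("#", "@")) else self.href)
            self.href = None


def description_to_text(html_fragment):
    parser = DescriptionToText()
    parser.feed(html_fragment)
    parser.close()
    return "".join(parser.parts).strip()


sample = (
    '<p>.<a href="https://nitter.lacontrevoie.fr/EUCouncil">@EUCouncil</a> on '
    '<a href="https://nitter.lacontrevoie.fr/search?q=%23customs">#customs</a> &amp; '
    '<a href="https://europa.eu/!4KHvWh">europa.eu/!4KHvWh</a></p>'
)
print(description_to_text(sample))
```

As a side effect, the parser decodes entities in text, so a stray `&amp;` comes out as `&`.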

nemobis commented 1 year ago

Also I'm not sure it's useful to prefix "RT by " rather than just "RT", as in https://respublicae.eu/@EURLex/110603985605674224:

RT by [@EURLex](https://respublicae.eu/@EURLex): .[@EUinNL](https://respublicae.eu/@EUinNL) [#tenders](https://respublicae.eu/tags/tenders) - Netherlands-The Hague: Security Guard and Reception/Switchboard Services for the Premises in the Netherlands - 17/07/2023 - https://europa.eu/!xcTcYg
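
If the shorter prefix is preferred, a small regular expression would be enough. This is only a sketch: `shorten_rt_prefix` is a hypothetical helper, and it assumes the feed title starts with the literal text `RT by @<user>: ` before any Markdown or mention links are generated.

```python
import re

# Assumption: retweet titles start with "RT by @<user>: ".
RT_BY = re.compile(r"^RT by @\w+:\s*")


def shorten_rt_prefix(title):
    """Replace a leading "RT by @user: " with "RT "; leave other titles alone."""
    return RT_BY.sub("RT ", title, count=1)


print(shorten_rt_prefix("RT by @EURLex: .@EUinNL #tenders - Netherlands-The Hague"))
```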

nemobis commented 1 year ago

Some of the issues were already reported at https://github.com/robertoszek/pleroma-bot/issues/122 and fixed in the test version; I will need to check again.

nemobis commented 1 year ago

The current state can be seen at https://respublicae.eu/@EURLex (using 1.1.1rc58).

nemobis commented 1 year ago

The author field can be missing too:


The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/7/federico/mastodon/bot/lib/python3.9/site-packages/pleroma_bot/cli.py", line 623, in main
    tweets_rss = user.parse_rss_feed(
  File "/home/7/federico/mastodon/bot/lib/python3.9/site-packages/pleroma_bot/_utils.py", line 850, in parse_rss_feed
    for idx, res in enumerate(
  File "/usr/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
AttributeError: object has no attribute 'author'
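
A defensive lookup would avoid this kind of crash. The sketch below assumes the feed entry is an object that raises `AttributeError` for missing fields, as the traceback suggests. `entry_author` is a hypothetical helper, and `SimpleNamespace` only stands in for a real feed entry.

```python
from types import SimpleNamespace


def entry_author(entry, default=""):
    """Return the entry's author, or `default` when the field is missing."""
    return getattr(entry, "author", default)


# Stand-ins for parsed feed entries (illustrative, not real parser output).
with_author = SimpleNamespace(title="some post", author="@CopernicusEU")
without_author = SimpleNamespace(title="some post")

print(repr(entry_author(with_author)))
print(repr(entry_author(without_author)))
```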

nemobis commented 1 year ago

Should probably also drop the HTML markup in posts like https://kolektiva.social/@pyorapajahelsinki/111443408564161235 (but this is from a Telegram feed: https://rsshub.app/telegram/channel/pyorapaja ).