milanmdev / bsky.rss

A configurable RSS poster for Bluesky
MIT License
31 stars 8 forks

Feature request : Media Embed when Image/video are not available via OpenGraph #27

Closed benborges closed 7 months ago

benborges commented 1 year ago

I have an odd RSS feed: it's built from Twitter/X users' media posts only, using RSSHub (no API usage).

The original RSS feeds look like this: https://rsshub.app/twitter/media/Defmon3/

I aggregate many OSINT users into one Inoreader folder, and the resulting feed looks like this:
RSS: https://www.inoreader.com/stream/user/1005072895/tag/OSINTbridge/
JSON: https://www.inoreader.com/stream/user/1005072895/tag/OSINTbridge/view/json
HTML view: https://www.inoreader.com/stream/user/1005072895/tag/OSINTbridge/view/html?t=OSINTbridge&cs=m&sb=y

In an Inoreader use case such as this one, the media file is in the description field of the RSS feed.

I tried to fetch this link and store it in an item inside the RSS channel, but I didn't manage to post images/videos from this field. This feed is taken from the JSON link above, and I manipulate it with n8n to construct the RSS feed the way I want. I removed the media elements from my feed so that the Bluesky bot can post using the description field safely: https://webhook.ukrainewararchive.eu/webhook/osint.rss

So I was wondering: would it be complicated, or even doable, to fall back to the media link (video or image) in the description field if there is nothing to fetch from the OpenGraph meta of the source link?

Secondary question: would the Bluesky API allow uploading media files from the enclosure element of an RSS feed? (I'm going to explore this by reading their docs, but perhaps you already have some idea about this.)
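
For context on the secondary question: in the AT Protocol, media is not referenced by URL. The bytes are first uploaded as a blob (the `com.atproto.repo.uploadBlob` endpoint), and the post record then embeds the returned blob reference. The sketch below only builds such a post record; `BlobRef` and `buildImagePost` are hypothetical names used for illustration, and the blob reference is treated as an opaque value returned by an upload step that is not shown.

```typescript
// Hypothetical stand-in for the blob reference returned by an upload call.
type BlobRef = Record<string, unknown>;

interface ImageInput {
  blob: BlobRef; // reference obtained from a prior blob upload (not shown here)
  alt: string;   // alt text for the image
}

// Build a post record with an image embed. The images embed accepts
// at most four images, so extra inputs are dropped.
function buildImagePost(text: string, images: ImageInput[], now: Date = new Date()) {
  return {
    $type: "app.bsky.feed.post",
    text,
    createdAt: now.toISOString(),
    embed: {
      $type: "app.bsky.embed.images",
      images: images.slice(0, 4).map(({ blob, alt }) => ({ image: blob, alt })),
    },
  };
}
```

So an enclosure URL from a feed would have to be downloaded and re-uploaded as a blob before it could appear in a post.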

milanmdev commented 1 year ago

In the new queue branch, I have it set to take metadata information from the RSS feed if OpenGraph data is not available. The RSS spec does have a field for images, but I don't know of a lot of feeds that use it.

I didn't fully understand, but I think this is my answer to what you were asking: It would be sort of complicated to scan the description field for media links and then feed those to Bluesky. I don't think that there are a lot of use cases where that would be needed, so it wouldn't be worth adding it. Now, you could use/make something that looks at the RSS feed, creates a new RSS feed with the <image> tag, and then feeds that to the bot, and I can add support for the image tag.

Hopefully, I understood that right. Correct me if I got anything wrong.
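
The feed-rewriting approach suggested above needs a step that pulls a media URL out of each item's description. A minimal sketch of that step follows, assuming the description is an HTML string; `firstMediaUrl` is a hypothetical helper, not part of bsky.rss, and a regular expression is only a rough substitute for a real HTML parser.

```typescript
// Hypothetical helper: return the first src found on an <img>, <video>,
// or <source> tag in an item description, or undefined if there is none.
function firstMediaUrl(descriptionHtml: string): string | undefined {
  // Requires whitespace before "src" so attributes like data-src are skipped.
  const pattern = /<(?:img|video|source)\b[^>]*?\ssrc\s*=\s*["']([^"']+)["']/i;
  const match = pattern.exec(descriptionHtml);
  if (!match) return undefined;
  // Attribute values are HTML-escaped; undo the most common escape.
  return match[1].replace(/&amp;/g, "&");
}
```

The returned URL could then be written into a dedicated item element of the rewritten feed.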

benborges commented 1 year ago

Yes, you got me right, and your proposal should work. I could reformat the RSS feed so that the media URL becomes its own element. I will work on a demo RSS feed and paste it here once I'm done.

rmdes commented 1 year ago

I tried to replicate this with this bot

The RSS feed has an enclosure tag that even includes the size of the images; I'm wondering if this could be integrated.

view-source:rappel.conso.gouv.fr/rss <enclosure url> Not sure it's very standard, but I'm seeing this in a bunch of different feeds for images, though.

I also tried to recondition this RSS feed by wrapping the enclosure image URL in an <image> tag, but my reconstructed feed had issues with the date field not being found (although I was using pubDate).

milanmdev commented 1 year ago

Can you send me the example for one of your reconstructed items in the RSS feed?

rmdes commented 1 year ago

Can you send me the example for one of your reconstructed items in the RSS feed?

Here is the link

I have read some more documentation and, apparently, the best approach for adding an image file to an item is in fact to use the image tag, not enclosure. With enclosure, to respect the spec, you're also supposed to supply the length/size, which is not necessary in the case of images. In short, the image tag is simpler to use for items you want to associate an image with, and enclosure is obviously the proper way to handle podcasts/video/audio.
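
To make the enclosure requirement concrete: RSS 2.0 gives `<enclosure>` three required attributes, `url`, `length` (size in bytes), and `type` (MIME type). For reference, RSS 2.0 defines `<image>` as a channel-level element, so an item-level image tag relies on the consuming parser accepting it. The sketch below builds an enclosure tag; `Enclosure`, `escapeXmlAttr`, and `enclosureTag` are hypothetical names chosen for illustration.

```typescript
interface Enclosure {
  url: string;    // where the media file lives
  length: number; // size in bytes
  type: string;   // MIME type, e.g. "image/jpeg"
}

// Escape characters that are not allowed raw inside an XML attribute value.
function escapeXmlAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render a self-closing <enclosure> element with all three required attributes.
function enclosureTag(e: Enclosure): string {
  return `<enclosure url="${escapeXmlAttr(e.url)}" length="${e.length}" type="${escapeXmlAttr(e.type)}" />`;
}
```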

rmdes commented 1 year ago

Check this out: I was talking to a contact about this issue for this particular RSS feed, and he went ahead and added a new config option for images. Is this how you would handle this request?

https://github.com/garaytc/bsky.rss/commit/2a4e0cb4168a4dee9cf5077b8f274a82ffb45a99

Edit: confirmed to be working.

I had to fill the config variable with media:content, like my image element.

milanmdev commented 1 year ago

Feature added to queue branch with the use of imageField: ""

rmdes commented 1 year ago

So this is working, and it works with anything (media:content on my reconstructed RSS feed, or enclosure from a vanilla RSS feed), BUT

for some reason I'm not seeing the polling/queuing system at work: it fetched new items and posted them right away, and so exceeded the limit with this error:

bsky-rss-sciencesFR  | (node:30) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 22 terminated listeners added to [Fetch]. Use emitter.setMaxListeners() to increase limit
bsky-rss-sciencesFR  | (Use `node --trace-warnings ...` to show where the warning was created)
bsky-rss-sciencesFR  | /build/node_modules/@atproto/xrpc/src/client.ts:126
bsky-rss-sciencesFR  |         throw new XRPCError(resCode, res.body.error, res.body.message)
bsky-rss-sciencesFR  |               ^
bsky-rss-sciencesFR  | XRPCError: too many concurrent writes
bsky-rss-sciencesFR  |     at ServiceClient.call (/build/node_modules/@atproto/xrpc/src/client.ts:126:15)
bsky-rss-sciencesFR  |     at processTicksAndRejections (node:internal/process/task_queues:95:5)
bsky-rss-sciencesFR  |     at async PostRecord.create (/build/node_modules/@atproto/api/src/client/index.ts:1519:17) {
bsky-rss-sciencesFR  |   status: 400,
bsky-rss-sciencesFR  |   error: 'ConcurrentWrites',
bsky-rss-sciencesFR  |   success: false
bsky-rss-sciencesFR  | }
bsky-rss-sciencesFR  | error Command failed with exit code 1.
bsky-rss-sciencesFR  | info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
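
The `ConcurrentWrites` error above is the kind of failure a queue is meant to prevent: new items are buffered, and each write finishes before the next begins. The following is a minimal sketch of that idea, not the bot's actual queue implementation; `PostQueue`, `drain`, and `runOnce` are hypothetical names.

```typescript
// Hypothetical FIFO buffer for feed items waiting to be posted.
class PostQueue<T> {
  private readonly items: T[] = [];

  enqueue(item: T): void {
    this.items.push(item);
  }

  // Remove and return up to `limit` of the oldest queued items.
  drain(limit: number): T[] {
    return this.items.splice(0, Math.max(0, limit));
  }

  get size(): number {
    return this.items.length;
  }
}

// Post one batch, awaiting each write so that only one is in flight at a time.
async function runOnce<T>(
  queue: PostQueue<T>,
  post: (item: T) => Promise<void>,
  limit: number,
): Promise<number> {
  const batch = queue.drain(limit);
  for (const item of batch) {
    await post(item);
  }
  return batch.length;
}
```

A caller could invoke `runOnce` on a timer (for example every 60 seconds), so a burst of new feed items is spread over several intervals instead of being written all at once.
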
rmdes commented 1 year ago

Just found a case, with this RSS feed, where the image does not get posted when using the imageField config, even though the image is present in the feed, and no error is generated.

I used enclosure as the image field, the same way I used it with this feed, which is working just fine; both use the enclosure tag for images.

Oh, and I double-checked: when I build my local Docker image from the current state of the queue branch, the queue system does not work as it did just before the imageField feature was merged.

milanmdev commented 1 year ago

I cannot replicate the rate-limiting issue, using https://feeds.simplecast.com/54nAGcIl (and posting all of that to Bluesky). I'll look into the imageField issue.

EDIT: Pushed a possible fix for the rate-limiting issue.
EDIT 2: Images are posting fine from the first link you posted in your comment above. My config is "imageField": "enclosure".

rmdes commented 1 year ago

Found my issue: I accidentally ran docker build -t my-docker image . inside a git repo on the main branch, called it bsky-queue, and then used this image on different bots, basically creating my own problem while testing it!

So the queue branch is fine, along with imageField; it's all working perfectly, and sites/RSS feeds that previously did not have images are now able to post with images just fine.

rmdes commented 1 year ago

Just to clarify on this: this error only happened because the bot was running on the main branch, my bad.

milanmdev commented 1 year ago

Ah, I see.

rmdes commented 1 year ago

One thing: after testing with some other use cases, I think it would be best if the fetching of the image was outside the OpenGraph loop.

can be reproduced with this feed : https://reporterre.net/spip.php?page=backend-simple

rmdes commented 1 year ago

Yes, we definitely need to change how the first image option is chosen: it should be OpenGraph first and, if that's not available, then enclosure.

Edit: I did some more digging, and the feeds where this happens do not have a native image or enclosure tag as an element; instead, they have the image as an img src inside the description of each item. That's the case for a LOT of feeds out there. I'll repackage some of my feeds to control this part and keep my bots operational, but I'm wondering if this use case could be integrated?


milanmdev commented 1 year ago

Moved the image fetching for RSS-provided images outside of the Open Graph fetching on the latest commit to the queue branch.

As for parsing for img src in descriptions of feeds, like I've stated before, that's not in-spec at all, so I don't see it making sense to add it. Personally, I've not come across many, if any, feeds that do that. If you would like to get images from the description, I'd suggest fetching the feed and rewriting the data in a standard format, or creating a PR with your suggested implementation of this.

rmdes commented 1 year ago

Alright, I will probably go the route of rewriting these RSS feeds for the feeds concerned (French & Belgian media CMSs with odd implementations of the RSS spec).

Thanks very much !

benborges commented 1 year ago

Moved the image fetching for RSS-provided images outside of the Open Graph fetching on the latest commit to the queue branch.

As for parsing for img src in descriptions of feeds, like I've stated before, that's not in-spec at all, so I don't see it making sense to add it. Personally, I've not come across many, if any, feeds that do that. If you would like to get images from the description, I'd suggest fetching the feed and rewriting the data in a standard format, or creating a PR with your suggested implementation of this.

I have tried this on a few bots, but with or without "imageField": "enclosure" I can't get it to pick up the OpenGraph image like it used to, so I end up with no images at all, even though the queue is working properly.

milanmdev commented 1 year ago

Odd, mind giving me the RSS feed you're using?

benborges commented 1 year ago
[Sat, 26 Aug 2023 19:14:02 GMT] - [bsky.rss QUEUE] Queuing item (In case anyone wanted a good laugh this morning)
/build/app/utils/rssHandler.ts:63
        let imageUrl: string = openGraphData.ogImage[0].url;
                                             ^
TypeError: Cannot read properties of undefined (reading '0')
    at FeedSub.<anonymous> (/build/app/utils/rssHandler.ts:63:46)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Node.js v18.17.0
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
yarn run v1.22.19
$ tsx ./app/index.ts
[Sat, 26 Aug 2023 19:14:22 GMT] - [bsky.rss APP] Started RSS reader. Fetching from https://www.inoreader.com/stream/user/1004571328/tag/Trump 

RSS feed

milanmdev commented 1 year ago

Should be fixed now
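
For reference, the TypeError in the trace above occurs when `ogImage` is undefined and the code indexes it with `[0]`. A guarded read might look like the sketch below. This is an illustration only, not necessarily the fix that was pushed; the `OpenGraphData` shape is inferred from the one line in the trace, and `ogImageUrl` is a hypothetical helper.

```typescript
interface OgImage {
  url: string;
}

// Shape assumed from the trace: ogImage may be missing entirely.
interface OpenGraphData {
  ogImage?: OgImage[];
}

// Return the first Open Graph image URL, or undefined when the page
// exposes no og:image (instead of throwing on ogImage[0]).
function ogImageUrl(data: OpenGraphData | undefined): string | undefined {
  return data?.ogImage?.[0]?.url;
}
```

With this shape, a page without Open Graph images yields a post without an image rather than a crash.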

benborges commented 1 year ago

Should be fixed now

No more errors, but no image embeds either, after the latest git pull and a rebuilt local Docker image.

milanmdev commented 1 year ago

Can you send me your config.json? Posts were working fine for me.

benborges commented 1 year ago

config.json for this bot

{
  "string": "$title",
  "publishEmbed": true,
  "languages": ["en"],
  "truncate": true,
  "runInterval": 60,
  "imageField": "enclosure",
  "dateField": ""
}
milanmdev commented 1 year ago

The RSS feed you provided doesn't have an enclosure field for items, and when you provide a value for imageField in the config, the application looks for that and ignores Open Graph images.

benborges commented 1 year ago

The RSS feed you provided doesn't have an enclosure field for items, and when you provide a value for imageField in the config, the application looks for that and ignores Open Graph images.

Ohh, it ignores it? I thought it was checking one or the other, OpenGraph first. OK, so if my feed isn't affected by missing OpenGraph tags, I should not be using the new imageField, correct?

milanmdev commented 1 year ago

imageField should only be used if you want to strictly take images from the RSS feed posts and never from Open Graph. If the feed doesn't consistently post images, then using Open Graph to fetch images will probably be a better option than using imageField
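
The rule described above can be sketched as a small selection function: a non-empty imageField means "RSS only", and an empty or missing one means "Open Graph only". This is an illustration of the described behavior, not the project's code; `BotConfig`, `FeedItem`, and `pickImageUrl` are hypothetical names, and the handling of object-valued fields is an assumption about how a feed parser might expose tags such as enclosure.

```typescript
interface BotConfig {
  imageField?: string; // e.g. "enclosure" or "media:content"; empty means unset
}

type FeedItem = Record<string, unknown>;

function pickImageUrl(
  config: BotConfig,
  item: FeedItem,
  ogUrl: string | undefined,
): string | undefined {
  if (config.imageField) {
    const field = item[config.imageField];
    if (typeof field === "string") return field;
    // Assumption: some parsers expose the tag as an object with a url property.
    if (typeof field === "object" && field !== null) {
      const url = (field as { url?: unknown }).url;
      if (typeof url === "string") return url;
    }
    // imageField is set: never fall back to Open Graph.
    return undefined;
  }
  return ogUrl;
}
```

Under this rule, setting "imageField": "enclosure" on a feed whose items have no enclosure produces posts without images, which matches the behavior reported above.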

benborges commented 1 year ago

Understood !

I'm testing now without imageField, using :

`image: ghcr.io/milanmdev/bsky.rss:queue-2d5e0b6`

Output

RSS feed


{
  "string": "$title",
  "publishEmbed": true,
  "languages": ["en"],
  "truncate": true,
  "runInterval": 60,
  "dateField": ""
}

Basically the same image issue as with the Trump bot here; both bots were getting images just fine until today.

benborges commented 1 year ago

Can you send me your config.json? Posts were working fine for me.

So I moved all my bots to your image and removed imageField from config.json, unless the feed is unique and has its own enclosure for images. Then I ran docker-compose up and everything is getting posted, but without images where there were images previously (before imageField was merged, I guess?).

here or here or here

for each of these, the config.json is equal to https://github.com/milanmdev/bsky.rss/issues/27#issuecomment-1694494723

milanmdev commented 1 year ago

Pushed a fix. Available in ghcr.io/milanmdev/bsky.rss:queue-002738b

benborges commented 1 year ago

Pushed a fix. Available in ghcr.io/milanmdev/bsky.rss:queue-002738b

Thanks a lot for the fixes! I redeployed with this image and it's now running properly, with images!

rmdes commented 1 year ago

I can confirm that everything is running neatly on my side as well, apart from two multi-feeds whose sources lack OpenGraph support of their own,

producing this error:


[Sun, 27 Aug 2023 11:57:18 GMT] - [bsky.rss QUEUE] Starting queue handler. Running every 60 seconds
/build/app/utils/rssHandler.ts:64
          let imageUrl: string = openGraphData.ogImage[0].url;
                                               ^
TypeError: Cannot read properties of undefined (reading '0')
    at FeedSub.<anonymous> (/build/app/utils/rssHandler.ts:64:48)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Node.js v18.17.1
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
error Command failed with exit code 1.

Which, according to @garaytc, can be fixed with this PR: https://github.com/milanmdev/bsky.rss/pull/37

rmdes commented 1 year ago

Tested @garaytc's queue branch directly, and it does fix this issue (he has yet to push it here, though).