nanos / FediFetcher

FediFetcher is a tool for Mastodon that automatically fetches missing replies and posts from other fediverse instances, and adds them to your own Mastodon instance.
https://blog.thms.uk/fedifetcher?utm_source=github
MIT License
309 stars, 230 forks

add hacky support for misskey and calckey/firefish #66

Closed ToadKing closed 1 year ago

ToadKing commented 1 year ago

Fixes #60

This implementation is not very robust: it doesn't actually do the work to detect the server type (as mentioned in #60) and just relies on fallbacks. Also, in testing it seems the url field for notes on Misskey/Firefish servers isn't actually filled in, so the URLs have to be built manually from the server and note ID, at least on the two servers I tested (misskey.io and calckey.social).
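
A minimal sketch of building the note URL by hand when the url field is empty. The function names are hypothetical, and the `/notes/<id>` path is an assumption about how Misskey/Firefish web URLs are laid out, not something confirmed in this PR:

```python
def build_note_url(server: str, note_id: str) -> str:
    """Build a web URL for a Misskey/Firefish note from its host and ID."""
    # Assumption: notes are served at https://<server>/notes/<id>.
    return f"https://{server}/notes/{note_id}"


def note_url(server: str, note: dict) -> str:
    """Prefer the URL supplied by the API; fall back to a hand-built one."""
    # note.get("url") is falsy both when the key is missing and when it is None.
    return note.get("url") or build_note_url(server, note["id"])
```

With this shape, a note that does carry a url keeps it, and only notes without one get the constructed fallback.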

nanos commented 1 year ago

Thanks so much for this @ToadKing !

I would love even hacky support for firefish/calckey!

I must be honest, though, that I'm not sure I like relying on failures and fallbacks: I already got a few comments from owners of servers that run un-supported software (e.g. WordPress with ActivityPub) that FediFetcher users are hammering their servers for no good reason. Not sure they'd look too kindly upon us hammering their servers twice as much (once to try Mastodon, once to try Firefish) 😬

I will try this branch on my instance though, to see how it goes, because I'd still love Firefish support!

ToadKing commented 1 year ago

I've already discovered a shortcoming in the changes: the notes/children API only returns the immediate child notes of the parent, not any deeper threads. The web interface seems to make an explicit call for every subsequent note to check for deeper threads, which my changes currently do not do. There will obviously have to be a limit on this, but it's something that could be done, especially since we also get a count of replies for each note, so we only need to check for deeper threads when we know they're there.
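
The depth-limited walk described above could look roughly like this. It is a sketch, not the code in this PR: `fetch_children` is a hypothetical callable standing in for the notes/children request, and `repliesCount` is assumed to be the name of the reply-count field on each note:

```python
from typing import Callable


def collect_replies(
    note_id: str,
    fetch_children: Callable[[str], list],
    max_depth: int = 3,
) -> list:
    """Return replies to note_id, descending at most max_depth levels."""
    if max_depth <= 0:
        return []
    replies = []
    for child in fetch_children(note_id):
        replies.append(child)
        # Only spend another request when the server reports replies.
        if child.get("repliesCount", 0) > 0:
            replies.extend(
                collect_replies(child["id"], fetch_children, max_depth - 1)
            )
    return replies
```

Passing the fetch function in keeps the recursion testable without network access, and `max_depth` bounds how many levels of requests a single thread can trigger.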

I've also been working on probing nodeinfo for the server software, for better API detection. It's still a work in progress and I'm not sure if that should go into this PR or a different one. The work is currently here: https://github.com/ToadKing/FediFetcher/tree/server-detection

nanos commented 1 year ago

That's quite annoying having to query recursively. Might need to limit to a few levels deep indeed.

Nice work on the API detection too!

ToadKing commented 1 year ago

So I finished hooking up the rest of the server software detection stuff and found a way to get comments at depth for FireFish. It doesn't work for Misskey, though, but I think it's good enough for now.

The one problem is there's no real automatic way to detect the API a piece of software uses so a list in the script will need to be kept up to date. This brings up a question though: Should unknown software default to using the Mastodon API or should we just throw errors when those servers are found? Right now I do the latter but I figure I'd ask what you think is best.

nanos commented 1 year ago

This is amazing! Thanks for your work @ToadKing

I agree with your approach of erroring when we don't recognise the software's API. It gives us a chance to add to the list, and in all likelihood it won't support either API in that case.
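
A sketch of the "known list, error on anything else" approach. The mapping and the names `SOFTWARE_TO_API`, `UnsupportedSoftwareError` and `api_for_nodeinfo` are hypothetical; the only thing taken as given is that a nodeinfo document reports the software under `software.name`. Grouping Calckey and Firefish with Misskey is a simplification for illustration:

```python
# Hypothetical mapping from nodeinfo software name to the API family to use.
SOFTWARE_TO_API = {
    "mastodon": "mastodon",
    "lemmy": "lemmy",
    "misskey": "misskey",
    "calckey": "misskey",
    "firefish": "misskey",
}


class UnsupportedSoftwareError(Exception):
    """Raised when a server runs software that is not in the known list."""


def api_for_nodeinfo(nodeinfo: dict) -> str:
    """Return the API family for a parsed nodeinfo document."""
    name = nodeinfo.get("software", {}).get("name", "").lower()
    try:
        return SOFTWARE_TO_API[name]
    except KeyError:
        # Fail loudly instead of guessing, so no requests are wasted on
        # servers that would reject them anyway.
        raise UnsupportedSoftwareError(
            f"unsupported software: {name or 'unknown'}"
        ) from None
```

Raising here, rather than defaulting to the Mastodon API, avoids sending requests to servers that are unlikely to answer them.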

As this is a big PR I'll have a closer look at this later, but it looks really good so far!

nanos commented 1 year ago

This is quite interesting: I had not expected the time it took for FediFetcher to run to be quite that much longer: Without these changes it takes about 2-3 minutes to run. On this branch, it takes about 8-11 minutes.

It's not a problem, but an interesting observation.

I wonder if we could cache the instance info on disk, in a future development, to speed it up a little.

Overall really solid work though @ToadKing! I intend to merge this later this week. Thank you!

ToadKing commented 1 year ago

Wow, that's odd. I would expect the extra lookups to take some extra time but not that much. However I did notice some servers (like firefish.social) take a long time to fetch the nodeinfo page and even occasionally timeout.

Is it possible to benchmark how much time is spent in get_server_info and get_nodeinfo? (I'm sure there is a way, but I'm very new to Python.) If it turns out that this actually is creating a bottleneck, it might be worthwhile caching that info, at least with an expiry date. Making sure we actually have the most up-to-date software version isn't strictly necessary right now as long as servers don't migrate to different software with different APIs, but it might become necessary in the future if we do different behavior based on software version as well.

nanos commented 1 year ago

> Is it possible to benchmark how much time is spent in get_server_info and get_nodeinfo? (I'm sure there is but I'm very new to Python.)

I've just done that:

2 min, 35 sec, out of a total runtime of 5 min 41 sec was spent on get_server_info, which probably isn't surprising, given that it requires multiple HTTP calls.
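
One way to take this kind of measurement is a small timing decorator. This is a sketch under assumptions, not necessarily how the numbers above were obtained: `timed` and `time_spent` are hypothetical names, and the `get_server_info` body below is a stand-in for the real function:

```python
import time
from collections import defaultdict
from functools import wraps

# Total wall-clock seconds spent per decorated function name.
time_spent = defaultdict(float)


def timed(func):
    """Accumulate the wall-clock time spent inside func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failed calls are counted too.
            time_spent[func.__name__] += time.perf_counter() - start
    return wrapper


@timed
def get_server_info(server):
    # Stand-in body; the real function makes several HTTP calls.
    time.sleep(0.01)
    return {"server": server}
```

Printing `time_spent` at the end of a run gives the per-function totals. Note that if one decorated function calls another, the inner time is counted in both totals.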

> Making sure we actually have the most up-to-date software version isn't strictly necessary right now as long as servers don't migrate to different software with different API

Imho that's not really something we need to cater for: servers switching software while maintaining the same host name should be a very rare exception, if for no other reason than that I don't think Mastodon itself would handle this very well...

There probably should be some limit on how long we'd cache this, but I think several weeks would be totally acceptable.

ToadKing commented 1 year ago

I didn't write the explicit rate limit checks because it looks like the get (and my new post) functions automatically handle rate limits being hit. In fact, I'm not sure the other Mastodon/Lemmy functions need those checks either because of that. Am I right in thinking that?

nanos commented 1 year ago

Oops. Yes, you are correct! My bad.
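
For illustration, a request helper that handles rate limits centrally might look like the sketch below, which is why callers would not need their own checks. This is not FediFetcher's actual get function: `get_with_rate_limit` and `send` are hypothetical, and reading a `Retry-After` header in seconds is an assumption about how the server signals the wait:

```python
import time


def get_with_rate_limit(send, url, max_retries=3, sleep=time.sleep):
    """Call send(url), waiting and retrying while the server answers 429.

    send is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(max_retries + 1):
        status, headers, body = send(url)
        # Return on success, on any other error, or once retries run out.
        if status != 429 or attempt == max_retries:
            return status, body
        # Assumption: the wait is given in seconds; default to one minute.
        sleep(float(headers.get("Retry-After", 60)))
```

Injecting `send` and `sleep` keeps the retry logic separate from the HTTP library and makes it easy to exercise without real waiting.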

nanos commented 1 year ago

As you can see I've implemented server info caching. The cache period is configurable, but defaults to 30 days.

This has brought my processing time down to 1-4 minutes again, which I'm much happier with, and if someone doesn't want to cache it, they can set --remember-hosts-for-days to 0.
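
The caching idea can be sketched as follows. This is not the implementation in this PR: `get_cached_server_info`, `cache` and `fetch` are hypothetical names, the cache is a plain in-memory dict (writing it to disk is left out), and only the 30-day default and the "0 disables caching" behaviour come from the description above:

```python
from datetime import datetime, timedelta, timezone


def get_cached_server_info(server, cache, fetch, remember_for_days=30, now=None):
    """Return server info, reusing a cached copy while it is fresh enough."""
    now = now or datetime.now(timezone.utc)
    entry = cache.get(server)
    # remember_for_days == 0 disables the cache: every call fetches again.
    if entry is not None and remember_for_days > 0:
        if now - entry["fetched_at"] < timedelta(days=remember_for_days):
            return entry["info"]
    info = fetch(server)
    cache[server] = {"info": info, "fetched_at": now}
    return info
```

Storing the fetch time next to the info lets the same cache work with any cache period, since freshness is decided at read time.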

nanos commented 1 year ago

@ToadKing do you think this is ready to merge now?

ToadKing commented 1 year ago

Yeah, looks good to me. :+1:

MrHamel commented 1 year ago

Unfortunately it doesn't look good for me on my CalcKey/FireFish instance, and I'm a bit unhappy that my issue was closed without asking me if it has actually resolved my problem.

This is also after regenerating the access token for my account. Below is my config, followed by the stream of 401 Unauthorized errors.

{
  "access-token": "(removed)",
  "server": "calckey.club",
  "home-timeline-length": 500,
  "max-followings": 160,
  "from-notifications": 1
}

2023-08-05 16:19:50.738265 UTC: Error adding url https://calckey.club/api/v2/search?q=https://mastodon.social/@MineralCup/109309164407080430&resolve=true&limit=1 to server calckey.club. Status code: 401
2023-08-05 16:19:50.996215 UTC: Error adding url https://calckey.club/api/v2/search?q=https://mastodon.social/@MineralCup/109304845932641792&resolve=true&limit=1 to server calckey.club. Status code: 401
2023-08-05 16:19:51.046937 UTC: Error adding url https://calckey.club/api/v2/search?q=https://mastodon.social/@MineralCup/109304839203772384&resolve=true&limit=1 to server calckey.club. Status code: 401
2023-08-05 16:19:51.047015 UTC: Added 0 posts for user MineralCup@mastodon.social with 3 errors

Copying and pasting a URL into a new tab while logged in yields me this:

{"error":{"message":"Credential required.","code":"CREDENTIAL_REQUIRED","id":"1384574d-a912-4b81-8601-c7b1c4085df1","kind":"client"}}

ToadKing commented 1 year ago

@MrHamel This PR was for adding the ability to fetch posts from CalcKey/Firefish instances, not to run against one. I'm sorry that I accidentally marked this PR as fixing that. I didn't realize that issue was for both cases.

MrHamel commented 1 year ago

In any case you'll be running up against this, which is very annoying to say the least.

{'Server': 'nginx/1.25.1', 'Date': 'Sat, 05 Aug 2023 17:06:51 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Content-Length': '166', 'Connection': 'keep-alive', 'Vary': 'Origin', 'strict-transport-security': 'max-age=15552000; preload', 'Cache-Control': 'private, max-age=0, must-revalidate'}
{"error":{"message":"Rate limit exceeded. Please try again in 1 minute(s).","code":"RATE_LIMIT_EXCEEDED","id":"d5826d14-3982-4d2e-8011-b9e9f02499ef","kind":"client"}}
2023-08-05 17:06:51.752850 UTC: Error adding url https://calckey.club/api/v2/search?q=https://mastodon.social/@liroyleshed/110596080603875926&resolve=true&limit=1 to server calckey.club. Status code: 401 Unauthorized

nanos commented 1 year ago

@MrHamel can you please open an issue for this, as I’ll totally forget about this otherwise. Thanks.

MrHamel commented 1 year ago

Issue #72