Tweets get duplicated sometimes

dawnerd commented 1 year ago

I'm running this in a somewhat roundabout way, but it seems to forget that it has already processed some accounts and reports: "It seems like pleroma-bot is running for the first time for this Twitter user: Dollywood".

I'm using GitHub Actions and caching posts.json and the users directory between runs.

name: "Sync Profiles"

on:
  workflow_dispatch:
  schedule:
    - cron: '0 * * * *'

jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.9'
          cache: 'pip' # caching pip dependencies
      - name: 'Install bot'
        run: pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple -r requirements.txt
      - name: Cache posts
        id: cache-posts
        uses: actions/cache@v3
        with:
          path: posts.json
          key: posts
      - name: Cache users
        id: cache-users
        uses: actions/cache@v3
        with:
          path: users
          key: users
      - name: 'Sync Tweets'
        run: pleroma-bot --skipChecks

pleroma-bot==1.1.1rc40

Gathering tweets... 0
Gathering tweets... 0
Gathering tweets... 0
ℹ 2022-12-02 07:13:22,803 - pleroma_bot - INFO - Current pinned:    None 
ℹ 2022-12-02 07:13:22,804 - pleroma_bot - INFO - Previous pinned:   None 
ℹ 2022-12-02 07:13:24,840 - pleroma_bot - INFO - Updating profile:   <Response [200]> 
ℹ 2022-12-02 07:13:24,841 - pleroma_bot - INFO - ====================================== 
ℹ 2022-12-02 07:13:24,841 - pleroma_bot - INFO - Processing user:   109441675196353512 
⚠ 2022-12-02 07:13:27,027 - pleroma_bot - WARNING - No posts were found in the target Fediverse account (_pleroma.py:97) 

Gathering tweets... 0
Gathering tweets... 0
Gathering tweets... 0
ℹ 2022-12-02 07:13:27,367 - pleroma_bot - INFO - Current pinned:    None 
ℹ 2022-12-02 07:13:27,367 - pleroma_bot - INFO - Previous pinned:   None 
ℹ 2022-12-02 07:13:29,825 - pleroma_bot - INFO - Updating profile:   <Response [200]> 
ℹ 2022-12-02 07:13:29,826 - pleroma_bot - INFO - ====================================== 
ℹ 2022-12-02 07:13:29,826 - pleroma_bot - INFO - Processing user:   109442463652366565 
ℹ 2022-12-02 07:13:29,826 - pleroma_bot - INFO - It seems like pleroma-bot is running for the first time for this Twitter user: knotts 

Gathering tweets... 0
Gathering tweets... 0
ℹ 2022-12-02 07:13:32,598 - pleroma_bot - INFO - Current pinned:    None 
Gathering tweets... 0
ℹ 2022-12-02 07:13:32,598 - pleroma_bot - INFO - Previous pinned:   None 
ℹ 2022-12-02 07:13:34,827 - pleroma_bot - INFO - Updating profile:   <Response [200]> 
ℹ 2022-12-02 07:13:34,828 - pleroma_bot - INFO - ====================================== 
ℹ 2022-12-02 07:13:34,828 - pleroma_bot - INFO - Processing user:   109442485832516634 
ℹ 2022-12-02 07:13:34,828 - pleroma_bot - INFO - It seems like pleroma-bot is running for the first time for this Twitter user: Dollywood

In this example, Dollywood already had the same status posted: https://opencoaster.net/@Dollywood

Would love to help debug further, and if needed I can provide you access to the full config. Would verbose output help?

dawnerd commented 1 year ago

I switched to running it directly on my server, and it seems to double post only when it thinks it's running for the first time, which on the server it was. I'm guessing that even with caching, something on the GitHub Actions side is making it think it hasn't run before. Where does it store info (if at all) on what the last post was? Or is it just inferring it from the API?

robertoszek commented 1 year ago

Ah, running the bot through GitHub Actions didn't even cross my mind, nice!

I see you were running it with --skipChecks so it should skip asking for an initial date and directly get the date of the last post of the Fedi account as the start date for gathering tweets. If no posts are found in the Fedi account, it will get the last 2 days of tweets as a fallback: https://github.com/robertoszek/pleroma-bot/blob/248f65d79cb11b10df96dbcec12dc46b6c6b2020/pleroma_bot/_pleroma.py#L49
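As a rough sketch of that start-date selection (not the bot's actual code; the function name `pick_start_date` and its arguments are made up for illustration), the logic described above amounts to something like:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Fallback window described above: the last 2 days of tweets.
FALLBACK_WINDOW = timedelta(days=2)


def pick_start_date(last_fedi_post: Optional[datetime], now: datetime) -> datetime:
    """Return the date from which tweets would be gathered.

    Hypothetical helper: uses the date of the last Fediverse post when one
    exists, otherwise falls back to `now` minus two days.
    """
    if last_fedi_post is not None:
        return last_fedi_post
    return now - FALLBACK_WINDOW


now = datetime(2022, 12, 2, 7, 13, tzinfo=timezone.utc)
last_post = datetime(2022, 12, 1, 18, 0, tzinfo=timezone.utc)

print(pick_start_date(last_post, now))  # starts from the last mirrored post
print(pick_start_date(None, now))       # no posts found: starts two days back
```

If the last post date cannot be read for some reason, the two-day fallback would pick up tweets that may already have been mirrored.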

The bot also creates some folders (users/<twitter_username>) to keep track of whether it has been run before for that Twitter user.
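To illustrate why a missing users directory matters, here is a minimal sketch of a folder-based "first run" marker. It is only an assumption about the general idea, not the bot's real implementation; `is_first_run` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def is_first_run(base_dir: Path, twitter_username: str) -> bool:
    """Hypothetical check: a run counts as the first one for a user when
    users/<twitter_username> does not exist yet. The folder is created as a
    side effect so that later runs see it."""
    user_dir = base_dir / "users" / twitter_username
    if user_dir.is_dir():
        return False
    user_dir.mkdir(parents=True)
    return True


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    first = is_first_run(base, "Dollywood")   # folder missing, so "first run"
    second = is_first_run(base, "Dollywood")  # folder now exists

print(first, second)
```

With a scheme like this, a runner that starts without the users directory from the previous run would treat every run as the first one.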

So I'm thinking perhaps you ran it too often the first time? (Your cron seems to be configured to run every minute: '0 * * * *'.) Maybe your previous run didn't publish a post in time for the next cron-triggered run to get the date of the last published post on your Fediverse instance.

dawnerd commented 1 year ago

Cron should be for every hour, at least that's how it ran on GitHub, though I do wonder if it's just a timezone issue on the GitHub runners; they are distributed, after all. It does look like the users directory was cached, but there's no telling whether it was cached correctly or fully.

I'll try to dig in some more when I get time next week

robertoszek commented 1 year ago

Ah, right! My bad, I read the cron expression wrong (as usual 😅). It's actually as you said: it runs every hour (at minute 0).
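For anyone else who misreads these, a tiny sketch that counts how often a five-field cron expression fires in a day. It is a simplified matcher written for this example (only `*` and plain integers are handled), not a full cron implementation.

```python
from datetime import datetime


def matches_cron(expr: str, when: datetime) -> bool:
    """Check a datetime against a 5-field cron expression
    (minute, hour, day of month, month, day of week).

    Simplified on purpose: only '*' and plain integers are supported, and all
    fields are combined with AND. Ranges, lists and steps are not handled.
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields")
    # Cron counts Sunday as 0; isoweekday() gives Monday=1 .. Sunday=7.
    values = [when.minute, when.hour, when.day, when.month, when.isoweekday() % 7]
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))


def runs_per_day(expr: str) -> int:
    """Count matching minutes on one sample day (2022-12-02)."""
    return sum(
        matches_cron(expr, datetime(2022, 12, 2, hour, minute))
        for hour in range(24)
        for minute in range(60)
    )


hourly = runs_per_day("0 * * * *")        # minute 0 of every hour
every_minute = runs_per_day("* * * * *")  # every minute

print(hourly, every_minute)
```

The first field is the minute, so `0 * * * *` fires once per hour, while `* * * * *` is the every-minute schedule.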

So I'm wondering if the caching of the users folder is the culprit, then. I haven't delved too deep into GitHub Actions, so I can't really tell if the caching is set up as it should be. Perhaps listing the contents of a cached folder in the workflow to check what's inside is worth trying, just to double-check it's doing what you expect.
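Something along these lines could be run as an extra step before the bot to see what the cache actually restored. It is just a sketch: the `users` and `posts.json` paths come from the workflow above, while `summarize_cache` is a made-up helper name, and the file is only measured, not parsed.

```python
import json
from pathlib import Path


def summarize_cache(base_dir: Path) -> dict:
    """Report what is present in the restored cache paths.

    Only existence, entry names and file size are reported; the contents of
    posts.json are not interpreted.
    """
    users_dir = base_dir / "users"
    posts_file = base_dir / "posts.json"
    entries = []
    if users_dir.is_dir():
        entries = sorted(p.relative_to(base_dir).as_posix() for p in users_dir.rglob("*"))
    return {
        "users_dir_exists": users_dir.is_dir(),
        "user_entries": entries,
        "posts_json_exists": posts_file.is_file(),
        "posts_json_bytes": posts_file.stat().st_size if posts_file.is_file() else 0,
    }


# Run from the directory the workflow executes the bot in.
print(json.dumps(summarize_cache(Path(".")), indent=2))
```

An empty `user_entries` list or a missing posts.json on a run that should have had a warm cache would point at the caching setup rather than at the bot.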

In any case, posting your config (with any data you deem sensitive removed) and the log of the bot on verbose mode wouldn't hurt to see what's happening.

robertoszek commented 1 year ago

Oh, by the way. I've added some more checks that verify if a tweet is already mirrored on the Fediverse instance: https://github.com/robertoszek/pleroma-bot/commit/2681d0311e20d356eb2fc7244ee76a437e6b25da

They are included in 1.1.1rc42: pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple pleroma-bot==1.1.1rc42

But it also relies on the posts.json file to do it, so if the caching on GitHub Actions was your issue, it won't help.
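To show why a check like this depends on posts.json surviving between runs, here is a toy sketch. The file layout used here (a JSON object mapping tweet IDs to Fediverse post IDs) is purely an assumption for the example and may not match what the bot really writes; `load_mirrored` and `tweets_to_post` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def load_mirrored(path: Path) -> dict:
    """Load the assumed tweet-ID -> Fediverse-post-ID map.
    Returns an empty map when the file is missing (e.g. cache not restored)."""
    if not path.is_file():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def tweets_to_post(tweet_ids, mirrored):
    """Keep only the tweets that are not recorded as already mirrored."""
    return [tweet_id for tweet_id in tweet_ids if tweet_id not in mirrored]


with tempfile.TemporaryDirectory() as tmp:
    posts_file = Path(tmp) / "posts.json"
    incoming = ["111", "222"]  # made-up tweet IDs

    # Cache missing: nothing is known, so everything would be posted again.
    without_cache = tweets_to_post(incoming, load_mirrored(posts_file))

    # Cache present: the already mirrored tweet is skipped.
    posts_file.write_text(json.dumps({"111": "fedi-1"}), encoding="utf-8")
    with_cache = tweets_to_post(incoming, load_mirrored(posts_file))

print(without_cache, with_cache)
```

Without the file, the record of mirrored tweets starts out empty on each run, so a check of this kind cannot catch duplicates.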

I forgot to ask, what version are you running?

dawnerd commented 1 year ago

Oh excellent, I'll give that a shot. I'm running 1.1.1rc40 btw

dawnerd commented 1 year ago

Haven't noticed anything double post so far so I'd say your fix worked. Thanks again for looking into it.

robertoszek / pleroma-bot

Tweets get duplicated sometimes #101