stephensekula / navierstokes

A bridge between some social networks to improve broadcasting and sharing.

Python Twitter interface appears to yield duplicate tweets #4

Closed stephensekula closed 7 years ago

stephensekula commented 7 years ago

Since implementing the Python Twitter library in the v1.1.X tag series, I have observed that each retweet from Twitter appears to have two versions: an original extended version and a truncated version. As an example, I retweeted this:

https://twitter.com/BrookhavenLab/status/885559455030562816

which resulted in two tweets being harvested by NavierStokes from my twitter feed. The extended one looks sensible:

From Twitter: Brookhaven's first particle accelerator, the Cosmotron, was the world's highest energy proton accelerator of its day #TBT #BNL70 https://twitter.com/BrookhavenLab/status/885559455030562816/photo/1

and even has media attached. The truncated one looks different enough to fail fuzzy matching:

Retweeted Brookhaven Nat'l Lab (@BrookhavenLab): Brookhaven's first particle accelerator, the Cosmotron, was the... https://t.co/h2QkcPoERl

Look into the stream contents returned by the python twitter API class and see what is going on in the timeline.

stephensekula commented 7 years ago

More information, from NS itself. Running with debugging enabled, we can see that Python Twitter does indeed harvest two posts from the stream. Although they ostensibly share the same core content, they even have distinct message IDs returned by Twitter. I need to look more closely at the API Status object and see whether some flag is set on the second one that marks it as "informational" rather than "content". The first one I consider content; the second informational ("Steve retweeted something: brief summary...").

======================== MESSAGE OBJECT ========================
 FROM:    drsekula
 DATE:    2017 July 14 08:07:25
 ID:      885559455030562816
 SOURCE:  Twitter
 LINK:    
 REPLY?:  0
 PUBLIC?: 1
 DIRECT?: 0
 REPOST?: 1
From <a href="https://twitter.com/">Twitter</a>: Brookhaven's first particle accelerator, the Cosmotron, was the world's highest energy proton accelerator of its day #TBT #BNL70 https://t.co/7YKAJxVnXXATTACHMENTS: [u'/tmp/DEj0cZ7WsAAqRFE.jpg']

======================== MESSAGE OBJECT ========================
 FROM:    drsekula
 DATE:    2017 July 14 08:21:00
 ID:      885730830152159233
 SOURCE:  Twitter
 LINK:    
 REPLY?:  0
 PUBLIC?: 1
 DIRECT?: 0
 REPOST?: 0
Retweeted Brookhaven Nat'\''l Lab (BrookhavenLab): Brookhaven'\''s first particle accelerator, the Cosmotron, was the... failureATTACHMENTS: []

stephensekula commented 7 years ago

A bit more information. I looked directly at my Twitter stream. There is a time difference between the two tweets. This means that the first one was posted and was unique; NS then shared it across networks, eventually received that message again, took it to be new, truncated it, and posted it a second time. Fuzzy matching didn't catch this, so it leaked through. So this is probably more an NS bug than a Python Twitter bug. I can debug this step by step later. These are just observations.

The timestamp of the first (long) tweet is around 12:07 am US Central time, while the second is 12:21 am, a gap of about 14 minutes. This suggests NS was working on this message, passing it between networks and mangling it in the process, during that time. I have NS set to run every 10 minutes on a cron job, so the gap spans 2 cycles of NS.
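To illustrate why fuzzy matching can let this through, here is a minimal sketch that scores the two texts quoted earlier in this issue against each other. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; the matcher NS actually uses, and its threshold, are not shown in this thread, so the `similarity` helper is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0 and 1 for two post texts.

    Hypothetical stand-in for the fuzzy matcher in NS.
    """
    # autojunk=False stops difflib from ignoring frequent characters
    # in strings of 200 or more characters.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# The two texts harvested from the same retweet, as quoted in this issue.
extended_text = (
    "From Twitter: Brookhaven's first particle accelerator, the Cosmotron, "
    "was the world's highest energy proton accelerator of its day "
    "#TBT #BNL70 "
    "https://twitter.com/BrookhavenLab/status/885559455030562816/photo/1"
)
truncated_text = (
    "Retweeted Brookhaven Nat'l Lab (@BrookhavenLab): Brookhaven's first "
    "particle accelerator, the Cosmotron, was the... https://t.co/h2QkcPoERl"
)

# Identical texts score 1.0; the truncated copy scores well below that.
print(f"identical: {similarity(extended_text, extended_text):.2f}")
print(f"truncated: {similarity(truncated_text, extended_text):.2f}")
```

Because the truncated copy is much shorter and carries a different prefix and link, its score stays well short of a perfect match, so a strict threshold would treat it as a new message.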

stephensekula commented 7 years ago

This problem turned out to be caused by my original implementation of the python-twitter library in TwitterTools.py, which did not talk to Twitter in "extended" mode, so post text was cut off. As a result, fuzzy text matching failed on the truncated text, given where I personally have my fuzzy match threshold set. So it was a freak event that occurred while I was still finalizing the development of TwitterTools.py. With extended mode now the default way of operating (so the full text of tweets is obtained), this has not recurred.
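For reference, a rough sketch of the difference between the two modes, written against plain dictionaries rather than the python-twitter classes. It assumes the Twitter v1.1 payload shape as I understand it: `text` in the default mode, `full_text` in extended mode, and the untruncated original of a retweet under `retweeted_status`. The `best_text` helper and the sample payloads are hypothetical and are not the code in TwitterTools.py.

```python
def best_text(tweet: dict) -> str:
    """Pick the most complete text available in a tweet payload."""
    # For a retweet, the original tweet is nested under "retweeted_status".
    source = tweet.get("retweeted_status") or tweet
    # "full_text" is assumed to be present only in extended mode;
    # otherwise fall back to the possibly truncated "text" field.
    return source.get("full_text") or source.get("text", "")


# Hypothetical payloads assembled from the texts quoted in this issue.
extended_retweet = {
    "full_text": "RT @BrookhavenLab: Brookhaven's first particle accelerator...",
    "retweeted_status": {
        "full_text": (
            "Brookhaven's first particle accelerator, the Cosmotron, was the "
            "world's highest energy proton accelerator of its day "
            "#TBT #BNL70 https://t.co/7YKAJxVnXX"
        ),
    },
}
compat_tweet = {
    "text": (
        "Brookhaven's first particle accelerator, the Cosmotron, "
        "was the... https://t.co/h2QkcPoERl"
    ),
}

print(best_text(extended_retweet))  # full original text
print(best_text(compat_tweet))      # only the cut-off text is available
```

With the full text in hand, the fuzzy comparison has the whole message to work with instead of a cut-off fragment.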