vladkens / twscrape

2024! X / Twitter API scraper with authorization support. Allows you to scrape search results, user profiles (followers/following), tweets (favoriters/retweeters) and more.
https://pypi.org/project/twscrape/
MIT License

rawContent containing RT @@handle some text #74

Closed stygmate closed 1 year ago

stygmate commented 1 year ago

Sometimes rawContent contains things like RT @@handle @otherhandle [...] some text. This may be related to this code: https://github.com/vladkens/twscrape/blob/745bc59b662b72e46b0f5277e369767f2b318c06/twscrape/models.py#L237C48-L237C48

stygmate commented 1 year ago

@vladkens I don't fully understand the comment "# if login changed, old login can be cached in rawContent, so use less strict check", but the current code always produces a retweet with incorrect syntax when the text is truncated: rt_msg = f"{prefix}{rt.rawContent}" with prefix = "RT @" drops the username and the ": " separator, so it always yields things like: RT @the original message.

To me, the correct code seems to be:

--- a/twscrape/models.py    (revision 745bc59b662b72e46b0f5277e369767f2b318c06)
+++ b/twscrape/models.py    (date 1694682596100)
@@ -230,12 +230,8 @@
         # issue #42 – restore full rt text
         rt = doc.retweetedTweet
         if rt is not None and rt.user is not None and doc.rawContent.endswith("…"):
-            # prefix = f"RT @{rt.user.username}: "
-            # if login changed, old login can be cached in rawContent, so use less strict check
-            prefix = "RT @"
-
-            rt_msg = f"{prefix}{rt.rawContent}"
-            if doc.rawContent != rt_msg and doc.rawContent.startswith(prefix):
+            rt_msg = f"RT @{rt.user.username}: {rt.rawContent}"
+            if doc.rawContent != rt_msg:
                 doc.rawContent = rt_msg

         return doc
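
To make the difference concrete, here is a minimal, self-contained sketch of the logic before and after the patch. It is not the library's code: the Tweet and User dataclasses, the function names restore_rt_old and restore_rt_patched, and the sample texts are hypothetical stand-ins that only mirror the fields used in the diff (rawContent, user.username, retweetedTweet).

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for the models touched by the diff; only the
# fields the restore logic reads are included.
@dataclass
class User:
    username: str


@dataclass
class Tweet:
    rawContent: str
    user: Optional[User] = None
    retweetedTweet: Optional["Tweet"] = None


def restore_rt_old(doc: Tweet) -> Tweet:
    """Logic from the removed lines: the prefix lacks the username and ': '."""
    rt = doc.retweetedTweet
    if rt is not None and rt.user is not None and doc.rawContent.endswith("…"):
        prefix = "RT @"
        rt_msg = f"{prefix}{rt.rawContent}"
        if doc.rawContent != rt_msg and doc.rawContent.startswith(prefix):
            doc.rawContent = rt_msg
    return doc


def restore_rt_patched(doc: Tweet) -> Tweet:
    """Logic from the added lines: rebuild the full 'RT @user: text' form."""
    rt = doc.retweetedTweet
    if rt is not None and rt.user is not None and doc.rawContent.endswith("…"):
        rt_msg = f"RT @{rt.user.username}: {rt.rawContent}"
        if doc.rawContent != rt_msg:
            doc.rawContent = rt_msg
    return doc


def make_truncated_retweet() -> Tweet:
    # Made-up sample: the original tweet starts with a mention, and the
    # retweet text was truncated with an ellipsis.
    original = Tweet(
        rawContent="@otherhandle some long text that got cut",
        user=User("handle"),
    )
    return Tweet(
        rawContent="RT @handle: @otherhandle some long text…",
        user=User("retweeter"),
        retweetedTweet=original,
    )


old = restore_rt_old(make_truncated_retweet()).rawContent
new = restore_rt_patched(make_truncated_retweet()).rawContent
print(old)  # RT @@otherhandle some long text that got cut
print(new)  # RT @handle: @otherhandle some long text that got cut
```

When the original tweet itself starts with a mention, the old prefix yields the doubled "@@" from the issue title. The patched version restores the username, at the cost of using the current username even if the cached text carried an older one.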

vladkens commented 1 year ago

Patch https://github.com/vladkens/twscrape/pull/76 merged in v0.9. Thanks @stygmate