Function: extract information from Twitter data
import re

# Compile the target patterns once at module level so they are not
# recompiled on every call.
pattern_id = re.compile(r'"_id":"(.*?)"')
pattern_date = re.compile(r'"created_at":"(.*?)"')
pattern_author = re.compile(r'"author_id":"(.*?)"')
pattern_text = re.compile(r'"text":"(.*?)"')
pattern_location = re.compile(r'"full_name":"(.*?)"')

# Fields that cannot be found are returned as an empty string.
NONE = ''

# Extract the five fields of interest from one raw tweet string.
def extract_information(target):
    _id = pattern_id.search(target)
    date = pattern_date.search(target)
    author = pattern_author.search(target)
    content = pattern_text.search(target)
    location = pattern_location.search(target)
    return {
        "_id": _id.group(1) if _id else NONE,
        "created_at": date.group(1) if date else NONE,
        "author": author.group(1) if author else NONE,
        "text": content.group(1) if content else NONE,
        "location": location.group(1) if location else NONE,
    }
Expected outcomes: target is a string object; the function returns a dictionary with the five extracted fields, using empty strings where a field is missing.
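A rough usage sketch (the tweet string and field values below are made up for illustration only):

# Hypothetical raw tweet string, for illustration only.
sample = ('{"_id":"0000000000000000001","created_at":"2022-03-01T00:00:00.000Z",'
          '"author_id":"12345","text":"hello world","full_name":"Melbourne, Victoria"}')

result = extract_information(sample)
# result == {'_id': '0000000000000000001',
#            'created_at': '2022-03-01T00:00:00.000Z',
#            'author': '12345',
#            'text': 'hello world',
#            'location': 'Melbourne, Victoria'}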
The Mastodon Python API has fetch limitations (roughly 500 toots per 5-minute window); try building proxies with concurrency to work around this, as sketched below.
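A minimal sketch of the concurrency idea, assuming Mastodon.py; the instance URLs, access tokens, and worker count are placeholders, not the final design:

# Rough sketch: spread fetches across several Mastodon instances so each
# connection pool works against its own rate limit.
from concurrent.futures import ThreadPoolExecutor
from mastodon import Mastodon

# Placeholder instance URLs and tokens -- substitute real credentials.
INSTANCES = [
    {"api_base_url": "https://mastodon.social", "access_token": "TOKEN_1"},
    {"api_base_url": "https://mastodon.au", "access_token": "TOKEN_2"},
]

def fetch_public_toots(cfg, limit=40):
    # One client per instance; block and wait instead of erroring when throttled.
    client = Mastodon(api_base_url=cfg["api_base_url"],
                      access_token=cfg["access_token"],
                      ratelimit_method="wait")
    return client.timeline_public(limit=limit)

# Run the per-instance fetches concurrently.
with ThreadPoolExecutor(max_workers=len(INSTANCES)) as pool:
    pages = list(pool.map(fetch_public_toots, INSTANCES))

toots = [toot for page in pages for toot in page]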
Issues with Twitter processing scripts:
- The tweet ID 1491571754170421248 appears over 100 times in processing V1.
- Use re.compile before searching, so the regex string is not recompiled on every round; this improves efficiency (see the sketch after this list).
- Borrow @yyao0029's computer for coding work (remote into Sydney).
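A quick way to see the effect of precompiling, using made-up sample rows; timeit prints the measured durations rather than hard-coding any numbers:

# Sketch of the efficiency point: compile the pattern once and reuse it,
# instead of passing the raw pattern string to re.search on every row.
import re
import timeit

rows = ['{"_id":"%d","text":"sample"}' % i for i in range(10_000)]

def per_row_pattern_string():
    # re.search with a string pattern caches compilation internally,
    # but still pays a cache-lookup cost on every call.
    return [re.search(r'"_id":"(.*?)"', row) for row in rows]

PATTERN = re.compile(r'"_id":"(.*?)"')

def precompiled():
    return [PATTERN.search(row) for row in rows]

print(timeit.timeit(per_row_pattern_string, number=10))
print(timeit.timeit(precompiled, number=10))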
The processed sal.json is now available on CouchDB.
Only merge after all required tasks are done: