rNLKJA / Australia-Social-Media-Analytics-on-the-Cloud

2023 S1 Cluster and Cloud Computing Assignment 2
http://172.26.130.83:3000
MIT License

Web Scraper & Data Processing #4

Closed rNLKJA closed 1 year ago

rNLKJA commented 1 year ago

Only merge after all required tasks are done:

rNLKJA commented 1 year ago

Function: extract information from Twitter data

import re

# compile target patterns
pattern_id = re.compile(r'"_id":"(.*?)"')
pattern_date = re.compile(r'"created_at":"(.*?)"')
pattern_author = re.compile(r'"author_id":"(.*?)"')
pattern_text = re.compile(r'"text":"(.*?)"')
pattern_location = re.compile(r'"full_name":"(.*?)"')

# value returned for a field when its pattern is not found
NONE = ''

# define extraction function
def extract_information(target):
    _id = pattern_id.search(target)
    date = pattern_date.search(target)
    author = pattern_author.search(target)
    content = pattern_text.search(target)
    location = pattern_location.search(target)

    return {
        "_id": _id.group(1) if _id else NONE,
        "created_at": date.group(1) if date else NONE,
        "author": author.group(1) if author else NONE,
        "text": content.group(1) if content else NONE,
        "location": location.group(1) if location else NONE,
    }

rNLKJA commented 1 year ago

Expected outcomes:

image

target is a string object.
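
As a rough illustration of the expected behaviour, the sketch below restates the extraction function in compact form and runs it on one input string. The sample record is hypothetical (made up for this example, not taken from the real dataset), and the `PATTERNS` dictionary is just a condensed rewrite of the five compiled patterns above.

```python
import re

# Condensed form of the patterns above: output field -> compiled regex.
PATTERNS = {
    "_id": re.compile(r'"_id":"(.*?)"'),
    "created_at": re.compile(r'"created_at":"(.*?)"'),
    "author": re.compile(r'"author_id":"(.*?)"'),
    "text": re.compile(r'"text":"(.*?)"'),
    "location": re.compile(r'"full_name":"(.*?)"'),
}

NONE = ''  # returned when a field is missing


def extract_information(target):
    """Return the five fields found in `target`, or '' for missing ones."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(target)
        result[field] = match.group(1) if match else NONE
    return result


# Hypothetical input line: a raw string, not a parsed JSON object.
sample = ('{"_id":"1","created_at":"2023-05-01T00:00:00.000Z",'
          '"author_id":"42","text":"hello world"}')

record = extract_information(sample)
print(record["author"])          # 42
print(repr(record["location"]))  # '' because no "full_name" key is present
```

Because the patterns stop at the first double quote, a tweet whose text contains an escaped quote (`\"`) would be cut short; that case is not handled here.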

rNLKJA commented 1 year ago

The Mastodon Python API has a fetch limit: about 500 toots per 5 minutes. Try building proxies with concurrency to work around this.
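
One possible way to stay inside that limit is to track a sliding-window budget per client (or per proxy) and wait when the budget is used up. The sketch below is only an assumption about how this could be done: `FetchBudget` is a hypothetical helper, it makes no Mastodon API calls, and the 500 / 300-second defaults simply mirror the figures quoted above.

```python
from collections import deque
import time


class FetchBudget:
    """Sliding-window budget: at most `limit` items per `window` seconds."""

    def __init__(self, limit=500, window=300.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock          # injectable so the class can be tested
        self._events = deque()      # (timestamp, item_count) pairs

    def _used(self, now):
        # Drop fetches that have aged out of the window, then sum the rest.
        while self._events and now - self._events[0][0] >= self.window:
            self._events.popleft()
        return sum(count for _, count in self._events)

    def wait_time(self, count=1):
        """Seconds to wait before `count` more items fit in the budget."""
        if count > self.limit:
            raise ValueError("count exceeds the whole budget")
        now = self.clock()
        used = self._used(now)
        if used + count <= self.limit:
            return 0.0
        # Walk the oldest fetches until enough budget would be released.
        for stamp, n in self._events:
            used -= n
            if used + count <= self.limit:
                return max(0.0, stamp + self.window - now)
        return self.window

    def record(self, count=1):
        """Note that `count` items were just fetched."""
        self._events.append((self.clock(), count))
```

A fetch loop would call `time.sleep(budget.wait_time(n))` before requesting `n` toots and `budget.record(n)` afterwards. With several proxies, each worker could hold its own `FetchBudget` instance so that the budgets are tracked independently.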

rNLKJA commented 1 year ago

Issues with Twitter Processing Scripts:

rNLKJA commented 1 year ago

Borrow @yyao0029's computer for coding work (remote, in Sydney).

rNLKJA commented 1 year ago

image

The processed sal.json is now available on CouchDB.