Web Scraper & Data Processing

rNLKJA commented 1 year ago

Only merge after all required tasks are done:

[x] Download data from Canvas
[x] De-structure Twitter data and find valuable features to store
[x] Extend the scraper with MPI capability
[x] Process data and upload to CouchDB
[ ] Auto scraper with timer; obtain 300 GB of Twitter data

rNLKJA commented 1 year ago

Function: extract information from Twitter data

import re

# compile target patterns once, at module level, so they are reused on every call
pattern_id = re.compile(r'"_id":"(.*?)"')
pattern_date = re.compile(r'"created_at":"(.*?)"')
pattern_author = re.compile(r'"author_id":"(.*?)"')
pattern_text = re.compile(r'"text":"(.*?)"')
pattern_location = re.compile(r'"full_name":"(.*?)"')

# value returned for a field whose pattern is not found
NONE = ''

# extract the five target fields from one raw record string
def extract_information(target):
    _id = pattern_id.search(target)
    date = pattern_date.search(target)
    author = pattern_author.search(target)
    content = pattern_text.search(target)
    location = pattern_location.search(target)

    return {
        "_id": _id.group(1) if _id else NONE,
        "created_at": date.group(1) if date else NONE,
        "author": author.group(1) if author else NONE,
        "text": content.group(1) if content else NONE,
        "location": location.group(1) if location else NONE,
    }

rNLKJA commented 1 year ago

Expected outcomes:

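target is a string object. The function returns a dict whose values are strings, with an empty string for any field that is not found.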
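
To illustrate with a string input, here is a minimal usage sketch of the function from the earlier comment. The `sample` record is made up for illustration and is only an assumption about what one raw line looks like; it deliberately has no `full_name` key, to show the empty-string fallback.

```python
import re

# Patterns and function repeated from the comment above so the sketch runs on its own.
pattern_id = re.compile(r'"_id":"(.*?)"')
pattern_date = re.compile(r'"created_at":"(.*?)"')
pattern_author = re.compile(r'"author_id":"(.*?)"')
pattern_text = re.compile(r'"text":"(.*?)"')
pattern_location = re.compile(r'"full_name":"(.*?)"')

NONE = ''

def extract_information(target):
    _id = pattern_id.search(target)
    date = pattern_date.search(target)
    author = pattern_author.search(target)
    content = pattern_text.search(target)
    location = pattern_location.search(target)
    return {
        "_id": _id.group(1) if _id else NONE,
        "created_at": date.group(1) if date else NONE,
        "author": author.group(1) if author else NONE,
        "text": content.group(1) if content else NONE,
        "location": location.group(1) if location else NONE,
    }

# Hypothetical record: the layout is an assumption, not the real Twitter dump format.
sample = (
    '{"_id":"123","doc":{"data":{"author_id":"42",'
    '"created_at":"2022-02-10T00:00:00.000Z","text":"hello"}}}'
)

result = extract_information(sample)
print(result)
```

Note that the patterns expect no whitespace between the key, the colon, and the value, and that the lazy `(.*?)` stops at the first double quote, so a text value containing an escaped quote (`\"`) would be cut short.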

rNLKJA commented 1 year ago

The Mastodon Python API has a fetch limit: about 500 toots per 5 minutes. Try building proxies with concurrency to work around this.
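
Whether or not proxies are used, each client still has to stay under the limit. The following is a minimal sketch of client-side throttling with a sliding window, which is a different technique from the proxy idea above. It assumes the limit really is 500 calls per 300 seconds as quoted; `SlidingWindowLimiter` is a hypothetical helper, and no Mastodon library calls are used.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_calls acquisitions in any window_seconds span."""

    def __init__(self, max_calls=500, window_seconds=300,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.window = window_seconds
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._stamps = deque()   # timestamps of calls inside the window

    def _evict(self, now):
        # Drop timestamps that have left the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()

    def wait_time(self):
        """Seconds to wait before the next call is allowed (0.0 if none)."""
        now = self._clock()
        self._evict(now)
        if len(self._stamps) < self.max_calls:
            return 0.0
        return self.window - (now - self._stamps[0])

    def acquire(self):
        """Block until a call is allowed, then record it."""
        while True:
            delay = self.wait_time()
            if delay <= 0:
                break
            self._sleep(delay)
        self._stamps.append(self._clock())
```

Calling `limiter.acquire()` before each fetch keeps a single worker under the quoted limit; with several workers or proxies, each would need its own limiter matching whatever quota applies to it.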

rNLKJA commented 1 year ago

Issues with Twitter Processing Scripts:

[x] Need modular code for reusability and easier debugging
[ ] Processed Twitter data contains duplicate values (processing logic issue), e.g. 1491571754170421248 appears over 100 times
[x] Code efficiency issue: processingV1 should call re.compile once before searching, instead of recompiling the pattern string on every round
[x] _id and author_id should be kept in string format, because int data could not be uploaded to CouchDB
[x] Timestamp issue: keep it in string format
[ ] If possible, upload the data to database during processing stage.
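
For the open duplicate issue, one possible approach is to drop repeated `_id` values while streaming records. This is only a sketch under the assumption that each processed record is a dict with an `_id` key (as returned by the extraction function); `dedupe_by_id` is a hypothetical helper, not existing project code. IDs are compared as strings, in line with the string-format items above.

```python
def dedupe_by_id(records):
    """Yield each record once, keeping the first occurrence of every _id."""
    seen = set()
    for record in records:
        key = str(record["_id"])  # compare IDs as strings
        if key in seen:
            continue              # duplicate: skip it
        seen.add(key)
        yield record

# Made-up records for illustration; the first ID is the one cited above.
records = [
    {"_id": "1491571754170421248", "text": "first"},
    {"_id": "1491571754170421248", "text": "repeat"},
    {"_id": "1", "text": "other"},
]
unique = list(dedupe_by_id(records))
print(len(unique))
```

Because it is a generator, the helper could sit between processing and upload without holding all records in memory; only the set of seen IDs grows.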

rNLKJA commented 1 year ago

Borrowing @yyao0029's computer for coding work (remote, in Sydney).

rNLKJA commented 1 year ago

Processed sal.json is now available on CouchDB.

rNLKJA / Australia-Social-Media-Analytics-on-the-Cloud

Web Scraper & Data Processing #4