rivernews / review-scraper-java-development-environment

An environment to develop review scraper

Storage | Archive: Design data storage #3

Closed rivernews closed 4 years ago

rivernews commented 4 years ago

Working branch.

rivernews commented 4 years ago

Implementation

At the end of the day, it's hard to find the perfect solution for data archiving when weighing cost, feasibility, and performance.

| Solution | Cost | Feasibility | Performance | Scalability |
| --- | --- | --- | --- | --- |
| S3 | Cheap for 1~10 GB of data, but not for >100 GB | AWS Java SDK looks handy; overall, S3 is a key-value store | Over an HTTP connection, so definitely slower than local storage or a local connection. But if using a cloud cronjob without a time limit, this may be a small concern | When data is large, storage is mainly a cost issue and could spiral up with all the inbound/outbound charges $$$. Performance: loading data in takes a lot of time (HTTP I/O penalty plus local memory load-in for computation) |
| Local database (SQL: Postgres; NoSQL: MongoDB, Redis) | | | | |


rivernews commented 4 years ago

A Proposal: one single company

If you think through the ideal workflow for a single company, you'll realize there is already a lot to handle. Let's see:

  1. Determine the company, either by name or by the exact page URL on Glassdoor.
  2. Load this information into the scraper cloud platform via environment variables. For now this can be Travis or our K8s cloud (see the sketch after this list).
  3. The first time for this company, the scraper starts running, generates data, and archives it to S3.
  4. From then on, run a cronjob for this company once a week, or once a month.
    • What about basic company info? This starts to get interesting. While existing reviews should be unchanged - so they are purely incremental - the basic info might change: size, etc., or most obviously the review count. If these change as well, we need to include a timestamp field for them, or at least bundle all the basic info together and give it a single timestamp each time we scrape.
    • We'll need to think about this when designing the schema of the data storage / archive. (We already determined the fields, though, within the scraper.)
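
For step 2, here is a minimal sketch of how the scraper could pick up the target company from environment variables. The variable names `COMPANY_NAME` and `COMPANY_PAGE_URL` are hypothetical placeholders, not confirmed names from the scraper's actual configuration:

```java
// Hypothetical sketch: resolve the target company from environment variables.
// COMPANY_NAME and COMPANY_PAGE_URL are assumed names, not the project's real config.
public final class ScraperTarget {
    public final String companyName;
    public final String companyPageUrl;

    private ScraperTarget(String companyName, String companyPageUrl) {
        this.companyName = companyName;
        this.companyPageUrl = companyPageUrl;
    }

    public static ScraperTarget fromEnvironment() {
        String name = System.getenv("COMPANY_NAME");
        String url = System.getenv("COMPANY_PAGE_URL");
        if (name == null && url == null) {
            throw new IllegalStateException(
                "Set COMPANY_NAME or COMPANY_PAGE_URL to identify the target company");
        }
        return new ScraperTarget(name, url);
    }
}
```

The same pattern works whether the values are injected by Travis environment settings or by a K8s pod spec.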

Goal: a periodic scraper for a single company

We'll implement components like:

Words in bold are more abstract and need breakdown.

rivernews commented 4 years ago

In action:

Diff with current progress:

  1. Data write implementation, using the Java AWS S3 SDK - just explorative and experimental: getting familiar with the SDK and the S3 key-value store concept. Perhaps write an abstraction over the SDK for common operations, like simulating a folder-file hierarchy.
  2. Concrete schema design for the company object. Also consider SQL vs. NoSQL - even if we are pretty much sure we want to try out NoSQL.
    • Consider what exactly the data looks like when we write it out to JSON. Think about what we haven't stored yet -- like the scraping duration for a company. The basic info also needs a timestamp field - just one is fine.
    • What else? Diff? basicTimestamp in basic parsed data, companyScrappingDuration -- do we need this? Or even better, a scraper session log. Also add a timestamp to review metadata, and add a timestamp to review data.
    • 🔥S3 directory structure (see the key-layout sketch after this list):
      • s3Bucket/companyName/basic/ + timestamp.json .... write the file as soon as the basic data is scraped. More specifically, in the event ScrapeBasicDataFromCompanyNamePage.postAction() - actually, right after storing the timestamp.
      • s3Bucket/companyName/reviews-meta/ + timestamp.json. Implemented in ScrapeReviewFromCompanyReviewPage.postAction(), likewise right after the timestamp is stored. !! 🛑 this includes the time spent scraping reviews as well!!
      • s3Bucket/companyName/reviews/ + reviewId.json.
  3. Data write implementation based on the schema.
  4. A review-id lookup / no-duplication mechanism (see the lookup sketch below).
    • We should also change the data write timing - we need to write out upon each review parsed.
  5. A cronjob, self-contained mechanism. (The trigger can be left undone, but everything else, including cronjob repetition, should be automated.)
  6. Then ... consider the next step. A web portal UI to have more control over the automation?
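
Here is a minimal sketch of the S3 key layout from item 2, assuming the AWS SDK for Java v1 (`AmazonS3ClientBuilder` / `putObject`). The bucket name and class/method names are placeholders, not the project's actual code:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.time.Instant;

// Sketch of the proposed key layout:
//   s3Bucket/companyName/basic/<timestamp>.json
//   s3Bucket/companyName/reviews-meta/<timestamp>.json
//   s3Bucket/companyName/reviews/<reviewId>.json
public class S3Archive {
    private static final String BUCKET = "review-scraper-archive"; // placeholder bucket name
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Called right after the basic-data timestamp is stored,
    // e.g. from ScrapeBasicDataFromCompanyNamePage.postAction().
    public void writeBasicData(String companyName, String json) {
        String key = companyName + "/basic/" + Instant.now().toEpochMilli() + ".json";
        s3.putObject(BUCKET, key, json);
    }

    // Called from ScrapeReviewFromCompanyReviewPage.postAction().
    public void writeReviewsMeta(String companyName, String json) {
        String key = companyName + "/reviews-meta/" + Instant.now().toEpochMilli() + ".json";
        s3.putObject(BUCKET, key, json);
    }

    public void writeReview(String companyName, String reviewId, String json) {
        String key = companyName + "/reviews/" + reviewId + ".json";
        s3.putObject(BUCKET, key, json);
    }
}
```

Since S3 has no real folders, the "folder-file diagram" is just a key-naming convention, which keeps the abstraction thin.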
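Item 4's no-duplication mechanism could then lean on the same key layout: if a review's object key already exists, skip the write. A sketch extending the hypothetical `S3Archive` class above, again assuming SDK v1 (where `doesObjectExist` is available):

```java
// Additional method on the S3Archive sketch above.
// Writes a review only if its reviewId has not been archived yet, relying on
// s3Bucket/companyName/reviews/<reviewId>.json as the canonical key.
public boolean writeReviewIfAbsent(String companyName, String reviewId, String json) {
    String key = companyName + "/reviews/" + reviewId + ".json";
    if (s3.doesObjectExist(BUCKET, key)) {
        return false; // already scraped in an earlier session
    }
    s3.putObject(BUCKET, key, json);
    return true;
}
```

This also fits the revised write timing in item 4: calling it once per parsed review makes each review write-out idempotent.
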
rivernews commented 4 years ago

As the scraping mechanism has stabilized, the mission of this ticket has come to an end.