Closed rivernews closed 4 years ago
What platform should we use? Java AWS S3 SDK
Where should we start making the change in code? In App(), we should create an abstraction for data. Also, we have to consider scaling up the scraper -- that is, accepting a list of company names for search, instead of just one. But perhaps we should create another issue ticket for this.
Regarding abstraction
new ()
in App.At the end of the day, it's hard to seek out the perfect solution for data archiving. Considering cost, feasibility, and performance.
Solution | Cost | Feasibility | Performance | Scalability |
---|---|---|---|---|
S3 | Cheap for 1~10 GBs of data, but not for >100 GB | AWS Java SDK, looks handy. Overall S3 a key-value store | Over http connection, so definitely slower then local storage or local connection. But if using a cloud cronjob w/o time limit, this may be small concern | When data is large, storage: is mainly "cost", could viral up w/ all the in/outbound charge $$$. Performance: loading data in takes a lot of time - http I/O penalty + local memory load-in for computation. |
Local database (SQL: postgres, NoSQL: mongoDB, redis) |
If you think of the ideal workflow for a single company, you'll realize it's already lots of things to handle. Let's see:
We'll implement components like:
Words in bold are more abstract and needs breakdown.
Diff with current progress:
basicTimestamp
in basic parsed data, companyScrappingDuration
-- do we need this? Or even better, scraper session log.s3Bucket/companyName/basic/
+ timestamp.json
.... write file as soon as basic data scraped. More specifically in event ScrapeBasicDataFromCompanyNamePage.postAction()
. Actually, right after storing timestamp
.s3Bucket/companyName/reviews-meta/
+ timestamp.json
. Implementation in ScrapeReviewFromCompanyReviewPage.postAction()
, likewise, right after timestamp
stored. !! 🛑 this includes time scrapping review as well!!s3Bucket/companyName/reviews/
+ reviewId.json
.As the scraping mechanism is stabilized, the mission of this ticket should reach an end now.
Working branch.