rivernews / review-scraper-java-development-environment

An environment to develop review scraper

Storage | Archive: Design data storage #3

Closed rivernews closed 4 years ago

rivernews commented 4 years ago

Working branch.

rivernews commented 4 years ago

Implementation

At the end of the day, it's hard to find the perfect solution for data archiving when weighing cost, feasibility, and performance.

| Solution | Cost | Feasibility | Performance | Scalability |
| --- | --- | --- | --- | --- |
| S3 | Cheap for 1~10 GB of data, but not for >100 GB | AWS Java SDK looks handy; overall, S3 is a key-value store | Over an HTTP connection, so definitely slower than local storage or a local connection. But if using a cloud cronjob without a time limit, this may be a small concern | When data is large, storage is mainly a cost issue and could spiral up with all the inbound/outbound charges $$$. Performance: loading data in takes a lot of time (HTTP I/O penalty plus local memory load-in for computation) |
| Local database (SQL: Postgres; NoSQL: MongoDB, Redis) | | | | |


rivernews commented 4 years ago

A Proposal: one single company

If you think through the ideal workflow for a single company, you'll realize there is already a lot to handle. Let's see:

  1. Determine the company, either by name or by the exact page URL on Glassdoor.
  2. Load this information into the scraper cloud platform via environment variables. For now this can be Travis or our K8s cloud (see the sketch after this list).
  3. The first time for this company, the scraper starts running, generates data, and archives it to S3.
  4. From then on, run a cronjob for this company once a week, or once a month.
    • What about basic company info? This starts to get interesting. While existing reviews should be unchanged - so they are purely incremental - the basic info might change: size, etc., or most obviously the review count. If these change as well, we need to include a timestamp field for them, or at least bundle all the basic info together and give it a single timestamp each time we scrape.
    • We'll need to think about this when designing the schema of the data storage / archive. (We already determined the fields, though, within the scraper.)
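
For step 2, here is a minimal sketch of how the scraper could pick up the target company from environment variables. The variable names `COMPANY_NAME` and `COMPANY_PAGE_URL` are hypothetical placeholders, not confirmed names from the scraper's actual configuration:

```java
// Hypothetical sketch: resolve the target company from environment variables.
// COMPANY_NAME and COMPANY_PAGE_URL are assumed names, not the project's real config.
public final class ScraperTarget {
    public final String companyName;
    public final String companyPageUrl;

    private ScraperTarget(String companyName, String companyPageUrl) {
        this.companyName = companyName;
        this.companyPageUrl = companyPageUrl;
    }

    public static ScraperTarget fromEnvironment() {
        String name = System.getenv("COMPANY_NAME");
        String url = System.getenv("COMPANY_PAGE_URL");
        if (name == null && url == null) {
            throw new IllegalStateException(
                "Set COMPANY_NAME or COMPANY_PAGE_URL to identify the target company");
        }
        return new ScraperTarget(name, url);
    }
}
```

The same pattern works whether the values are injected by Travis environment settings or by a K8s pod spec.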

Goal: a periodic scraper for a single company

We'll implement components like:

Words in bold are more abstract and need breakdown.

rivernews commented 4 years ago

In action:

Diff with current progress:

  1. Data write implementation, using the Java AWS S3 SDK - just explorative and experimental: getting familiar with the SDK and the S3 key-value store concept. Perhaps write an abstraction over the SDK for common operations, like simulating a folder-file hierarchy.
  2. Concrete schema design for the company object. Also consider SQL vs. NoSQL - even if we are pretty much sure we want to try out NoSQL.
    • Consider what exactly the data looks like when we write it out to JSON. Think about what we haven't stored yet -- like the scraping duration for a company. The basic info also needs a timestamp field - just one is fine.
    • What else? Diff? basicTimestamp in basic parsed data, companyScrappingDuration -- do we need this? Or even better, a scraper session log. Also add a timestamp to review metadata, and add a timestamp to review data.
    • 🔥S3 directory structure (see the key-layout sketch after this list):
      • s3Bucket/companyName/basic/ + timestamp.json .... write the file as soon as the basic data is scraped. More specifically, in the event ScrapeBasicDataFromCompanyNamePage.postAction() - actually, right after storing the timestamp.
      • s3Bucket/companyName/reviews-meta/ + timestamp.json. Implemented in ScrapeReviewFromCompanyReviewPage.postAction(), likewise right after the timestamp is stored. !! 🛑 this includes the time spent scraping reviews as well!!
      • s3Bucket/companyName/reviews/ + reviewId.json.
  3. Data write implementation based on the schema.
  4. A review-id lookup / no-duplication mechanism (see the lookup sketch below).
    • We should also change the data write timing - we need to write out upon each review parsed.
  5. A cronjob, self-contained mechanism. (The trigger can be left undone, but everything else, including cronjob repetition, should be automated.)
  6. Then ... consider the next step. A web portal UI to have more control over the automation?
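
Here is a minimal sketch of the S3 key layout from item 2, assuming the AWS SDK for Java v1 (`AmazonS3ClientBuilder` / `putObject`). The bucket name and class/method names are placeholders, not the project's actual code:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.time.Instant;

// Sketch of the proposed key layout:
//   s3Bucket/companyName/basic/<timestamp>.json
//   s3Bucket/companyName/reviews-meta/<timestamp>.json
//   s3Bucket/companyName/reviews/<reviewId>.json
public class S3Archive {
    private static final String BUCKET = "review-scraper-archive"; // placeholder bucket name
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Called right after the basic-data timestamp is stored,
    // e.g. from ScrapeBasicDataFromCompanyNamePage.postAction().
    public void writeBasicData(String companyName, String json) {
        String key = companyName + "/basic/" + Instant.now().toEpochMilli() + ".json";
        s3.putObject(BUCKET, key, json);
    }

    // Called from ScrapeReviewFromCompanyReviewPage.postAction().
    public void writeReviewsMeta(String companyName, String json) {
        String key = companyName + "/reviews-meta/" + Instant.now().toEpochMilli() + ".json";
        s3.putObject(BUCKET, key, json);
    }

    public void writeReview(String companyName, String reviewId, String json) {
        String key = companyName + "/reviews/" + reviewId + ".json";
        s3.putObject(BUCKET, key, json);
    }
}
```

Since S3 has no real folders, the "folder-file diagram" is just a key-naming convention, which keeps the abstraction thin.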
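Item 4's no-duplication mechanism could then lean on the same key layout: if a review's object key already exists, skip the write. A sketch extending the hypothetical `S3Archive` class above, again assuming SDK v1 (where `doesObjectExist` is available):

```java
// Additional method on the S3Archive sketch above.
// Writes a review only if its reviewId has not been archived yet, relying on
// s3Bucket/companyName/reviews/<reviewId>.json as the canonical key.
public boolean writeReviewIfAbsent(String companyName, String reviewId, String json) {
    String key = companyName + "/reviews/" + reviewId + ".json";
    if (s3.doesObjectExist(BUCKET, key)) {
        return false; // already scraped in an earlier session
    }
    s3.putObject(BUCKET, key, json);
    return true;
}
```

This also fits the revised write timing in item 4: calling it once per parsed review makes each review write-out idempotent.
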
rivernews commented 4 years ago

As the scraping mechanism has stabilized, the mission of this ticket has come to an end.