Scraper scalability, caching and improvements proposal

razvanra2 commented 4 years ago

What's good about the scraper:

It exists and it's written fine
The idea is solid, having a news feed is fire

What's bad about the scraper:

No vertical scalability -> one user makes one call per refresh -> if we have 800 users refreshing at the same time, the UPB webpage will be loaded 800 times
No horizontal scalability -> what if we think of implementing 7 other scrapers? How do they fit in?

Proposal:

Build a scalable, cacheable solution

Implementation details:

Vertical scalability:

Build a 2-layer cache system for the news feed:
  ○ When a user refreshes their news feed, the app first checks the local (L1) cache to see if a similar request has been made in the past k seconds. If yes -> cache hit: retrieve the data from local, on-device storage and load it. If not -> check the level 2 (L2) cache. On an L2 cache hit, retrieve the data and write it to the L1 cache with the appropriate timestamp. On an L2 cache miss, fetch the data from the website, write it to L2, then L1, then show it.
  ○ L1 can be implemented in a variety of ways; as a PoC I suggest at least storing the JSON locally and replacing it each time with the newest known data.
  ○ L2 can be implemented as cloud storage, table storage or a database.
  ○ Checking L1 against L2 can be done via an MD5 hash or a similar solution.

Horizontal scalability:

Implement a newsFeedProvider factory to spawn news feed providers
Implement an observer that keeps track of asynchronous data fetches and either interrupts them when they hang or accepts the new data and pushes it onto a pipeline handler that delivers it to the ListView

Nice to have:

A 2nd scraper to test concurrency issues (maybe a similar news site or an international students' news site, e.g. the Erasmus page?)
Filtering on data sources
Different sorting features, with a "recommended" option that uses some sort of basic ML/AI (maybe word bagging) to promote more relevant information first
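For illustration, the L1/L2 lookup order described above could be sketched roughly as follows. This is written in Python only for brevity (the app itself is Flutter/Dart), and `TtlCache`, `load_news` and the injected `scrape` callable are hypothetical names, not code from the repository. Both layers are plain in-memory dictionaries here; a real L2 would be cloud or database storage.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """A tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, list]] = {}

    def get(self, key: str) -> Optional[list]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            return None  # entry is older than the TTL: treat it as a miss
        return value

    def put(self, key: str, value: list) -> None:
        self._entries[key] = (self._clock(), value)


def load_news(source: str, l1: TtlCache, l2: TtlCache,
              scrape: Callable[[str], list]) -> list:
    """Check L1, then L2, and scrape the website only when both miss."""
    news = l1.get(source)
    if news is not None:
        return news              # L1 hit: nothing leaves the device
    news = l2.get(source)
    if news is None:
        news = scrape(source)    # L2 miss: one request to the website
        l2.put(source, news)
    l1.put(source, news)         # refresh L1 with a new timestamp
    return news
```

With one shared `l2` and one `l1` per device, only the first refresh inside the L2 time window reaches the website; every other device is served from the shared layer.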

IoanaAlexandru commented 4 years ago

@razvanra2 there's no need to manually implement a 2-layer caching system. FlutterFire, the set of packages we use for the Firebase APIs, does that by default (it caches the data on-device and only updates the documents that have changed). Therefore, just storing/fetching the data from Firebase should be enough, provided we have a way to keep it updated there, maybe via a cloud function that runs regularly to check if there is anything new on the news website. A cloud function would probably also be easy to integrate with Firebase messaging, meaning that if we see that the news has been updated, we can send a notification to the users.
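As a rough sketch of what such a periodic check could do, the snippet below compares a fingerprint of the freshly scraped items against the stored one and only writes and notifies when something changed. It is plain Python rather than an actual Firebase cloud function, and everything in it is an assumption made for illustration: `store` is an ordinary dictionary standing in for the database, `notify` stands in for the messaging call, and items are assumed to be dictionaries with a `url` key. SHA-256 is used here in place of the MD5 hash suggested earlier in the thread; any stable hash would do for change detection.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List


def fingerprint(items: List[dict]) -> str:
    """Hash a canonical JSON form so equal content always gives equal digests."""
    canonical = json.dumps(items, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sync_news(scraped: List[dict], store: Dict[str, Any],
              notify: Callable[[List[dict]], None]) -> bool:
    """Update the store if the scraped news changed; return True when it did."""
    new_digest = fingerprint(scraped)
    if store.get("digest") == new_digest:
        return False  # nothing changed since the last run: no write, no notification

    # Items whose URL was not stored before are the ones worth announcing.
    known_urls = {item["url"] for item in store.get("items", [])}
    fresh = [item for item in scraped if item["url"] not in known_urls]

    store["digest"] = new_digest
    store["items"] = scraped
    if fresh:
        notify(fresh)
    return True
```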

As for the nice-to-have features: I'd try to focus on the faculty/university-specific news (https://upb.ro/stiriupb/ or even the LSAC Facebook page). There's a pretty small subset of users who would be interested in Erasmus, so I wouldn't see that as a priority. Filtering on data sources is very easy to achieve once we have multiple scrapers, and I like the idea of a provider factory (although I'm not 100% sure if it's really necessary). I wouldn't worry about sorting until we have a larger dataset to sort; the current sort by date is okay for now.

Another interesting feature, if you wanna get into ML solutions, would be to have auto-generated tags for news; that would be pretty cool. And something that's probably simpler to do would be to show users which news they haven't seen since they last opened the app.

@GeorgeMD what are your thoughts on this?

GeorgeMD commented 4 years ago

My thoughts on this are that we should first decide the sources of the news. Different sources pose different challenges. Right now we only use https://acs.pub.ro/topic/noutati/, which gets updated very rarely and can be cached locally as there's not a lot of data (just small text). For other sources (like a Facebook page) that get updated multiple times a day, we should look at a cloud function, as @IoanaAlexandru says. This way we can cache the results in the database and offer everyone the cached result. The main challenge will be making sure that if user A sends a request and gets a cache MISS, user B won't start the same process while the data is being loaded and parsed (meaning the code that parses the Facebook data and puts it in Firebase should NEVER run more than once at any given moment).
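The "never run more than once at a time" requirement is essentially a single-flight load. A minimal in-process sketch, assuming a hypothetical `SingleFlightCache` class and an injected `load` callable, could look like the following. Note that a thread lock only protects one process; across many devices or cloud function instances the same idea would need something like a database transaction or lease instead, which is not shown here. Expiry is also left out to keep the sketch short.

```python
import threading
from typing import Callable, Optional


class SingleFlightCache:
    """Caches one value and guarantees the loader runs at most once at a time."""

    def __init__(self, load: Callable[[], list]):
        self._load = load
        self._lock = threading.Lock()
        self._value: Optional[list] = None

    def get(self) -> list:
        if self._value is not None:
            return self._value  # cache hit: no locking needed
        with self._lock:
            # Re-check after acquiring the lock: another caller may have
            # finished loading while this one was waiting.
            if self._value is None:
                self._value = self._load()
            return self._value
```

If user A triggers the load, user B blocks on the lock and then receives the value A produced, instead of starting a second scrape.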

As for horizontal scalability: you'll find the data you want in different places on different sites, meaning you can't reuse selectors. This project uses Providers (pretty much the factory pattern / on-demand dependency injection), meaning we register each scraper and can get an instance when we need one, so there's no need for a scraper factory in my opinion. Also, based on the first paragraph, scrapers may be spread between users and cloud functions, depending on where it makes more sense for them to live.

Ideally, we should have all scrapers as cloud functions (once we are on the paid plan, this should be achievable), so that we can use Google servers to watch the data sources and push notifications to users, but for now we can get away with a background service on users' devices that checks the news every 1-2 hours or less often. We can't use such a service to check more frequently, as we may use too much battery and mobile data.

Finally, I think we should consider watching the Google Sheets where marks are posted and updating users' marks based on that. This might be achievable without the paid plan, as it's a Google service talking to another Google service.

razvanra2 commented 4 years ago

I really believe scraping Facebook and/or Twitter is a challenge in and of itself and we should dismiss it for now, even if we'd get a lot of value out of it. I strongly believe it's hard to scrape a Facebook page without a Facebook API (that we'd have to pay for). It's also arguably immoral. Arguably.

As a secondary data source for the news feed I'd propose https://upb.ro/stiriupb/ - we're all interested in news related to the campus / whole university, not just acs.

I agree that cloud functions (which I really enjoy working with) are the best approach. Having a cron-expression-triggered function for each data source that updates our own distributed caching system would be great. However, I think that making some data scraping local and some distributed via a cloud function would be a mistake. Splitting the logic like that would cause an async mess and uneven code.

My initial proposal of having the user write back to a distributed cache was based on a discussion with Ioana. To my understanding, we don't have the budget to use cloud storage / cloud functions, at scale. I understand the concern for battery life on mobile, however, I think making the user write back to the cache is not that big a deal, for our use case. My reasoning is:

1. Generally, in a caching system, you have cache hits more often than cache misses. This means that for a large enough active user base (> 10 users, let's say, for us), the phone will mostly just retrieve local data (or distributed data that should get to the device faster), causing faster load times, less stress on the battery etc.
2. If writing to the local cache is already abstracted for us, then I'd say it's probably done optimally. Writing to the local cache then shouldn't be taxing for the battery. Reading from it should be far less taxing than making a GET.
3. Reading/writing to the distributed cache: reading should be fine; worst case scenario it's as bad as a GET from the website itself or from the crawler (as it's implemented now in the PR). Writing is the culprit, as I understand the point that you're making. However, writing to the distributed cache is an edge case that happens rarely (see point 1) on a single user's device. Plus, if we're using native technologies, I'd bet it'd be optimal and a fair trade-off in the long run.

To wrap it up: I still think a distributed caching system is both a good and a fun idea to implement. Using end-devices to write back to cache is not optimal, but it's what we can do for now and if implemented correctly, it can be far less of a concern on performance and battery life than it may seem. Having multiple cloud functions would be really cool, but let's work with what we have. I'd say we should keep the mindset that we might shift to a new architecture in the long run and write the code with that in mind.

PS: really good idea about watching Google Sheets.

PPS: I see your point on providers, and I read the PR you're about to merge before even raising this issue. The point I'm trying to make is that stuff like final newsFeedProvider = Provider.of<NewsFeedProvider>(context); is not scalable if you're going to have 7 different news feed providers with slightly different sources / specs that respond asynchronously and have to stay within some time limit to ensure the app is as zippy as it can be. But this is a longer discussion and I'd rather not type it out.
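To make the concern about several asynchronous sources with a time limit concrete, here is one possible shape for it, sketched in Python with `asyncio` rather than in Dart. The function name `fetch_all`, the source names and the choice to silently drop slow or failing sources are all assumptions made for this illustration; a real app would at least log the failures.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Fetcher = Callable[[], Awaitable[List[str]]]


async def fetch_all(sources: Dict[str, Fetcher],
                    timeout: float) -> Dict[str, List[str]]:
    """Query every source concurrently; keep only those that answer in time."""

    async def guarded(name: str, fetch: Fetcher):
        try:
            # wait_for cancels the fetch if it exceeds the time limit.
            return name, await asyncio.wait_for(fetch(), timeout)
        except Exception:
            # Timed out or failed: report "no result" for this source so that
            # one bad scraper cannot block or break the whole feed.
            return name, None

    pairs = await asyncio.gather(*(guarded(n, f) for n, f in sources.items()))
    return {name: items for name, items in pairs if items is not None}
```

The total waiting time is bounded by `timeout` regardless of how many sources are registered, because they are awaited concurrently rather than one after another.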

IoanaAlexandru commented 4 years ago

We now have access to the Firebase Blaze plan, so I think it's worth looking into how cloud functions can scale without surpassing the free quota (or only minimally surpassing it) before attempting to do things manually. While I know it's a fun exercise, we shouldn't overlook a simpler solution in favour of a more complicated (albeit exciting) one which may not even net the same results.

As for the providers, I currently see no point in having multiple ones. We can simply have different methods in our current NewsProvider fetching data from different sources. With the provider architecture in Flutter, the only reason you may really need to make different providers is when different pages in the app require different data sources, but that's not the case for us, so a single provider should be more than enough. It boils down to a class that multiple pages can access, after all. Having multiple methods in it for fetching (the same kind of) data doesn't add any overhead and doesn't mess up the architecture.
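A rough sketch of the "one provider, several fetch methods" idea, in Python rather than Dart and with hypothetical names throughout (`NewsItem`, `fetch_acs_news`, `fetch_upb_news`, `fetch_news_feed` and the injected scraper callables are not taken from the project):

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class NewsItem:
    title: str
    published: date
    source: str


class NewsProvider:
    """A single provider exposing one fetch method per data source."""

    def __init__(self, scrape_acs: Callable[[], List[NewsItem]],
                 scrape_upb: Callable[[], List[NewsItem]]):
        self._scrape_acs = scrape_acs
        self._scrape_upb = scrape_upb

    def fetch_acs_news(self) -> List[NewsItem]:
        return self._scrape_acs()

    def fetch_upb_news(self) -> List[NewsItem]:
        return self._scrape_upb()

    def fetch_news_feed(self, sources: Optional[Iterable[str]] = None) -> List[NewsItem]:
        """Merge all sources, optionally filter by source, newest first."""
        items = self.fetch_acs_news() + self.fetch_upb_news()
        if sources is not None:
            wanted = set(sources)
            items = [item for item in items if item.source in wanted]
        return sorted(items, key=lambda item: item.published, reverse=True)
```

Filtering by data source and the current sort by date both fit into the one merged method, so pages only ever need to ask a single object for the feed.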

We can discuss the next steps at length in a meeting later.

razvanra2 commented 4 years ago

Did you test the Firebase Blaze plan out? I'm noticing outbound traffic being limited to Google services. That's a 👎 for us

IoanaAlexandru commented 4 years ago

I didn't test it and don't know much about it honestly, we just activated it like an hour ago. I see here we do have 5GB free of outbound non-Google API traffic per month.

student-hub / acs-upb-mobile

Scraper scalability, caching and improvements proposal #56