student-hub / acs-upb-mobile

A mobile application for students at ACS UPB.
MIT License

Scraper scalability, caching and improvements proposal #56

Open razvanra2 opened 4 years ago

razvanra2 commented 4 years ago

What's good about the scraper:

IoanaAlexandru commented 4 years ago

@razvanra2 there's no need to manually implement a two-layer caching system. FlutterFire, the set of packages we use for the Firebase APIs, does that by default (it caches the data on-device and only updates the documents that have changed). Therefore, just storing/fetching the data from Firebase should be enough, provided we have a way to keep it updated there - maybe via a cloud function that runs regularly to check whether there is anything new on the news website. A cloud function would probably also be easy to integrate with Firebase messaging, meaning that if we do see that the news has been updated, we can send a notification to the users.
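
The "check if anything is new, then notify" step such a cloud function would run can be sketched in isolation. This is a hypothetical sketch, not code from the repo: the `NewsItem` shape and `findNewItems` name are illustrative, and the actual scraping, Firestore write, and messaging call are omitted.

```typescript
// Illustrative shape for a scraped news item; the id could be the
// article URL or a hash of it.
interface NewsItem {
  id: string;
  title: string;
}

// Return the scraped items whose ids are not yet in the cached set;
// these are the ones worth writing to the database and notifying about.
function findNewItems(scraped: NewsItem[], cachedIds: Set<string>): NewsItem[] {
  return scraped.filter((item) => !cachedIds.has(item.id));
}
```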

As for the nice-to-have features - I'd try to focus on the faculty/university-specific news (https://upb.ro/stiriupb/ or even the LSAC Facebook page); there's a pretty small subset of users who would be interested in Erasmus, so I wouldn't see that as a priority. Filtering on data sources is very easy to achieve once we have multiple scrapers, and I like the idea of a provider factory (although I'm not 100% sure it's really necessary). I wouldn't worry about sorting until we have a larger dataset to sort; the current sort by date is okay for now.

Another interesting feature, if you want to get into ML solutions, would be auto-generated tags for news - that would be pretty cool. And something that's probably simpler to do would be to show users which news items they haven't seen since they last opened the app.

@GeorgeMD what are your thoughts on this?

GeorgeMD commented 4 years ago

My thoughts on this are that we should first decide on the sources of the news. Different sources pose different challenges. Right now we only use https://acs.pub.ro/topic/noutati/, which gets updated very rarely and can be cached locally, as there's not a lot of data (just small text). For other sources (like a Facebook page) that get updated multiple times a day, we should look at a cloud function as @IoanaAlexandru says. This way we can cache the results in the database and offer everyone the cached result. The main challenge will be making sure that if user A sends a request and gets a cache MISS, then while the data is loaded and parsed, user B won't start the same process (meaning the code that parses the Facebook data and puts it in Firebase should NEVER run more than once at any given moment).
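
Within a single server process, that "never run the scrape twice at once" constraint is the classic single-flight pattern: concurrent callers share one in-flight promise instead of each starting a scrape. A minimal in-memory sketch (the `fetchOnce` name is illustrative; across multiple server instances you'd additionally need a distributed lock, e.g. a transaction on a Firestore lock document):

```typescript
// Single-flight guard: while a scrape for a given key is in flight,
// every new caller gets the same promise instead of starting another run.
const inFlight = new Map<string, Promise<string>>();

function fetchOnce(key: string, scrape: () => Promise<string>): Promise<string> {
  const existing = inFlight.get(key);
  if (existing !== undefined) {
    return existing; // user B joins user A's request instead of re-scraping
  }
  const p = scrape().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}
```

Once the promise settles, the entry is removed, so a later cache refresh can start a fresh scrape.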

As for horizontal scalability: you'll find the data you want in different places on different sites, meaning you can't reuse selectors. This project uses Providers (pretty much the factory pattern / on-demand dependency injection), meaning we register each scraper and can get an instance when we need one - no need for a scraper factory, in my opinion. Also, based on the first paragraph, scrapers may be spread between users and cloud functions, depending on where it makes more sense for them to live.
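
The register-then-look-up shape described here can be sketched as a simple registry keyed by source. This is an illustrative sketch only - `Scraper` and `ScraperRegistry` are hypothetical names, not types from the project:

```typescript
// Minimal "register each scraper, get an instance on demand" registry.
interface Scraper {
  source: string;
  fetchHeadlines(): Promise<string[]>;
}

class ScraperRegistry {
  private scrapers = new Map<string, Scraper>();

  register(s: Scraper): void {
    this.scrapers.set(s.source, s);
  }

  get(source: string): Scraper {
    const s = this.scrapers.get(source);
    if (s === undefined) throw new Error(`no scraper registered for ${source}`);
    return s;
  }
}
```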

Ideally, we should have all scrapers as cloud functions (once we are on the paid plan, this should be achievable), so that we can use Google's servers to watch the data sources and push notifications to users. For now, though, we can get away with a background service on users' devices that checks the news every 1-2 hours or less often. We can't use such a service to check more frequently, as we might use too much battery and mobile data.

Finally, I think we should consider watching the Google Sheets where marks are posted and updating users' marks based on them. This might be achievable without the paid plan, as it's a Google service talking to another Google service.
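
One low-effort route, assuming a mark sheet is shared publicly, would be fetching its CSV export and parsing rows. The column layout below is purely hypothetical (each faculty sheet will differ), and the fetch itself plus the database write are omitted - this only sketches the parsing step:

```typescript
// Hypothetical column layout: "student id,assignment,mark".
interface MarkRow {
  studentId: string;
  assignment: string;
  mark: number;
}

function parseMarksCsv(csv: string): MarkRow[] {
  return csv
    .trim()
    .split("\n")
    .slice(1) // skip the header row
    .map((line) => {
      const [studentId, assignment, mark] = line.split(",");
      return { studentId, assignment, mark: Number(mark) };
    });
}
```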

razvanra2 commented 4 years ago

I really believe scraping Facebook and/or Twitter is a challenge in and of itself, and we should dismiss it for now, even if we'd get a lot of value out of it. I strongly believe it's hard to scrape a Facebook page without the Facebook API (which we'd have to pay for). It's also arguably immoral. Arguably.

As a secondary data source for the news feed, I'd propose https://upb.ro/stiriupb/ - we're all interested in news related to the campus / whole university, not just ACS.

I agree that cloud functions (which I really enjoy working with) are the best approach. Having a cron-expression-triggered function for each data source that updates our own distributed caching system would be great. However, I think making some data scraping local and some distributed via cloud functions would be a mistake. Splitting the logic like that would cause an async mess and uneven code.
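
In Firebase, "a cron-triggered function per source" would normally be a scheduled Cloud Function, but the per-source schedule itself can be sketched standalone. The sources and intervals below are examples, not a decided configuration:

```typescript
// Each source is refreshed on its own schedule; a rarely updated site
// can be polled less often than a busy one. Intervals are illustrative.
const schedules: Record<string, number> = {
  "acs.pub.ro": 6 * 60 * 60 * 1000, // rarely updated: every 6h
  "upb.ro": 2 * 60 * 60 * 1000,     // every 2h
};

// Given each source's last refresh time, return the sources that are
// due for another scrape at `nowMs`.
function dueSources(lastRun: Record<string, number>, nowMs: number): string[] {
  return Object.keys(schedules).filter(
    (src) => nowMs - (lastRun[src] ?? 0) >= schedules[src],
  );
}
```

A single scheduled function could run this check every hour and dispatch only the scrapers that are due, which keeps the number of deployed functions small.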

My initial proposal of having the user write back to a distributed cache was based on a discussion with Ioana. To my understanding, we don't have the budget to use Cloud Storage / cloud functions at scale. I understand the concern for battery life on mobile; however, I think making the user write back to the cache is not that big a deal for our use case. My reasoning is:

To wrap it up: I still think a distributed caching system is both a good and a fun idea to implement. Using end devices to write back to the cache is not optimal, but it's what we can do for now, and if implemented correctly, it can be far less of a concern for performance and battery life than it may seem. Having multiple cloud functions would be really cool, but let's work with what we have. I'd say we should keep in mind that we might shift to a new architecture in the long run and write the code accordingly.

PS: really good idea about watching Google Sheets

PPS: I see your point on providers, and I read the PR you're about to merge before even raising this issue at all. The point I'm trying to make is that code like final newsFeedProvider = Provider.of<NewsFeedProvider>(context); is not scalable if you're going to have 7 different news feed providers with slightly different sources/specs that respond asynchronously and have to stay within some time limit to keep the app as zippy as it can be. But this is a longer discussion and I'd rather not type it out.

IoanaAlexandru commented 4 years ago

We now have access to the Firebase Blaze plan, so I think it's worth looking into how cloud functions can scale without surpassing (or while only minimally surpassing) the free quota before attempting to do things manually. While I know it's a fun exercise, we shouldn't overlook a simpler solution in favour of a more complicated (albeit exciting) one that may not even net the same results.

As for the providers, I currently see no point in having multiple ones. We can simply have different methods in our current NewsProvider fetching data from different sources. With the provider architecture in Flutter, the only reason you'd really need to make different providers is when different pages in the app require different data sources - but that's not the case for us, so a single provider should be more than enough. It boils down to a class that multiple pages can access, after all; having multiple methods in it for fetching (the same kind of) data doesn't add any overhead and doesn't mess up the architecture.
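
The app itself is Flutter/Dart, but the single-provider shape described here is language-neutral, so here is a TypeScript sketch of it. The class, method names, and stubbed data are illustrative, not the project's real API:

```typescript
// One provider, several fetch methods, one merged feed.
interface Article {
  title: string;
  date: number; // epoch millis
}

class NewsProvider {
  // Stubs standing in for the real per-source scrapers / database reads.
  async fetchAcsNews(): Promise<Article[]> {
    return [{ title: "ACS announcement", date: 2 }];
  }

  async fetchUpbNews(): Promise<Article[]> {
    return [{ title: "UPB announcement", date: 1 }];
  }

  // Pages only ever call this: all sources merged, newest first.
  async fetchAllNews(): Promise<Article[]> {
    const lists = await Promise.all([this.fetchAcsNews(), this.fetchUpbNews()]);
    return lists.flat().sort((a, b) => b.date - a.date);
  }
}
```

Adding a source is then just one more private fetch method and one more entry in the `Promise.all`, with no change to the pages consuming the provider.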

We can discuss the next steps at length in a meeting later.

razvanra2 commented 4 years ago

Did you test the Firebase Blaze plan out? I'm noticing that outbound traffic is limited to Google services. That's a 👎 for us

IoanaAlexandru commented 4 years ago

I didn't test it and honestly don't know much about it; we just activated it about an hour ago. I see here that we do have 5GB of outbound non-Google API traffic free per month.