neuland-ingolstadt / neuland.app

A free & open source, web-based replacement for the official app of the Technische Hochschule Ingolstadt built with React and Next.js.
https://neuland.app
GNU Affero General Public License v3.0

Architecture proposal for scrape services #243

Open shei99 opened 1 year ago

shei99 commented 1 year ago

The current build of the Docker image is very time-consuming, because all the scrape jobs (date-scraper, room-distances, course-downloader, spo-parser) run at build time. This is definitely not ideal when it comes to implementing a CI/CD pipeline with automatic deployment (issue #226).

The introduction of an asset server (issue #216) is a good starting point to address this. However, the Dockerfile of the asset server still executes the scrape jobs at image build time.

One solution would be to extend the scrape jobs so that they execute their task on a schedule, e.g. once per day (Python library: https://schedule.readthedocs.io/en/stable/). This creates a new requirement: neuland.app needs to be able to continuously request the latest data from the scrape jobs. I propose either a NoSQL database like Redis or MongoDB to store the JSON data of the scrape jobs, or Azure Blob Storage. Either way, I would also introduce a new universal storage component (storage-service), which saves the data to the data store. This way, not every scrape job needs to implement the saving functionality itself; instead, it sends its JSON data to an endpoint of the storage-service.

The following picture visualizes this proposal.

[Image: NeulandArchitecture]

M4GNV5 commented 1 year ago

A few thoughts:

shei99 commented 1 year ago

Regarding vendor lock-in with Azure: vendor lock-in is only a valid concern when there is a considerable financial investment in a specific cloud provider. Since we only store a couple of JSON files there and perform a handful of read and write operations per day, the financial burden is very low: a 1 GB blob with 10,000 read and write operations results in about 0.20 ct/month. Furthermore, other cloud providers also offer document storage, so moving this kind of functionality elsewhere would not be a big deal.

Regarding the software architecture: if the whole system were a microservice architecture with one service per scrape job, then each service would have its own storage management, access the storage (database or blob) directly, and additionally expose an API to access the data stored there. But that's not how the scrape jobs are implemented.

I agree that it does not make much of a difference whether we add an HTTP or a storage dependency. But the separation of concerns between scraping and storing is a benefit of the storage-service: a scrape job doesn't need to know how to store the data, it just needs to push the JSON to a REST endpoint. This makes it easier to implement new scrape jobs and hides the complexity of storing from the developers who implement the scraping part. It also prevents accidentally divergent naming conventions, e.g. for the Redis storage keys. Furthermore, neuland.app could request the data over an API of the storage node, which reduces the complexity of the backend to a scheduled API request. Lastly, if there were ever a need to change the storage technology (switching from blob to a database, or changing the database), you would only need to change the implementation in the storage node rather than in all scrape jobs. With that said, and given how the current software architecture is designed, I would highly recommend developing a storage-service.
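To make the storage-service idea concrete, here is a rough sketch of such a service. An in-memory dict stands in for Redis or Azure Blob Storage, and the `/assets/<name>` route is an invented example, not a proposed final API:

```python
# Sketch of the proposed storage-service: scrape jobs POST JSON to
# /assets/<name>, and neuland.app GETs the same path to read it back.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

STORE = {}  # maps asset name -> JSON payload; swap for Redis/blob later


class StorageHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Use the last path segment as the asset name, e.g. "date-scraper".
        name = self.path.strip("/").split("/")[-1]
        length = int(self.headers.get("Content-Length", 0))
        STORE[name] = json.loads(self.rfile.read(length))
        self.send_response(204)
        self.end_headers()

    def do_GET(self):
        name = self.path.strip("/").split("/")[-1]
        if name not in STORE:
            self.send_response(404)
            self.end_headers()
            return
        body = json.dumps(STORE[name]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("", 8080), StorageHandler).serve_forever()
```

Because the storage backend is hidden behind these two routes, swapping the dict for Redis or a blob container would only touch this service, which is exactly the argument made above.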

Robert27 commented 9 months ago

Due to the required assets in the React Native version, we should discuss the idea of the asset server again. A simpler approach might be sufficient, though.