public-transport / transitous

Free and open public transport routing.
https://transitous.org

Run import on separate machine #9

Open jbruechert opened 8 months ago

jbruechert commented 8 months ago

The import of OpenStreetMap data needs a large amount of RAM and CPU time. We should run it on a separate server so that it does not degrade the performance of the main server.

Maybe Spline has enough resources to host a separate import VM, but I'm not sure. The import VM needs ~128 GB RAM, and ideally ~200 GB SSD storage.

I would like to run the import publicly visible, for example on GitHub Actions, in order to enable more people to fix things without having access to the server.

vkrause commented 8 months ago

The costly things here all seem to be related to OSM data, not GTFS. Some of these are even services that OSM itself hosts with global coverage, so it is probably worth discussing with the OSM team how they manage that. The best-case scenario is incremental updates, but they are not available for everything and usually come with other costs (e.g. for maps.kde.org we pay with 2 TB of NVMe storage for a daily update of the full planet database that takes just 90 seconds). Maybe there's a more manageable middle ground that doesn't need 4x the PBF size in RAM.
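To make the "4x the PBF size in RAM" rule of thumb concrete, here is a minimal sketch of the arithmetic. The factor of 4 is taken from the comment above and treated as an assumption, not a measured value; the function names are made up for illustration.

```python
def estimated_import_ram_gb(pbf_size_gb: float, ram_factor: float = 4.0) -> float:
    """Rough peak RAM estimate: extract size times the assumed factor."""
    return pbf_size_gb * ram_factor


def largest_extract_gb(available_ram_gb: float, ram_factor: float = 4.0) -> float:
    """Largest PBF extract that would fit under the same assumption."""
    return available_ram_gb / ram_factor


if __name__ == "__main__":
    # With the ~128 GB mentioned in this issue and the assumed factor of 4
    print(largest_extract_gb(128))
```

Under that assumption, a 128 GB machine would only handle extracts up to roughly 32 GB, which is why a lower memory factor matters more than a bigger server.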

exmatrikulator commented 8 months ago

A CI/CD import pipeline would be great. I can imagine spinning up a Hetzner VM, running the import, rsyncing the output to the server, and paying only for the hours used. But wouldn't it be enough (for the moment) to have an imported Europe map? We can set up an update script later. No one will blame you for a background map that is a few weeks out of date ;)
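A rough sketch of the "import, then rsync the output" step could look like the following. The directory, host, and function names are hypothetical placeholders; only the rsync flags (`-a`, `--delete`, `--dry-run`) are standard.

```python
import subprocess
from typing import List


def build_rsync_command(local_dir: str, remote: str, remote_dir: str) -> List[str]:
    """Build the rsync call that mirrors the import output to the server."""
    # A trailing slash on the source copies the directory's contents,
    # not the directory itself.
    source = local_dir.rstrip("/") + "/"
    return ["rsync", "-a", "--delete", source, f"{remote}:{remote_dir}"]


def sync_import_output(local_dir: str, remote: str, remote_dir: str,
                       dry_run: bool = True) -> None:
    """Run rsync; defaults to a dry run so nothing is changed by accident."""
    cmd = build_rsync_command(local_dir, remote, remote_dir)
    if dry_run:
        cmd.insert(1, "--dry-run")
    subprocess.run(cmd, check=True)
```

The temporary VM would call `sync_import_output(...)` as its last step and could then be deleted, so only the hours of the import itself are billed.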

vkrause commented 8 months ago

> But wouldn't it be enough (for the moment) to have an imported Europe map?

That's what we were trying, but so far even that cannot be imported with 128 GB RAM; see #11 and #12 for details. That is probably what we need to solve first here.

exmatrikulator commented 8 months ago

Can you reduce the number of CPU cores used by MOTIS or assigned to the VM? Every thread needs its own RAM, so the import would take longer but might succeed.

derhuerst commented 7 months ago

> I would like to run the import publicly visible, for example on GitHub Actions, in order to enable more people to fix things without having access to the server.

We could use GitHub's self-hosted runners to have the CI connect to the VM. This looks like the most integrated way. (It is also how bbnavi builds OTP's routing graph.)

Technically, this setup could also be configured to only spawn the runner VM on-demand, but that probably requires a fair amount of glue code.

jbruechert commented 7 months ago

The infra I have available unfortunately doesn't have enough resources for the current import in general. Even for a short time, renting large enough VMs is fairly expensive.

From my point of view, we could just not have a background map at all for now, and then work on fully integrating Valhalla, which needs fewer resources.

PartTimeDataScientist commented 7 months ago

> > I would like to run the import publicly visible, for example on GitHub Actions, in order to enable more people to fix things without having access to the server.
>
> We could use GitHub's self-hosted runners to have the CI connect to the VM. This looks like the most integrated way. (It is also how bbnavi builds OTP's routing graph.)
>
> Technically, this setup could also be configured to only spawn the runner VM on-demand, but that probably requires a fair amount of glue code.

We have servers available at the NRW.Mobidrom which we purchased with exactly the use case "preprocessing of routing data" in mind, and which should thus be powerful enough. The bare-metal machines have 32 cores, 256 GB RAM, 2 TB SSD and 12 TB HDD capacity, and we plan to run Proxmox VE as the host OS, so a VM with a self-hosted runner should not be a big deal.

However, I am not yet completely sure about the security implications. GitHub recommends being very careful with self-hosted runners:

> We recommend that you only use self-hosted runners with private repositories. (Source: GitHub documentation)

jbruechert commented 7 months ago

That sounds great!

As long as the VM is not in any "interesting" network, compromising the VM should not be that fatal; we can always just restore a snapshot.

Additionally, we should make sure that jobs on that runner can only be started from the main branch of this repository, where only trusted people can push.
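As a rough illustration of the check described here, the following sketch refuses any job that was not started from the main branch of this repository. It relies on the default GitHub Actions environment variables `GITHUB_REPOSITORY`, `GITHUB_REF` and `GITHUB_EVENT_NAME`. Everything else is an assumption: in particular, such a guard only helps if it runs somewhere a pull request cannot modify (for example a pre-job script on the runner itself), and that those variables are available there would need to be verified.

```python
import os
import sys
from typing import Mapping

# Assumed trust policy: only this repository's main branch may use the runner.
TRUSTED_REPOSITORY = "public-transport/transitous"
TRUSTED_REF = "refs/heads/main"
# pull_request_target runs with the base branch as ref, so block it explicitly.
BLOCKED_EVENTS = {"pull_request", "pull_request_target"}


def job_is_trusted(env: Mapping[str, str]) -> bool:
    """Return True only for jobs triggered from the trusted main branch."""
    return (
        env.get("GITHUB_REPOSITORY") == TRUSTED_REPOSITORY
        and env.get("GITHUB_REF") == TRUSTED_REF
        and env.get("GITHUB_EVENT_NAME") not in BLOCKED_EVENTS
    )


if __name__ == "__main__":
    if not job_is_trusted(os.environ):
        print("Refusing job: not started from the trusted main branch",
              file=sys.stderr)
        sys.exit(1)
```

This is not a GitHub setting, just a self-made safety net, so it should complement snapshot-based recovery rather than replace it.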

PartTimeDataScientist commented 7 months ago

> Additionally, we should make sure that jobs on that runner can only be started from the main branch of this repository, where only trusted people can push.

As far as I understand the docs, exactly that is not possible, although discussions about it have been ongoing for years 😞

For private repositories there is now an option to disable workflows from private forks (link), but the only thing you seem to be able to do for public repositories is to require manual approval of every workflow run from outside collaborators (link).


jbruechert commented 7 months ago

That's pretty bad :(

GitLab has a feature to limit runners to protected branches, which would be exactly what we need here

PartTimeDataScientist commented 7 months ago

We just discussed the topic internally at the Mobidrom. Due to other priorities, setup and configuration of our internal servers might take a few more weeks, and given the implications discussed above we would prefer not to host a runner for a public repository within our network infrastructure.

Nonetheless, as mentioned in the OpenTransportMeetup: we are willing to support the project with a performant Hetzner server whenever needed (even if only to mitigate current performance issues that might later be solved, e.g. by external Valhalla instances).

Looking at the current Hetzner dedicated server offerings, an EX-130S might be well suited: it features a 24-core Xeon 5412-U and 2x3.8TB NVMe SSDs, and can be configured with up to 768GB RAM. That should be more than enough either to host a performant external GitHub Actions runner (we might want to add monitoring/alerting to get notified if someone starts a cryptominer on it 🤷🏻‍♂️) or to serve as the host for the project, whichever is more useful.

Let me know what you think...
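For the monitoring/alerting idea, one simple heuristic would be to alert when CPU load stays high for a long stretch. The sketch below is only an assumption about how that could be approached; the threshold, sample count and function names are hypothetical, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
from typing import Sequence


def sustained_high_load(samples: Sequence[float], threshold: float,
                        min_consecutive: int) -> bool:
    """True if at least min_consecutive samples in a row reach the threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


def current_load_fraction() -> float:
    """1-minute load average divided by the number of CPU cores."""
    cores = os.cpu_count() or 1
    return os.getloadavg()[0] / cores
```

A cron job could append `current_load_fraction()` to a list once a minute and send a notification when `sustained_high_load(...)` returns True. Legitimate import runs would trigger it too, so alerts would have to be matched against the times at which jobs are expected to run.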