transitmatters / gobble

🦃 Process MBTA events into a format that can be consumed by the Data Dashboard
MIT License

Make data available to the dashboard #2

Closed mathcolo closed 10 months ago

mathcolo commented 1 year ago

So right now events are being written to disk on the instance. The data dashboard needs to get a hold of them somehow, though. I see a few distinct options...

  a) Upload the events files to S3 overnight every night, accepting that we just won't have live bus or CR. (lame)
  b) Every time we append an events.csv on disk, upload the entire thing to S3. (maybe, but like...no)
  c) Serve live events over http that the dashboard can request on-demand.

My hunch is that we want to do (c), with some (a) sprinkled in. It's cool when things are live, and we shouldn't give that up. So rough steps:

  1. In a new process, create an express server that serves up events from ./output.
  2. Throw a load balancer in front of it, and wire up the load balancer to a .labs DNS record with the wildcard cert for https.
  3. Maybe add pre-shared key auth, since this is for internal use only?
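Steps 1 and 3 above could be sketched roughly as follows. The thread proposes an Express server; this sketch swaps in Python's standard-library `http.server` instead, purely for illustration. The header name `X-Api-Key`, the helper names, and the default host and port are assumptions, not anything gobble defines, and TLS is left to the load balancer from step 2.

```python
import hmac
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

OUTPUT_DIR = "./output"    # directory named in the issue
AUTH_HEADER = "X-Api-Key"  # hypothetical header carrying the pre-shared key


def is_authorized(presented_key, expected_key):
    """Compare the presented key to the shared key in constant time."""
    if presented_key is None:
        return False
    return hmac.compare_digest(presented_key.encode(), expected_key.encode())


class EventsHandler(SimpleHTTPRequestHandler):
    """Serves static files, rejecting requests that lack the shared key."""

    shared_key = ""  # overridden per server in make_server

    def _reject_unless_authorized(self):
        if is_authorized(self.headers.get(AUTH_HEADER), self.shared_key):
            return False
        self.send_error(401, "Missing or invalid key")
        return True

    def do_GET(self):
        if not self._reject_unless_authorized():
            super().do_GET()

    def do_HEAD(self):
        if not self._reject_unless_authorized():
            super().do_HEAD()


def make_server(shared_key, directory=OUTPUT_DIR, host="127.0.0.1", port=8000):
    """Build (but do not start) a server for files under `directory`."""
    handler_cls = type("KeyedEventsHandler", (EventsHandler,),
                       {"shared_key": shared_key})
    return ThreadingHTTPServer((host, port),
                               partial(handler_cls, directory=directory))
```

Calling `make_server("some-secret").serve_forever()` would start it. Note that `SimpleHTTPRequestHandler` also produces directory listings, which may or may not be wanted for an internal endpoint.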

FAQ

  a) Why can't the dashboard talk to the EC2 instance via its private IP, so that we can keep the EC2 instance off the public internet? That's possible, but it's a pain.
  b) If the EC2 instance has a public IP address, can the dashboard lambda just talk to that? Yes, it could. But the load balancer option lets us easily add https using the wildcard cert, which, even with no private data involved, is good citizenry.

devinmatte commented 1 year ago

I think there's also another option: d) Upload from disk at a steady interval that is close to live.

mathcolo commented 1 year ago

Oh, yeah, lol, that could work! If you do that, please keep it separate from the MBTA-provided events (maybe use a new key, Events-ours instead of Events inside the bucket?). It could also be handy to keep track of uploaded sha256 hashes so files that haven't changed don't get uploaded again.
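The sha256 idea could look something like the sketch below. It is only an illustration: the JSON manifest file, the function names, and the `upload` callback are all assumptions, and the actual S3 call is deliberately left to the caller.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the hex sha256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_changed(root, manifest_path, upload):
    """Call upload(path, relative_key) for each CSV whose hash has changed.

    `manifest_path` is a hypothetical JSON file mapping relative paths to the
    digest recorded at the last successful upload.
    """
    root = Path(root)
    manifest_path = Path(manifest_path)
    seen = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    uploaded = []
    for path in sorted(root.rglob("*.csv")):
        rel = path.relative_to(root).as_posix()
        digest = sha256_of(path)
        if seen.get(rel) == digest:
            continue  # unchanged since the last upload, skip it
        upload(path, rel)
        seen[rel] = digest
        uploaded.append(rel)
    manifest_path.write_text(json.dumps(seen, indent=2, sort_keys=True))
    return uploaded
```

With boto3, the callback might be something like `lambda path, rel: s3.upload_file(str(path), bucket, "Events-ours/" + rel)`, where `s3` is an S3 client and `bucket` is whatever bucket the dashboard reads from; the key layout here is a guess based on the suggestion above.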

Also e) Dynamo?

devinmatte commented 12 months ago

@hamima-halim @mathcolo I won't have time in the next few weeks to work on this, so feel free to grab it. Personally, I think S3 upload will be the easiest to work with on the dashboard side.

hamima-halim commented 11 months ago

I can take this! PR incoming, will do the extremely chill thing of 30-minute-interval scheduled upload jobs.
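A fixed-interval upload job of this kind could be driven by a loop like the one below. This is a minimal sketch, not the approach the PR necessarily took: `run_every`, its parameters, and the `max_runs` escape hatch are all hypothetical, and a cron entry or a scheduler library would work just as well.

```python
import threading
import time


def run_every(interval_seconds, job, stop_event, max_runs=None):
    """Run job() repeatedly, aiming for one start every interval_seconds.

    A failing job is logged and retried at the next interval rather than
    stopping the loop. Returns the number of runs attempted.
    """
    runs = 0
    while not stop_event.is_set():
        started = time.monotonic()
        try:
            job()
        except Exception as exc:  # keep the schedule alive on failure
            print(f"upload job failed: {exc}")
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        elapsed = time.monotonic() - started
        # wait() returns early if stop_event is set, so shutdown is prompt
        stop_event.wait(max(0.0, interval_seconds - elapsed))
    return runs
```

For a 30-minute cadence this would be called as `run_every(30 * 60, upload_job, threading.Event())`, where `upload_job` is whatever function pushes the events files to S3.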