openownership / register

A demonstration transnational register of beneficial ownership data from the UK, Denmark, Slovakia and Armenia
https://register.openownership.org
GNU Affero General Public License v3.0

PSC-STM-A7: Deployment of PSC Stream Ingester App #246

Open tiredpixel opened 7 months ago

tiredpixel commented 7 months ago

Instead of a monthly job run on an EC2 instance, this will need to run continuously, so that new records through the PSC Stream API are detected and ingested immediately.

Subtasks include:

Estimate: 12 hours

tiredpixel commented 5 months ago

Static IP Experiment

Heroku doesn't support static IPs, at least not in its standard cloud offering. This prevents it from being used to host Transformer PSC, which needs a static IP to be whitelisted for OpenCorporates reconciliation. An experiment in routing requests through a proxy or similar will be conducted.

It's worth noting that some time was spent considering a similar approach in Jul 2023 during Register 1; however, that wasn't successful, mostly because SOCKS proxies configured via environment variables (or indeed, even specified explicitly in code) seemingly aren't supported by Net::HTTP::Persistent, which Sources OC ReconciliationClient uses. At that time, no Heroku plugins were tried, however. This time, it would be helpful to check not only SOCKS proxies, but whether an HTTP proxy (ideally using TLS) would be sufficient for the purposes. Alternatively, perhaps some other method is available in Heroku via a network wrapper.

tiredpixel commented 5 months ago

IP Test Case using Net::HTTP::Persistent

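require 'net/http/persistent'

# Take proxy settings from the environment (HTTP_PROXY).
http = Net::HTTP::Persistent.new
http.proxy = :ENV
p http

# ipinfo.io reports the IP address from which the request appears to originate.
uri = URI('https://ipinfo.io')
res = http.request(uri)
puts res.body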

This may be run via a console in Heroku.

For this to work, ensure that HTTP_PROXY is set. This can be copied from whatever env var is set from a Heroku plugin.
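As a minimal sketch of that copying step, the helper below picks the first plugin-provided proxy URL it finds and exports it as HTTP_PROXY. The variable names FIXIE_URL, QUOTAGUARDSTATIC_URL and PROXIMO_URL are assumptions about what each plugin sets (check `heroku config` for the real names), and both helper names are hypothetical.

```ruby
# Assumed names of the env vars set by the Heroku plugins; confirm with
# `heroku config` before relying on them.
ADDON_PROXY_VARS = %w[FIXIE_URL QUOTAGUARDSTATIC_URL PROXIMO_URL].freeze

# Hypothetical helper: returns the first non-empty plugin proxy URL, or nil.
def addon_proxy_url(env = ENV)
  ADDON_PROXY_VARS.map { |name| env[name] }.find { |url| url && !url.empty? }
end

# Hypothetical helper: copies the plugin URL into HTTP_PROXY, unless
# HTTP_PROXY is already set. Returns the resulting value (possibly nil).
def export_http_proxy!(env = ENV)
  return env['HTTP_PROXY'] if env['HTTP_PROXY'] && !env['HTTP_PROXY'].empty?

  url = addon_proxy_url(env)
  env['HTTP_PROXY'] = url if url
  env['HTTP_PROXY']
end
```

Calling `export_http_proxy!` before creating the Net::HTTP::Persistent client would then let `http.proxy = :ENV` pick the proxy up, assuming the plugin's URL is an HTTP proxy URL.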

tiredpixel commented 5 months ago

Fixie

https://elements.heroku.com/addons/fixie

Fixie Socks

https://elements.heroku.com/addons/fixie-socks

QuotaGuard Static IP's

https://elements.heroku.com/addons/quotaguardstatic

QuotaGuard Shield Static IP's

https://elements.heroku.com/addons/quotaguardshield

IPBurger Static IPs

https://elements.heroku.com/addons/ipburger

Proximo

https://elements.heroku.com/addons/proximo

tiredpixel commented 5 months ago

Static IP Experiment Conclusion

SOCKS proxies are not supported by Net::HTTP::Persistent. Whilst these would usually be my preference, they are not necessary, since we require only HTTP/HTTPS traffic. So, we can ignore those options which support only SOCKS, as well as the SOCKS configurations of those options which also provide HTTP or HTTPS proxies.

HTTPS proxies are also not supported by Net::HTTP::Persistent. This is unfortunate. However, note that it's still possible to access HTTPS sites using CONNECT, and that most of the connection is encrypted:

The most common form of HTTP tunneling is the standardized HTTP CONNECT method. In this mechanism, the client asks an HTTP proxy server to forward the TCP connection to the desired destination. The server then proceeds to make the connection on behalf of the client. Once the connection has been established by the server, the proxy server continues to proxy the TCP stream to and from the client. Only the initial connection request is HTTP - after that, the server simply proxies the established TCP connection.

https://en.wikipedia.org/wiki/HTTP_tunnel
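To make the tunnelling step concrete, here is a rough sketch of the plain-text request a client sends to an HTTP proxy before any TLS handshake with the destination begins. It only builds the request string and doesn't open a socket. The function name and example credentials are hypothetical; HTTP client libraries normally construct this request internally.

```ruby
# Hypothetical helper: builds the CONNECT request that is sent to an HTTP
# proxy in cleartext. Credentials, if given, use HTTP Basic authentication.
def connect_request(host, port, user: nil, password: nil)
  lines = ["CONNECT #{host}:#{port} HTTP/1.1", "Host: #{host}:#{port}"]
  if user
    # pack('m0') produces Base64 without line breaks.
    token = ["#{user}:#{password}"].pack('m0')
    lines << "Proxy-Authorization: Basic #{token}"
  end
  lines.join("\r\n") + "\r\n\r\n"
end

puts connect_request('ipinfo.io', 443, user: 'user', password: 'secret')
```

The destination host and port, along with any proxy credentials, travel unencrypted between client and proxy; only once the proxy has accepted the request does the TLS session with the destination run through the tunnel.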

Most of the connection is not all of the connection, however. This leaves us with a couple of options:

  1. Rewrite Sources OC code to change from Net::HTTP::Persistent to a library supporting HTTPS proxies (or even SOCKS proxies). This would likely not be a large amount of work, since most of the code is just a few lines long. However, it would be necessary to check carefully for any differences in methods available in an alternative library.
  2. Leave existing Sources OC code as-is (almost), but use an HTTP proxy. Since all the data being accessed is available publicly anyway, and this is just for the OpenCorporates reconciliation connection (not for any database or similar), then this is likely acceptable.
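Option 2 might look roughly like the following. This sketch uses the standard library's Net::HTTP rather than Net::HTTP::Persistent, purely so that it runs without extra gems; it is not the Sources OC code. The helper name and the proxy URL are hypothetical.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not open) a client for `target`
# whose traffic is routed via the HTTP proxy at `proxy_url`.
def client_via_http_proxy(target, proxy_url)
  target = URI(target)
  proxy = URI(proxy_url)
  http = Net::HTTP.new(target.host, target.port,
                       proxy.host, proxy.port, proxy.user, proxy.password)
  # HTTPS targets are reached by tunnelling TLS through the proxy.
  http.use_ssl = (target.scheme == 'https')
  http
end

# Placeholder proxy URL; a real one would come from the proxy provider.
http = client_via_http_proxy('https://ipinfo.io', 'http://user:secret@proxy.example:80')
puts http.proxy_address
```

A request made with this client (for example `http.get('/')`) should then reach the destination from the proxy's IP address rather than from the app's own.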

Given (2), which particular Heroku plugin to use (that is, which third-party service) isn't so important. There are a number of options, and some prices are similar depending on the number of requests per month (which we don't currently know). So, we could start with one, and change it easily within a few minutes, with no code changes required. In the case we proceed with HTTP proxies, I'd suggest Fixie or Proximo for this use case, in the first instance.

Also given (2) (so, no rewrite), there is one small change needed, which I'll submit in a PR momentarily.

With one of these options in place, we could plan to host the new Ingester PSC and Transformer PSC apps in Heroku as new apps, configured with proxy plugins, and provide those IPs to be whitelisted. This could be 1 IP per app (Proximo and Fixie alternative config), or 2 IPs per app (Fixie default), resulting in 2-4 IPs to be whitelisted (since we would need stg and prd apps).

If everything worked satisfactorily, there would be the option to reconsider the existing EC2 and dev whitelisted IPs, since using a proxy should theoretically also be possible there, too (although a proxy provider would still have to be found).

However, it's useful to note that these Heroku plugins are not doing anything special; they are just proxy providers. So, it would in fact be possible to use other proxy providers instead, without installing the plugins, or alternatively to sign up directly for one of those services and share the usage across multiple applications. For example, Fixie could be used directly (https://usefixie.com/), but there are many others, including ones not listed as Heroku plugins.

tiredpixel commented 5 months ago

To clarify something I muddled: Ingester PSC doesn't need to run OpenCorporates reconciliation; rather, Transformer PSC does. This means that Ingester PSC doesn't need access to OpenCorporates or any IPs whitelisted, but Transformer PSC does. All the IP-related experiments and notes in this ticket still stand, but apply to Transformer PSC, not Ingester PSC. Thus, they should have been done under https://github.com/openownership/register/issues/252 rather than this ticket.

My apologies for the confusion.

tiredpixel commented 5 months ago

Ingester PSC has been deployed to Heroku.

There is only a production app, since it's not possible for us to run staging apps on the Register data pipelines within our current setup.

Ingester PSC on Heroku is now intentionally in a crash loop; this will be lifted when https://github.com/openownership/register-ingester-psc/pull/33 is merged once the streaming code is ready to go live.

tiredpixel commented 5 months ago

Ingester PSC on Heroku is now live and streaming updates from the PSC datasource.