neo4jrb / neo4j-core

A simple unified API that can access both the server and embedded Neo4j database. Used by the neo4j gem
MIT License
99 stars 80 forks source link

Support TLS connections in BOLT adapter #312

Closed pepesenaris closed 6 years ago

pepesenaris commented 6 years ago

This is an early attempt on adding TLS support for the BOLT adaptor. Right now I could use some help/feedback on the following topics:

Given that, where should be check that server certificates are valid? Only in the bolt adaptor or somewhere else in the gem?

Pings: @cheerfulstoic @subvertallchris

coveralls commented 6 years ago

Coverage Status

Coverage decreased (-0.1%) to 74.419% when pulling 9d000cb9dbb01980a17ad530a031461aafa09a12 on pepesenaris:tls-support-for-bolt into b96740c51514043146281a85279925abd6b49a30 on neo4jrb:master.

kfreytag commented 6 years ago

To add some context to that comment above, I was suggesting that a client (or driver) that had already implemented SSL / TLS should transparently support an end-user configuring a secure connection without any mucking with certificates. That is, an end-user of the client or driver should simply be able to say "secure = true" and have the client / driver take care of the underlying work.

That's different than what you're trying to accomplish here, @pepesenaris. Here, you actually need to do the work to isolate the client user from the implementation. Let me try to find you an example from one of our drivers...

pepesenaris commented 6 years ago

right now, my idea is to look for other neo4j clients, maybe python, that already implements this and "get" some good ideas from there.

kfreytag commented 6 years ago

👍 I've also asked internally (to Neo4j, Inc.) for just such an example

kfreytag commented 6 years ago

@pepesenaris try this : https://github.com/neo4j/neo4j-python-driver/blob/1.6/neo4j/bolt/connection.py#L552

cheerfulstoic commented 6 years ago

Thanks for the start on this @pepesenaris ! Looking at the python implementation seems like a good way to jump start this.

The main thing that I would be concerned about is that this is tested. I would ask that it be tested locally and maybe, if possible, with a cloud-based Neo4j service. We might be able to ask one of the providers (GraphStory or GraphineDB) if we can have a temporary instance of their smallest tier for testing this feature since I think they'd like it to be available to their users. I can also test it myself when it's ready.

When it's all ready, it should be fine to just release a new version. It would probably be OK to release a minor version (since the current version is 8.1.1 the next would be 8.2.0) since it's a backward-compatible, non-fix change.

pepesenaris commented 6 years ago

I'll get to this first thing monday morning. I believe @estebanprimost already made a great jump start on this as well. We kind of need this for a client project we are working on at @ingsw-dev. We'll ping back as soon as possible.

pepesenaris commented 6 years ago

@cheerfulstoic We are under early early earlyyyyyy stages of testing this in a product right now. The graph is hosted in Graphene. It seems to be working. We could use some directions on how to address the failing coverage checks.

Shouts to @estebanprimost who did the bulk of painful trial-and-error work to get to this initial code.

I'm guessing there's room for improvement, in particular, in trying to get closer to the official python driver which makes more checks regarding the connection. Maybe @kfreytag can lead us to the proper wingman here. He already helped a lot in the slack channel and mentioned some other Neo4J champions that could help.

kfreytag commented 6 years ago

@pepesenaris (cc: @cheerfulstoic), I'm roping in @technige, who wrote py2neo and is the head of the drivers team here at Neo4j.

pepesenaris commented 6 years ago

@cheerfulstoic adding support for TLS should change the supported versions of the driver? Python uses: [2, 1, 0, 0] but we are not sure if this is directly related to the ssl support

cheerfulstoic commented 6 years ago

This looks great to me, thanks! It looks like the tests are passing, though the coverage has dropped a bit. It would probably be relatively easy to add some tests around the new classes (maybe they could even be in separate files), though that's probably not such a big deal if there can be at least one test run on Travis CI which is using Bolt over TLS. That's all configured in the .travis.yml file. In CI the neo4j-rake_tasks gem is used to download and run a version of Neo4j. Does the Neo4j server have TLS set up by default, so maybe this would just perhaps be a matter of using a different URL? Maybe @kfreytag or @technige could say.

I think that it's OK to be accessing Neo4j with the old version and using TLS. To update to the new version it might be the case that the gem would need to support some of the newer features in the Bolt protocol or else suffer bugs.

pepesenaris commented 6 years ago

We have found some errors today, while testing this. I did a quick search over the JS driver and it seemed to me that they support TLS and use version [1,0,0,0]. We'll try to cover that and get some ideas on the bugs with help from the neo4j team. In the mid time I'll put over a brief description of some of the issues we have found, maybe they come up while developing the first version of the adaptor.

The strange behavior, at least for my limited experience, is that the errors bellow occurs after a while of using efectively the app, and having queries being executed properly. If we just restart the servers the connection works. it's just after a little time pass that we start getting errors all over the place

pepesenaris commented 6 years ago

@cheerfulstoic: Does any of this ring a bell? 1- undefined method `bytes' for nil:NilClass. Happens in

STREAM_INSPECTOR = lambda do |stream|
            stream.bytes.map { |byte| byte.to_s(16).rjust(2, '0') }.join(':')
          end
from neo4j/core/cypher_session/adaptors/bolt.rb:194:in `block in <class:Bolt>'
  from neo4j/core/cypher_session/adaptors/bolt.rb:224:in `log_message'
  from neo4j/core/cypher_session/adaptors/bolt.rb:205:in `block (2 levels) in recvmsg'
  from neo4j/core/cypher_session/adaptors/bolt.rb:398:in `block in receive_message'
  from neo4j/core/cypher_session/adaptors/bolt.rb:397:in `tap'
  from neo4j/core/cypher_session/adaptors/bolt.rb:397:in `receive_message'
  from neo4j/core/cypher_session/adaptors/bolt.rb:204:in `block in recvmsg'

2- Encoding::CompatibilityError incompatible character encodings: ASCII-8BIT and UTF-8. happens in:

            def send_message(message)
              @ssl_socket.write(message)
            end
CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
  from openssl/buffering.rb:316:in `do_write'
  from openssl/buffering.rb:343:in `write'
  from neo4j/core/cypher_session/adaptors/bolt.rb:393:in `send_message'
  from neo4j/core/cypher_session/adaptors/bolt.rb:199:in `sendmsg'
  from neo4j/core/cypher_session/adaptors/bolt.rb:189:in `block in send_job'
  from neo4j/core/cypher_session/adaptors/bolt.rb:186:in `tap'
  from neo4j/core/cypher_session/adaptors/bolt.rb:186:in `send_job'
  from neo4j/core/cypher_session/adaptors/bolt.rb:133:in `send_query_jobs'
  from neo4j/core/cypher_session/adaptors/bolt.rb:54:in `block in query_set'

3- Errno::ETIMEDOUT: Connection timed out Same as 2-.

            def send_message(message)
              @ssl_socket.write(message)
            end
Errno::ETIMEDOUT: Connection timed out
  from openssl/buffering.rb:325:in `syswrite'
  from openssl/buffering.rb:325:in `do_write'
  from openssl/buffering.rb:343:in `write'
  from neo4j/core/cypher_session/adaptors/bolt.rb:393:in `send_message'
  from neo4j/core/cypher_session/adaptors/bolt.rb:199:in `sendmsg'
  from neo4j/core/cypher_session/adaptors/bolt.rb:189:in `block in send_job'
  from neo4j/core/cypher_session/adaptors/bolt.rb:186:in `tap'
  from neo4j/core/cypher_session/adaptors/bolt.rb:186:in `send_job'
  from neo4j/core/cypher_session/adaptors/bolt.rb:133:in `send_query_jobs'
  from neo4j/core/cypher_session/adaptors/bolt.rb:54:in `block in query_set'
cheerfulstoic commented 6 years ago

Gah, sorry, I didn't realize I went so long without answering your question.

The undefined method 'bytes' for nil:NilClass issue is not something that I've seen. I was, however, playing around with benchmarking the Bolt adaptor using rbspy recently and I found that a chunk of time was spent in generating these strings for debugging even when they weren't being outputted (🤦‍♂️ ). I just introduced #316 which should fix that. That should make things faster and, while probably not fixing the issue that you've found, at least make it not happen outside of debugging (which only happens if you go out of your way to enable it. It doesn't even happen in development mode because it's extremely verbose)

The timeout issue is something that I've definitely seen and is one of the reasons that I've not been able to fully recommend the Bolt adaptor yet. I will be looking at that in the coming week or two, though. In that PR above I also introduced a rake task which I plan on using to stress-test the Ruby Bolt implementation to be able to reproduce the issue.

Lastly, the incompatible character encodings: ASCII-8BIT and UTF-8 message is also not something that I've seen. Can you reproduce it reliably?

cheerfulstoic commented 6 years ago

I'm going to merge this into master so that I can test the changes, thanks so much and I'll update here with what I find!

pepesenaris commented 6 years ago

Hi @cheerfulstoic, we've been caught up with tons of other issues in the same project we initially tried to add this. Thanks for getting back at this. I have some time off coming soon by the end of next week, with little to no access to internet, but I'm willing to jump in and help before that or later in june.

Right now, not sure about: incompatible character encodings: ASCII-8BIT and UTF-8. I'll get back to you if I can reproduce it again.

cheerfulstoic commented 6 years ago

Cool, thanks!

As I'm looking at the code, another thing that I might try that I talked about with Nigel earlier: Rather than the client Ruby app having to be configured for SSL (@options[:ssl]), I think it makes sense to try the SSL connection first and, if that fails, fall-back to the non-SSL. Better to have both "secure by default" and not having to bother the user with configuration. If we get into the (hopefully) edge-case of "I have a server configured to accept both but I want Ruby to connect over non-SSL", then it should be easy to add the option later, I think. This is assuming that this fallback strategy that I'm proposing will actually work ;)

cheerfulstoic commented 6 years ago

Looking at it further, I actually see that the ssl option allows for custom SSL options, so it will probably be required for custom certs and whatnot. I'll feel it out ;)

cheerfulstoic commented 6 years ago

One other question for when you have the time: Were you able to connect to Neo4j with the default system certs? I'm still figuring things out, but I think I'm having to add the .cert file from the Neo4j server to a cert_store which I set on the ssl_context. Also, did you verify that you were connecting over SSL? I'm setting dbms.connector.bolt.tls_level=REQUIRED in my conf/neo4j.conf to make sure it's not accepting non-TLS connections.

Ideally users wouldn't have to copy/specify the cert file, but I'm not sure what the ideal situation is. Perhaps @technige or @kfreytag might be able to say what the ideal workflow is for any given Neo4j client.

pepesenaris commented 6 years ago

@cheerfulstoic Quick answer: We connected over SSL with a graphene endpoint. They only accept secure connections AFAIK when using BOLT.

kfreytag commented 6 years ago

@cheerfulstoic On the question of "secure by default", I think it'd be best practice to both be secure by default and to throw an error if the connection fails to connect securely. One of the bugaboos of secure connections is a man-in-the-middle attack. If you silently fall back to an insecure connection, you can open that door pretty easily.

cheerfulstoic commented 6 years ago

@kfreytag An very good point, thanks! I will make it secure-by-default and fail if it couldn't connect securely, but allow users to turn SSL off.

I'm going to pause on this for a couple of hours, but I was able to get a connection to Neo4j while TLS was REQUIRED, so that's awesome. I'm actually thinking it would be really comforting to make the logger statements differentiate between BOLT+TLS and BOLT UNSECURE depending on which connection is being used. But so far I need to give Ruby the path to the .cert file from the Neo4j server. Is that something that users will be required to do?

cheerfulstoic commented 6 years ago

Maybe that's only because I'm setting up a version of Neo4j locally from the .tar.gz? Do Neo4j services like GraphStore and Graphene use certs that are in public keystores? If so, I should perhaps try that out.

kfreytag commented 6 years ago

@cheerfulstoic Why don't I set you up with a Cloud instance against which you can test?

cheerfulstoic commented 6 years ago

@kfreytag got me set up with a Neo4j cloud instance and I had success testing without needing to specify a certificate. I think things are pretty much working now. I've made a 8.2.x branch of neo4j-core. It includes changing to SSL by default and it outputs in the logs either BOLT+TLS or BOLT UNSECURE depending on what connection is established. There's no fallback. It simply tries to establish whatever connection was configured.

I also made some documentation to cover how to do this:

http://neo4jrb.readthedocs.io/en/9.3.x/Setup.html#configuring-bolt-tls

Like it says, if you're using a cloud instance it should just be a matter of specifying a URL, but it also covers defining your own certificate and allows for more advanced SSL configuration.

cheerfulstoic commented 6 years ago

If you'd like to give it a try you can do this in your Gemfile:

gem 'neo4j', github: 'neo4jrb/neo4j-core', branch: '8.2.x'`

I'm going to try setting it up with our app here at work as well.

Next I'll try seeing if I can reproduce the timeout issue.

pepesenaris commented 6 years ago

Let me wrap up the current PR, and I'll try to give it a shot

pepesenaris commented 6 years ago

I tested shortly today with the graphene server. I did not find timeout issues. I'll try to get back to it during the weekend. I did get stuck setting up the local environment using the neo4j image, I'll give another try as soon as possible

cheerfulstoic commented 6 years ago

:+1: Thanks!

Small update, I was able to get the code working with TLS in a local docker container as well by mounting the certificates directory as a volume. I ran our tests and they seemed to run mostly fine. Here are the results:

Bolt:
Finished in 14 minutes 7 seconds (files took 11.94 seconds to load)
4838 examples, 21 failures, 2 pending

HTTP:
Finished in 15 minutes 29 seconds (files took 15.45 seconds to load)
4838 examples, 25 failures, 2 pending

I looked at the failures a bit and I don't think they are related to Bolt (in fact, there are fewer errors with Bolt, which is interesting). I haven't run the test suite in a while so my setup might not be configured quite right. But the fact that it ran 1 minute and 22 seconds faster is encouraging.

cheerfulstoic commented 6 years ago

After fixing a couple of things it ran even faster:

Finished in 11 minutes 47 seconds (files took 15.72 seconds to load)
4838 examples, 7 failures, 2 pending

The tests ran in 13 minutes 32 second on our CI server recently, so that's not bad...

Now I just want to figure out the timeout issue ;)