humio / issues

Issue Tracker for Humio

humio fails to start if domain name `humio` resolves #62

Closed StianOvrevage closed 5 years ago

StianOvrevage commented 5 years ago

I have been troubleshooting a tricky issue for a while and was able to make Humio work again today.

It seems that if the DNS name `humio` resolves, that IP will be used when the different processes try to bind, ignoring any HUMIO_SOCKET_BIND setting (localhost or 0.0.0.0).

We run Humio on Kubernetes and we expose it through a Service object, which gets its own virtual IP to de-couple the instance from the DNS name and IP that clients use. Services get their name added to the DNS running in the cluster, so for example the Kubernetes API is available at the `kubernetes` hostname.

When setting up a Service for Humio we name it just that, `humio`, which causes `humio` to resolve to the Service (virtual) IP. The instance then tries to bind to that IP and of course fails, since it's not an IP present on the host.

A workaround for now is to name the Service something other than `humio`, but that is what 99% of Kubernetes users will name this Service, and they will probably hit the same problem we did.
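For reference, a Service along these lines is all it takes to trigger this (the selector and port are illustrative, not our exact manifest):

```shell
# Illustrative only: create a Service whose name is "humio"
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: humio          # this name becomes the in-cluster DNS entry "humio"
spec:
  selector:
    app: humio         # illustrative label selector
  ports:
    - port: 8080       # illustrative port
      targetPort: 8080
EOF
```

As soon as this Service exists, `humio` resolves to its virtual (cluster) IP from every pod in the namespace.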

mwl commented 5 years ago

Hi @StianOvrevage. Thanks for reporting. From your description it should be reproducible with the following command

```shell
docker run --add-host=humio:1.1.1.1 -p 8080:8080 -it --rm humio/humio:1.2.3
```

Although, that seems to work just fine. Any chance you can come up with an easier way of reproducing the issue, or give us some more insight into the configuration of your scenario?

kaspernissen commented 5 years ago

We encountered the same problem when we moved Humio to Kubernetes a year ago (it was reported back then). The problem is that Kubernetes injects environment variables into the container, as described here: https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables

One of these variables clashes with the HUMIO_PORT variable, which results in an error on startup of Humio. The workaround, as mentioned by @StianOvrevage, is to choose a name other than `humio` for the Service in your Kubernetes setup.
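To make the clash concrete: for a Service named `humio`, Kubernetes injects Docker-links-style variables like the ones below into every container in the namespace (the IP and port here are only examples), and the last command is an untested sketch of how the failure could probably be reproduced without Kubernetes:

```shell
# Environment variables Kubernetes injects for a Service named "humio"
# (IP address and port are illustrative):
HUMIO_SERVICE_HOST=10.0.171.239
HUMIO_SERVICE_PORT=8080
HUMIO_PORT=tcp://10.0.171.239:8080            # clashes with Humio's own HUMIO_PORT setting
HUMIO_PORT_8080_TCP=tcp://10.0.171.239:8080
HUMIO_PORT_8080_TCP_ADDR=10.0.171.239

# Untested sketch: mimic the injected variable outside Kubernetes
docker run -e HUMIO_PORT=tcp://10.0.171.239:8080 -p 8080:8080 -it --rm humio/humio:1.2.3
```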

mwl commented 5 years ago

Thanks @kaspernissen!

In that case, I struggle to see how we can change that in a way that respects existing configurations. I'm happy to reopen the issue if you have any suggestions.

StianOvrevage commented 5 years ago

I see now that the cause is the env vars and not DNS.

At least the documentation should be updated to help future Kubernetes users avoid this gotcha.

mwl commented 5 years ago

@StianOvrevage Agreed. The problem is, though, that we don't recommend running Humio on Kubernetes, due to problematic support for stateful services. Please prove me wrong, because we'd love to provide better support for things like Kubernetes.

StianOvrevage commented 5 years ago

I don't agree that Kubernetes has problematic support for stateful services per se. As Kelsey mentions in the thread, the problem is that Kubernetes does not have knowledge of every stateful application and how it works internally, so the application needs to take care of a few hard problems itself ("Stateful services must meet Kubernetes half way and manage their own cluster membership, failover, and replication.").

When running a single-node Humio instance, most of these concerns are not really relevant anyway, I guess.

We do not have the option of running services outside of Kubernetes, and I guess that goes for many others. The alternative self-hosted logging solution is the ELK stack, which is a nightmare to get running stably on Kubernetes :\

kaspernissen commented 5 years ago

Completely agree with @StianOvrevage.

We have successfully been running Humio in Kubernetes as a single instance for well over a year now, without any big issues, and certainly none related to Kubernetes. We haven't had the need to build out a big cluster just yet (even though we might begin to see the need).

As Stian also mentions, you can't expect Kubernetes to take care of everything for you as an application, and therefore it's up to the application to handle many of the distributed computing problems, which many of them do anyway. There's also a move in the community to build custom resources for managing stateful applications, and many are running stateful applications, databases, event stores, etc. in production at a fairly large scale on top of Kubernetes with great success. There's no silver bullet, and officially supporting another platform takes up resources, but I think there's a market for you here :)

pmech commented 5 years ago

Hi @StianOvrevage and @kaspernissen

I believe you are right that there is a market for running Humio clusters on Kubernetes, and I think we will get there at some point. Having Humio as a Helm chart would be cool and something that would broaden our reach.

I don't have much experience with k8s, so if you have, say, 5 dedicated servers (most of our customers buy dedicated HW) for running a Humio cluster, what would be the main benefit of using k8s instead of just installing and running it "the old way"?

Also, do you guys run your stateful services (DBs, message queues, and such) on k8s?

It should also be noted that in order to run a Humio cluster in Kubernetes, you also have to run a Kafka cluster. It seems that Confluent has done some work on this, so it might be readily available today. Has any one of you had experience running Kafka on k8s?

StianOvrevage commented 5 years ago

We are at the other end of the spectrum. We have no access to the physical hardware since we use Microsoft Azure. And Kubernetes is the next step, where we no longer have access to the virtual machine either, since we use AKS, a managed Kubernetes service. We only interact through manifest files describing how we expect the system to be, and Kubernetes continuously ensures that the system stays as close to that state as possible.

To put it another way, we don't ever want to need to think about physical or virtual hardware. We only care about containers. That frees up a lot of time and resources that we can use on higher level activities.

With this we can run lots of workloads without anyone caring about networking, storage, managing and patching OSes, and all the other stuff required to keep the lights on. We deploy an application, and AKS/Kubernetes takes care of the rest automatically and will, for example, restart or move an application if there is a failure or if resource constraints in the overall cluster change. It's all pretty cool, and I predict it will be a similar story to VMware and become the standard way of doing things for the next 10+ years.

I probably don't have as much experience as @kaspernissen. A single-instance setup was very, very easy to accomplish (it took me a few hours and it has been running fine ever since).

I have not run Kafka or any other serious distributed computing system on Kubernetes yet. But as you mention, there are people doing it. I think it's possible, but it does require the software itself to expect some level of failure in storage, network and compute, and to be able to recover automatically when a cluster is split, a node disappears, or similar.

pmech commented 5 years ago

Thanks for the answer @StianOvrevage and the feedback in general!

I agree that, in your context, having Humio on k8s is the right (only) thing to do, and I think you are correct that we will see this more and more in the coming years. I'm still not convinced that k8s is mature enough for building stateful, distributed systems, especially when I have to support and debug them over a video link in the middle of the night, but I'm not a first-mover in this regard either 😃. We are continually evaluating, though, and maybe the time is right to have a single-instance Humio as a Helm chart.

(Great discussion after this issue was closed 😃).

kaspernissen commented 5 years ago

@pmech I made this Helm chart for Humio half a year ago: https://github.com/kaspernissen/k8s-humio (it's by no means production ready), but it gets you going quite easily with Humio in Kubernetes (minikube). To run it as a single instance, it wouldn't require much more work to configure it into a setup comparable to what we run at Lunar Way.

But you are right, distributed Humio also requires you to run Kafka and ZooKeeper clusters, and as you mention, Confluent provides an operator for Kafka on Kubernetes, but there also exist several other Helm chart solutions. I tried the one in kubernetes-incubator a while ago, and it was super easy to get up and running, but I did not use it for any production system. We do, however, run our RabbitMQ setup as a clustered solution on top of Kubernetes, which runs quite nicely. We currently don't run any production databases on Kubernetes, as we have our databases as a managed solution in AWS at the moment. We do, however, plan on moving more of our stateful services to Kubernetes.
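For reference, getting the incubator chart running was roughly this simple (Helm 2 syntax; the repo URL and release name here are from memory, so double-check them):

```shell
# Add the kubernetes-incubator chart repo and install the Kafka chart (Helm 2 era)
helm repo add incubator https://kubernetes-charts-incubator.storage.googleapis.com/
helm install --name my-kafka incubator/kafka
```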

You have an example/reference implementation for Nomad; a similar thing for Kubernetes would be awesome. You probably wouldn't have to support it either, since I assume you don't support the Nomad solution because it's just a reference implementation?

There are plenty of articles and conference talks arguing that Kubernetes has crossed the chasm and that adoption is exploding, which is also what I hear in the community, both here in Denmark and abroad.

pmech commented 5 years ago

Thanks @kaspernissen

All good comments and good to hear that you have started running stateful services with success.

This discussion is pushing us to look at this and see what we can do in both the short and long run. Having something similar to our Nomad reference would be good. I'll keep you posted on any progress regarding this.

Best regards,

henrikjohansen commented 5 years ago

@kaspernissen Running Humio under Nomad is supported (we are a customer and have written the ref. implementation), mainly because it does not stray from how you would deploy & run Humio by hand (i.e. it upholds the expectations that Humio and its dependencies require to operate). Essentially it's only used for bootstrapping / orchestrating the container deployments, which is good enough for us :)

Humio (the clustered version) simply is not capable of running as a distributed and fault-tolerant system in a dynamic, cloud-ish environment ATM and this will probably not change in the foreseeable future.

kaspernissen commented 5 years ago

I'm aware, @henrikjohansen, great to see customers contributing reference implementations. Great job!

This could probably be done fairly simply with Kubernetes as well: just expose Humio on the host and don't rely on the container network or Kubernetes DNS. But of course, that's not very 'Kubernetes-like'. What we are thinking of doing is to use the Kubernetes StatefulSet resource and spin up three or more dedicated nodes within our Kubernetes environment for only running an instance of Humio-core, Kafka, and ZooKeeper, and then rely on the stable name for each container provided by Kubernetes. Depending on the guarantees we want for our logs, we should be able to tweak the replication factor or partitions in Kafka. But would that not be a fault-tolerant setup?
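To sketch what I mean (every name, image, port and size below is a placeholder, not a tested setup): a headless Service plus a StatefulSet gives each Humio-core pod a stable DNS name and its own volume, and Kafka and ZooKeeper would get the same treatment:

```shell
# Placeholder sketch: headless Service + StatefulSet for stable per-pod DNS names
kubectl apply -f - <<'EOF'
# Headless Service: each pod gets a stable name such as
# humio-core-0.humio-core.<namespace>.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: humio-core
spec:
  clusterIP: None
  selector:
    app: humio-core
  ports:
    - port: 8080
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: humio-core
spec:
  serviceName: humio-core            # ties the pods to the headless Service above
  replicas: 3
  selector:
    matchLabels:
      app: humio-core
  template:
    metadata:
      labels:
        app: humio-core
    spec:
      containers:
        - name: humio-core
          image: humio/humio-core:x.y.z   # image name and tag are placeholders
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: data
              mountPath: /data            # placeholder data path
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi              # placeholder size
EOF
```

With serviceName set, the pods are reachable at humio-core-0.humio-core, humio-core-1.humio-core, and so on, which is the stable naming I referred to above.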

Maybe I'm missing something, but what is it that makes Humio incapable of running as a distributed, fault-tolerant system in a dynamic, cloud environment at the moment? Humio e.g. provides Ansible scripts for spinning up a distributed setup in AWS and Packet cloud - but is this not a fault-tolerant setup?

In your opinion, where does Humio fall short in providing a fault-tolerant distributed setup? And what is it in terms of the cloud environment that is the problem? Is it the ephemeral nature of a cloud VM, or is it the networking? Where exactly would the pitfalls be?

I'm very interested in understanding where you see the limitations, and in hearing your thoughts on how to actually run Humio in a clustered, distributed, fault-tolerant way. Thanks.

henrikjohansen commented 5 years ago

@kaspernissen I never specifically said anything about "cloud" :) If you can meet the expectations that Humio has (see below) it does not matter what you deploy on or how you deploy.

Most of the challenges are related to Humio and not a cloudy environment per se but I know that the team is working on addressing some of the current shortcomings (namely lack of high availability).

Now, Humio has the implicit expectation that your nodes / network / storage / cluster are stable and predictable -- anything else is a rare event. This is fine, we can architect around that, but it is not suitable for what I would call a "dynamic, cloud-ish environment" or for being particularly awesome at "fault tolerance" (did I mention the lack of high availability? :) ).

IMHO software suitable for such an environment must be able to deal gracefully with failures (think dead nodes, network latency issues, slow nodes, spurious reboots, etc), play nicely with cloudy tools (think service discovery, etc), be scalable across zones and regions, support dynamic capacity management (grow / shrink the cluster), have proper built-in monitoring, etc, etc.

Humio is just not designed for ☝️ and that's fine as long as you don't expect otherwise :)

Just for the heck of it try killing one of your humio-core nodes and see what happens ... or inject packet drops between the nodes ... or increase the network latency between nodes ... or .... :)

On the plus side - we have hammered the shit out of Humio for the last 2 years, switched the complete hardware implementation twice without losing data, and all-in-all had relatively few issues aside from some performance teething problems (which were largely due to the fact that we have one of the busiest Humio clusters out there).

So to reiterate my previous point - if you can meet the expectations that Humio has been designed for it does not matter what you deploy on or how you deploy ... it will probably work just fine.

mortengrouleff commented 5 years ago

The next version (1.2.5) refuses to start if HUMIO_PORT is set to something that is not an integer, and reports this as an error instead. It was a bug that HUMIO_SOCKET_BIND was overridden by what looked like a bind address in HUMIO_PORT.
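In other words (the values here are only examples):

```shell
# Accepted: a plain integer port
HUMIO_PORT=8080
# Rejected with an error at startup from 1.2.5 on; this is the form Kubernetes
# injects for a Service named "humio"
HUMIO_PORT=tcp://10.0.171.239:8080
```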

kaspernissen commented 5 years ago

@henrikjohansen Thank you for elaborating.

I agree with you that software that runs in a 'cloudy' environment should be able to cope with failures. This is, I guess, sort of the first rule of building a distributed system: dealing with the fallacies of distributed computing.

As far as I understand this thread, the problem is not with Kubernetes itself, but with the way humio-core clustering is built? (Great to hear that the team is working on improving the availability.) Would you say the only reason to distribute Humio into a clustered setup at the moment is to gain performance, and not high availability?

We haven't tried clustering Humio yet, but we are considering it. Can you elaborate a bit more on how humio-core deals with the failure scenarios that you list? It may be fine for us if we can build automation around it ourselves.

pmech commented 5 years ago

Hi @kaspernissen, answering some of your Humio-specific questions.

Currently, when a Humio node stops, the data that node is responsible for will be unavailable until it has returned to the cluster. We are changing this fast! We expect to have a new version out by the end of this month, where all data will remain available, even during a node crash.

kaspernissen commented 5 years ago

So, humio-core shards the data between the instances, but there is no replication between them at the moment? And that is what is coming this month, @pmech?

In the ingest scenario, there's no problem with a single node failure, because we can load balance across the humio-core instances, right (if the number of instances is sufficient)? It's just the scenario where we need to query data belonging to a node that just disappeared? If that node can come up again with the same DNS name, and with the volume correctly mounted, then it should be possible to query the data again?

pmech commented 5 years ago

We have replication in place. The new version will enable Humio to automatically use this and fall back to another replica for searching.

You are correct regarding ingest.

If you are considering setting up a cluster, I think we should do a call. Just let me know, @kaspernissen.

Regards,

kaspernissen commented 5 years ago

Sounds good, @pmech.

I will reach out when we dig into building a clustered setup.