lambdanil / nix-problems

The many issues plaguing Nix
GNU General Public License v3.0

Nix Problems #2

Closed than3-bits closed 8 months ago

than3-bits commented 1 year ago

I enjoyed the write-up. One thing you don't really mention is the basis for packages in Nix. I'm not sure how it compares with Guix, but from what I've seen, Nix is basically just mapping a hash to a package and treating it as unique (only one solution). As a result, I haven't used it for anything that matters.
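
To illustrate what I mean by that mapping, here's a rough sketch of my own (a toy, not Nix's actual derivation-hashing code): build inputs go in, a fixed-length digest comes out, and that digest is treated as the unique key for the package.

```python
import hashlib

def store_key(build_inputs: bytes) -> str:
    """Map a package's build inputs to a fixed-length key.

    Rough sketch of the idea under discussion, not Nix's actual
    algorithm: Nix's store paths carry a 160-bit, base-32-encoded
    digest; here we just truncate a hex SHA-256 for illustration.
    """
    digest = hashlib.sha256(build_inputs).hexdigest()
    return digest[:40]  # 160 bits of hex, treated as a unique key

# Different input sets get different keys in practice, but the
# codomain is finite, so the mapping cannot be injective.
print(store_key(b"openssl-3.0.13 + gcc-13 + flags"))
print(store_key(b"openssl-3.0.14 + gcc-13 + flags"))
```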

From a fundamental computer science perspective, computers require a systems property, determinism, to run correctly. Put simply: given the same inputs, you never receive more than one set of 'unique' outputs.

This is violated by most common hashing schemes that use fixed-field hashing, as Nix does: they will necessarily always have more than one set of outputs for corresponding inputs. Predicting collisions is of course non-trivial, but the fact remains that collisions exist, and the failure mode is just waiting to happen, with little done to handle collisions. Really, there is little that can be done, because it's a class of problem that computers cannot handle in theory, as they run into decidability and halting problems.

In troubleshooting, the most wasteful use of time for an administrator is tracking down non-deterministic problems, because you have to revert to a guess-and-check approach with incomplete knowledge of the state of the system (i.e. context). You can't write error handling that will make up for the loss of determinism and/or time invariance (which itself requires the absence of a memory property), both of which are required for effective troubleshooting. Writing error handling requires matching a known state or characteristic, which is often unpredictable or absent in non-deterministic problems.

The Nix package manager's design breaks properties required for computation, and a package manager is a primary component of the OS/system: if it breaks, your system breaks.

It's bad system design to rely on chance for whether your system runs or not. That's my two cents.

max-privatevoid commented 1 year ago

Please find a way to cost-effectively produce collisions for every conceivable fixed-length cryptographic hash function, and then your idea might be relevant. You can start with SHA256. Before that, take a look at the birthday paradox formula and calculate how many store paths one would need to have on the system for a 0.00000001% chance of a collision, and consider whether you will ever have that many packages. Nix uses a search space of 2^160.
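
For concreteness, here's a rough back-of-the-envelope version of that calculation, my own sketch using the standard birthday-bound approximation p ≈ n^2 / 2^(d+1), with d = 160 as above:

```python
import math

# Birthday-bound approximation: with n random values drawn from a
# space of size 2**d, the probability of any collision is roughly
# p ≈ n**2 / 2**(d+1) while p is small.
d = 160  # store-path digest size in bits

def collision_probability(n: int, bits: int = d) -> float:
    return n * n / 2 ** (bits + 1)

def paths_needed(p: float, bits: int = d) -> float:
    return math.sqrt(p * 2 ** (bits + 1))

# Store paths needed for a 0.00000001% (1e-10) chance of a collision:
print(f"{paths_needed(1e-10):.3e}")   # ~1.7e+19 store paths
print(collision_probability(10**6))   # ~3.4e-37 for a million paths
```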

than3-bits commented 1 year ago

Max, you seem to lack perspective and appear biased in your response without any real counterargument to what I've said.

Your proposal rests entirely on the flawed assumption that a person needs to fully search the key space, and that relevance depends on collisions being cost-effectively produced, as a demonstration, for the entire fixed-length hash function before any discussion. Neither is true. This is a non sequitur, a strawman fallacy, and wishful thinking.

The idea that you have to show and produce collisions across the entire spectrum of the fixed-length function's output before the point is relevant is rationally absurd, albeit Hegelian in structure.

Nix must demonstrate that it improves upon the state of the art relative to existing solutions to be considered a suitable replacement for further integration in computing systems. It is not the other way around, and to do that it requires the property of determinism to be preserved across the spectrum of outputs.

It cannot demonstrate this, because the fixed-field key space inherently has a many-to-one relationship, while Nix treats it as unique.

Your proposal and its requirements also have no bearing on the fundamental issues of designing for resilience and minimizing the cost of labor: two important factors when considering basic risk management practices for any deployment system, which was the raison d'être for the Nix system to begin with (so its thesis claims).

As I said, these are the reasons why I'll never build a professional system that has a dependency on Nix.

It is bad design when you have a large scope of unpredictable, unknowable, unhandleable failure domain. If you've worked in professional IT environments for any length of time, you know that when you create systems that depend on uptime, your job depends on those systems not going down, and on having a very good, provable reason when they do.

The boss isn't going to care about probability and likelihood.

max-privatevoid commented 1 year ago

You keep babbling about risk management, yet completely fail to consider any sort of likelihood assessment. It's a pretty good troll, honestly. I like it. You should think more holistically about your decision.

hraban commented 8 months ago

@than3-bits I'm too stupid to understand what you're saying so forgive me if I got this wrong--are you talking about sha256 collisions?

than3-bits commented 8 months ago

@hraban Not really. Collisions are a poor expression of the issue; the actual issue is a fundamental property that computers need to preserve to perform any work. There's an important distinction. As a metaphor, think of two goal posts. Everything between the goal posts is a score in football; in computation, it is the predictable, or deterministic, region. The goal posts are a boundary condition.

That property is a bit abstract, but you can test for it just like you test whether the ball is between the two posts for scoring. In the case of the ball, you know that if it's between the two posts then you've scored and met the condition. This can be a positive or negative comparison (score or miss; present or not present). A collision would be the equivalent of anything outside those goal posts (when the property is not present). But unlike in football, the distance outside the goal posts isn't knowable until it happens; this has to do with 'undecidable' problems (in theoretical math). The average person will have no way of measuring it (or even testing for it); we know it's there, but we can't quantify it in any meaningful way except indirectly, because of the nature of the problem. As far as I'm aware, the nature of the problem has to do with the structural parts of number theory, and is somewhat related to the distribution of what we call primes in the real number system; it gets more opaque when you are working in other number systems such as finite fields, where there are even more unknowns.

Finite fields are related to the hashes we are talking about. They are most aptly described as a starting point on a map of a circle, where you travel around the circle some arbitrary number of times, over some arbitrary, changing number of steps, to end up at another point that is your destination.

The information about the full set of directions you followed to reach that destination point is lost, so it's much harder to go in reverse to get back to your starting point. You also can't predict at the start where you'll end up without following those directions. This is a simplification, but just about everyone I've met can understand it. It's also fairly easy to prove that more than one set of instructions gets you to the same destination point (the hash). The paths may converge after a set number of jumps, or diverge to infinity, depending on the field. The paths these take are often referred to in modern algebra (abstract math) as cyclic groups.
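
Here's a toy version of that picture in code (a plain modular walk of my own, not a real hash), just to show the many-to-one part:

```python
# Toy illustration of the "walk around a circle" picture: repeated
# steps in a finite cyclic group (integers mod m) are many-to-one.
m = 12  # twelve points on the "circle", like a clock face

def walk(start: int, steps: list[int]) -> int:
    pos = start
    for s in steps:
        pos = (pos + s) % m  # wrap around the circle
    return pos

# Two completely different sets of directions end at the same point,
# and the destination alone tells you nothing about the route taken.
print(walk(0, [5, 5, 5]))        # 15 mod 12 -> 3
print(walk(0, [7, 8, 12, 12]))   # 39 mod 12 -> 3
```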

This might seem complicated, but all math really does is provide methods to prove that relationships between things hold, describe when they hold (as an element or property in a consistent system), and, by extension, provide you with information that is not immediately apparent (i.e. this property is present, therefore these behaviors are true). Most math taught K-12 is horseshit because it doesn't provide you with the underlying basics and teaches by rote.

There is an ocw.mit.edu course, Signals and Systems (for EE majors), which covers the two properties I mention, and the book they use by Oppenheim is a worthwhile read; it describes them fairly intuitively even if you don't want to touch the math.

Boiled down: computers only operate and do work because we preserve properties that hold true. We individually often don't have to think about it because it's built in by design. The two properties most important for doing work in computation are time invariance (same inputs, same outputs, regardless of when you perform the operation) and determinism; they hold true starting from the signal domain (hardware) up through the various abstractions (that control complexity) in the software stack. Computation requires determinism; time invariance can be somewhat degraded (caching), but it can't be degraded for troubleshooting.
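
As a small, made-up illustration of the difference (my own toy, nothing to do with Nix itself):

```python
_log: list[int] = []

def deterministic(x: int) -> int:
    # Same input, same output, every single time.
    return x * x

def stateful(x: int) -> int:
    # Output depends on hidden, accumulating state (a "memory"
    # property), so the same input yields a different output on
    # each call.
    _log.append(x)
    return x * x + len(_log)

print(deterministic(3), deterministic(3))  # 9 9
print(stateful(3), stateful(3))            # 10 11
```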

Determinism is described as having only one unique possible output for one unique input, or, by extension, for collections of unique inputs. Given the same inputs, it will always produce the same output, every time.

At the detailed level, a computer reads in an opcode (for an operation) and the data to be operated on, and it returns the data from the completion of the operation. It moves to the next instruction, reads in the next opcode, and operates on the data returned/provided. It does this in a loop until you power it off. That's simplified, but that's it in a nutshell.
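
As a toy sketch of that loop (a made-up accumulator machine of my own, not any real instruction set):

```python
# Toy fetch-decode-execute loop: a tiny accumulator machine.
# Each step reads an opcode and an operand and updates the state
# in exactly one way, so the run is fully determined by the program.
program = [
    ("LOAD", 5),   # acc = 5
    ("ADD", 3),    # acc = 8
    ("MUL", 2),    # acc = 16
    ("HALT", 0),
]

def run(program):
    acc, pc = 0, 0
    while True:
        op, arg = program[pc]          # fetch + decode
        if op == "LOAD":
            acc = arg
        elif op == "ADD":
            acc += arg
        elif op == "MUL":
            acc *= arg
        elif op == "HALT":
            return acc                 # stop the loop
        pc += 1                        # move to the next instruction

print(run(program))  # 16, every single time
```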

If you were to represent each state that this machine is in at any given time as a circle, and connect that circle with a line to the next state it moves to, you'd see there is only ever one line going from each circle to the next. These circles and edges (lines) are used in graph theory (discrete mathematics). The predictability of this path to the next leg/operation is determinism. You can have a computer run chaotically, but you can't expect it to do anything useful; it needs a guide it can follow.

The property is so fundamental that it touches on a lot of things, including exception or error handling (in code). To do something, the instructions must be present and predictable, and error handling is simply checking for a condition and halting, or handling the error in a prescribed way. You check for a known state and throw an error if it's present, handle it, or halt; that's error handling.
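
In code, "check for a known state, then handle or halt" looks roughly like this (a generic sketch of my own, nothing Nix-specific):

```python
def read_port(config: dict) -> int:
    # Error handling is just matching known, predictable states:
    # the key is missing, or the value is not a valid port number.
    if "port" not in config:
        raise KeyError("port is not set")            # known state: absent
    port = config["port"]
    if not isinstance(port, int) or not (0 < port < 65536):
        raise ValueError(f"invalid port: {port!r}")  # known state: malformed
    return port                                      # known good state

print(read_port({"port": 8080}))  # 8080
```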

At its core, a computer operates very dumbly, having only one path it can possibly take for any given set of inputs. It runs in an infinite loop until you power it off, but that loop is controlled and predictable. The nitty-gritty details of the theory are covered in a compiler design course (or the dragon book if you self-study). Any CS major who legitimately earned their degree knows this.

We can use control flow (if-then statements) to evaluate and determine which of a set of paths execution goes down, but there is only ever one path taken so long as the property holds true.

At its most basic, the required property/path acts like a rail for the computer to follow. When it has no rail it runs into theoretical problems described by Shannon, Turing, and others. If you'd like further background on those problems, I'll refer you to their works as it is too much to explain all the ways computers fail in a post like this, and they describe those fundamental problems better than I will here.

What's important to know is that there are many problems that fundamentally cannot be determined upfront or characterized (described, except indirectly); they may also require context a computer (Turing machine) cannot evaluate (limitations of computation).

There are problems a computer cannot solve because of fundamental limitations (automata theory). There are problems in math that are "undecidable": there is no algorithm (process) that can provide a definite answer for all instances of the problem, which would be needed to predict with certainty.
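
The classic example is the halting problem; the standard argument can be sketched like this (a sketch of the usual proof, with the decider deliberately stubbed out because it cannot exist):

```python
# Sketch of the standard halting-problem argument: suppose we had a
# perfect decider halts(f, x) -> bool. The program below would then
# contradict it, so no such decider can exist for all programs.

def halts(f, x) -> bool:
    # Hypothetical perfect decider -- cannot actually be written.
    # Stubbed out so this file runs; the point is the argument below.
    raise NotImplementedError("no algorithm decides this for all f, x")

def paradox(f):
    if halts(f, f):
        while True:      # if f(f) would halt, loop forever
            pass
    return "halted"      # if f(f) would loop, halt immediately

# paradox(paradox) halts iff it doesn't halt -- a contradiction,
# so halts() cannot exist as a total, always-correct algorithm.
```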

Without recognizing these known facts, you often see people trying to be clever when they actually misapply techniques in logic, and imply by assumption that when you can't provide an answer you are wrong; it seems like they are right when they are totally wrong. Logical reasoning is rational, flawed reasoning is irrational, and you don't argue with irrational people. The consequences, in the form of failures, eventually catch up to them.

Those convincing but flawed reasoning structures are called fallacies; philosophy classes cover them in depth. Proof by contradiction is just one of many acceptable ways to prove a claim incorrect, but the opposite is not necessarily true, and in most cases means nothing. Proof by contradiction is not the only way to prove something; there are also proofs by induction. Setting up a "provide a contradiction or you are not correct" demand is almost certainly a false dichotomy, except in very narrow circumstances.

I didn't feel the need to call Max out on that, as any educated person would have seen the flaw, and that flawed line of reasoning has been contradicted by actual instances in cryptography several times over the past few decades, where effective attacks were found that broke an algorithm (computationally faster than brute force); prior to those attacks, brute force was assumed to be the best possible.

You can introduce non-determinism in software by overloading the tokens the computer operates on; this includes the absence of a token, which is a token itself, such as a null state. If they have multiple meanings that fall outside that 1:1 relationship, determinism fails.
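
A tiny made-up example of that kind of token overloading (my own illustration, not from any particular tool):

```python
import csv, io

# An empty field here is overloaded: it can mean "has no middle name"
# or "middle name not recorded". Once the rows are written out, that
# distinction is gone, and no downstream code can recover it.
rows = [
    ("Ada", "", "Lovelace"),      # genuinely has no middle name here
    ("Alan", "", "Turing"),       # middle name simply not recorded
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
# Both rows now carry the same token (""), two meanings flattened
# into one -- the 1:1 token/meaning relationship is broken.
```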

Computers fail at this point in completely unknowable ways that cannot be predicted: the program may run forever, it may error with an unrelated error, it may halt, it may provide wrong but innocuous output in place of the expected right answer (and how are you to know it's correct if you don't know the system internals intimately?). The system will work predictably right up until you hit that boundary condition, which can't really be checked in any authoritative way by the computer itself, and exactly when and where those boundary conditions are hit, we can't know upfront, given what we currently know about the cyclic/finite-field structures involved. All the material I've read to date suggests this may be an undecidable problem.

Granular error handling requires determinism. Without it you get error messages that don't point to a root cause.

Troubleshooting only works because of time invariance, and by probing parts of code while holding inputs and outputs constant (determinism). Guess-and-check isn't troubleshooting. If you get two different outputs for the same inputs, you can stop trying to troubleshoot and know you'll have to guess and check manually. Any professional IT person who's worked in operations for a significant amount of time knows this at an intuitive level. The fundamental reason is that those properties are absent or do not hold true.

There are many flawed arguments in hashing and cryptography; one is that collisions are evenly distributed. They aren't; they can clump, which is why we often use fractional primes in hashing algorithms, because it seems to spread the collisions out. But as far as I'm aware there's no proof that using fractional primes imparts that property, aside from actual demonstrations, which may not hold in the general case.
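
For what it's worth, I believe the technique being alluded to is something like Knuth-style multiplicative hashing with a golden-ratio prime; a minimal 32-bit sketch of my own:

```python
# Knuth-style multiplicative hashing: multiply by a constant close to
# 2**32 / phi (the golden ratio) and keep the top bits. 2654435761 is
# the prime commonly used; the irrational-ratio constant is what
# spreads clustered keys across buckets.
GOLDEN_PRIME = 2654435761

def bucket(key: int, bits: int = 8) -> int:
    h = (key * GOLDEN_PRIME) & 0xFFFFFFFF   # 32-bit multiplicative mix
    return h >> (32 - bits)                 # take the top `bits` bits

# Sequential keys (a clumpy input) still land in well-spread buckets:
print([bucket(k) for k in range(10)])
```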

Probabilities and likelihoods are often abused to support things that are not true, and often fail to represent conclusions correctly; the most common error is that the outputs depend greatly on the choice of priors, so people often work backwards and use them to support whatever they are trying to say (cognitive bias, or deception).

Probabilities and likelihoods have a very poor track record of telling you, with any kind of accuracy, when something will actually happen. Bayes' theorem is also often misapplied. I don't put much stock in arguments based on probabilities and likelihoods, given the lack of provable credibility and the all-too-common use of them to support and promote false conclusions.

Non-deterministic problems require manual effort, and full, intimate knowledge and control of the system, to identify where the failure is; a normal, average person using the software wouldn't be able to fix anything. The only reason you know there's a problem is that it's not functioning as expected, and computers can't solve those types of context problems because they can only operate on DFA structures (this requires determinism; see automata theory; this also includes regex). Humans can reason about the larger group of NFA structures (where two things can mean the same thing), and sometimes break them down into DFA structures.
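
For reference, here is what a DFA looks like concretely (a toy of my own that accepts binary strings with an even number of 1s):

```python
# A toy DFA: accepts binary strings containing an even number of 1s.
# "Deterministic" here means each (state, symbol) pair has exactly
# one next state, so a computer can follow it mechanically.
transitions = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

def accepts(s: str) -> bool:
    state = "even"
    for ch in s:
        state = transitions[(state, ch)]   # exactly one outgoing edge
    return state == "even"

print(accepts("1001"))  # True  (two 1s)
print(accepts("10"))    # False (one 1)
```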

If you want a perfect practical example of a non-determinism problem, take a look at Linux's ldd utility. Use ldd on sshd, and use grep to filter out all but the libraries it uses. Examine the results: do they include everything they are supposed to? No?

There's no way they will, short of patching the original software to fix that 1:1 property for the output tokens. The tokens used can mean two different things for the same input, depending on context that isn't passed along. No automation that gets passed the output of ldd as input can work correctly while that property is broken, but it will still give you an answer (albeit an incorrect/incomplete one). It would have to be patched, since the file being read in is considered part of the inputs. There are three ways it can fail, but only one meets the boundary condition in this case. It operates line by line and flattens the columns non-deterministically, based on whether a column is empty or not (null). You would have to reconstitute the columns correctly and set a unique placeholder, but the former requires information that has already been lost.
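
Here's a rough sketch of the kind of naive parsing I mean (the ldd-style lines are made up for illustration, not from a real run):

```python
# Illustrative ldd-style output (made up, not from a real run).
# Note the columns shift: some lines have "soname => path", some have
# only a path, and unresolved libraries have "=> not found".
ldd_output = """\
    linux-vdso.so.1 (0x00007ffc2a1f0000)
    libcrypto.so.3 => /usr/lib/libcrypto.so.3 (0x00007f1a2c000000)
    libwrap.so.0 => not found
    /lib64/ld-linux-x86-64.so.2 (0x00007f1a2c600000)"""

def naive_libs(text: str) -> list[str]:
    # Naive approach: take whatever follows "=>" as the library path.
    libs = []
    for line in text.splitlines():
        if "=>" in line:
            libs.append(line.split("=>")[1].split()[0])
    return libs

print(naive_libs(ldd_output))
# ['/usr/lib/libcrypto.so.3', 'not']
# The vdso and the dynamic loader are silently dropped, and the
# unresolved library shows up as the bogus path 'not' -- the parser
# still returns an answer, just not a complete or correct one.
```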

Hopefully this gives you, and anyone else reading, an intuitive understanding of how computers actually work. Petzold wrote a gentler introduction in his book Code; if this explanation confuses you, that might be worth a read.

To wrap up: there are system properties that computers require, and they fail when those properties aren't present; you can test for a property to get useful information about the problem domain (deterministic problems cost less to track down); and there are types of problems that cannot be predicted, known, or characterized, and problems that cannot be evaluated by computers themselves.

Allowing non-determinism in a system that determines whether the computer can be brought up to run computation at all is setting up anyone using it for an avalanche of problems later, problems that are thrust on the people least able to competently resolve them. That's a problem.

hraban commented 8 months ago

Let's assume for a second we lived in a universe where sha256 was somehow magically injective (just imagine with me for a second). Would the than3-bits in that universe be totally happy? Or would there be another problem?

than3-bits commented 8 months ago

@hraban I hate to say this, but you are absolutely right about what you said about yourself in the introduction. You spoke the truth, but in goodwill I hoped it was a false belief of yours. I can see now it isn't, but it may not be permanent or uncorrectable. Stupid, just like being poor, is a state of mind, and learning and understanding can overcome that state, though few do. It is always better to be broke than poor. I couldn't be more disappointed right now, but that's life and I'll get over it.

I must warn you though, true evil has a way of twisting and eating its own, and one only becomes evil by blinding themselves to it, becoming subordinated by it, and accepting it without resistance. This can only occur through repeated acts of self-corruption and self-violation; calling yourself stupid, idly justifying unjustifiable actions, and seeking to enroll others through reprehensible inducements or other tricks of deceit. I earnestly fear for your immortal soul given what I've seen. I hope you'll turn it around.

There's no point in answering your question, as that is both what you hoped I'd do and an abuse of the contrast principle with a little manipulative trap... this deserves no further response. It's clear that you had no intention of goodwill when you framed that response, and you've shown your true colors. What one does in small things, they do in big things.

I'm not inclined to provide anything to people that act malevolently, and you burned what little credibility you had. It was blatantly manipulative and unsophisticated with no beneficial point.

If you persist in doing the same towards others, you are going to find the hard truth that others may not be so forgiving, or even in control of their impulses; especially since the natural reversal you intended to induce through your actions can result in irrational, and sometimes even violent, behavior towards the source of the perceived manipulation. I believe Robert Lifton covered this quite well in his book of case studies on the Psychology of Totalism.

I hope you'll turn back towards a non-destructive path, but that's up to you to decide. I'll pray for your salvation nonetheless.

Everyone has their own journey and choices to make, based in a shared common reality. There will always be those who seek to beneficially enrich others' lives in goodwill, and those who seek to destroy.

I'm closing this now as there can be no further discussion, rational or otherwise. I'll leave you to reflect. Cheers.

eclairevoyant commented 1 month ago

Yes indeed, true evil lies in sha256 checksums. Well, please do feel free to provide info about your 100% infallible system with no tradeoffs whatsoever, I'd love to switch to it! And if your company is hiring, can you send a link to their jobs page? I'd love to work for a boss who "isn't going to care about probability and likelihood", but instead will pour millions into developing a perfect system from scratch.