llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.03k stars 11.97k forks source link

[lldb] Migrate from MD5 to a (nonbroken) cryptographically secure hash for remote protocol #89271

Open emaste opened 6 months ago

emaste commented 6 months ago

MD5 is susceptible to collisions and is becoming increasingly so, and is thus not appropriate where a cryptographically secure hash is required. Even if the use case in LLDB is unlikely to be subject to a practical attack there is some potential for workflow confusion due to hash collisions - determining that files are identical and thus do not need to be transferred does require a good hash.

MD5 has been known to be weak for quite some time, but increasingly simple collisions are being found - for example, these two 72-character ASCII strings differ by one byte, but have identical MD5 hashes:

TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak ^ note difference here (from https://twitter.com/realhashbreaker/status/1770161965006008570)

This issue is a followup from https://github.com/llvm/llvm-project/pull/88812 and in particular https://github.com/llvm/llvm-project/pull/88812#issuecomment-2061565125 based on a response to my comment there. I had assumed that the use of MD5 originated with GDB, but the investigation in the pull request shows it was added in https://github.com/llvm/llvm-project/commit/e0f8f574c7f41f5b61ec01aa290d52fc55a3f3c9 and is our extension.

As mentioned in the linked comment we can add a vFile:betterHash (or a more general hash:<kind>) and try to use that, falling back to MD5 if the remote lacks support.

llvmbot commented 6 months ago

@llvm/issue-subscribers-lldb

Author: Ed Maste (emaste)

MD5 is susceptible to collisions and is becoming increasingly so, and is thus not appropriate where a cryptographically secure hash is required. Even if the use case in LLDB is unlikely to be subject to a practical attack there is some potential for workflow confusion due to hash collisions - determining that files are identical and thus do not need to be transferred does require a good hash. MD5 has been known to be weak for quite some time, but increasingly simple collisions are being found - for example, these two 72-character ASCII strings differ by one byte, but have identical MD5 hashes: `TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak` `TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak` ` ^` note difference here (from https://twitter.com/realhashbreaker/status/1770161965006008570) This issue is a followup from https://github.com/llvm/llvm-project/pull/88812 and in particular https://github.com/llvm/llvm-project/pull/88812#issuecomment-2061565125 based on a response to my comment there. I had assumed that the use of MD5 originated with GDB, but the investigation in the pull request shows it was added in https://github.com/llvm/llvm-project/commit/e0f8f574c7f41f5b61ec01aa290d52fc55a3f3c9 and is our extension. As mentioned in the linked comment we can add a `vFile:betterHash` (or a more general `hash:<kind>`) and try to use that, falling back to MD5 if the remote lacks support.
DavidSpickett commented 6 months ago

My comment from the PR:

If it were a GDB packet it would be in https://sourceware.org/gdb/current/onlinedocs/gdb.html/Host-I_002fO-Packets.html#Host-I_002fO-Packets but I see no sign of it, or in GDB's sources.

The first appearance of vFile:MD5 is e0f8f57 though it is not documented as one of our extensions in https://github.com/llvm/llvm-project/blob/main/lldb/docs/lldb-platform-packets.txt.

So a good follow up PR would be to list it in that document. It's self explanatory but still, weird that it's not there.

This means we could add vFile:betterHash (or a more general hash:) and try sending it to the remote. If the remote responds that it doesn't support it, we ask it for MD5. Though we may want a way to say don't use MD5 even if the remote would accept it, if security is the concern (a bit like ssh key formats work).

The hash types we have in llvm/include/llvm/Support/ already are:

It's always possible that there is a collision, but of course moving off of MD5 would reduce the chances. As the hash can't tell us if the file is 100% the same file, but we're working over a remote link so we can't compare the files directly as it would defeat the point of the check.

We could also use of vFile:size: https://github.com/llvm/llvm-project/blob/851462fcaa7f6e3301865de84f98be7e872e64b6/lldb/docs/lldb-platform-packets.txt#L261 Or ask for MD5 and the new type.

Though I suspect that those ideas wouldn't reduce the chances much, the big win comes from changing hash type.