mratsim opened this issue 6 years ago
Mentioned by @arnetheduck: C++17 introduced a distinct type called byte:
enum class byte : unsigned char {} ;
(since C++17)
std::byte is a distinct type that implements the concept of byte as specified in the C++ language definition.
Like char and unsigned char, it can be used to access raw memory occupied by other objects (object representation), but unlike those types, it is not a character type and is not an arithmetic type. A byte is only a collection of bits, and only bitwise logic operators are defined for it.
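For comparison, here is a rough sketch of the same idea in Nim: a distinct byte type that only exposes bitwise operators. The name RawByte and the operator set are illustrative, not an existing or agreed-upon API:

type RawByte* = distinct uint8  # bits only, no arithmetic

proc `and`*(a, b: RawByte): RawByte = RawByte(uint8(a) and uint8(b))
proc `or`*(a, b: RawByte): RawByte = RawByte(uint8(a) or uint8(b))
proc `xor`*(a, b: RawByte): RawByte = RawByte(uint8(a) xor uint8(b))
proc `not`*(a: RawByte): RawByte = RawByte(not uint8(a))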
Yes! This has been annoying me for quite a while now. Another thing that should be taken into consideration is reading data from a stream and writing to one. Whatever solution this comes to, the corresponding base type should be added to the streams module as well, so as to facilitate easy input -> en-/decrypt -> output.
Personally I agree with the rationale offered, and while the DSP argument is valid, Nim already has things like int which can take on multiple sizes. Doing the same for byte makes sense. I could write a crypto procedure that takes byte8 and the compiler will check that code using byte actually has the right-sized byte.
Another consideration for the API is streaming - for this it's customary to have a state that goes through init, update (many times) and finalization. There can then be convenience helpers that call these three functions for non-streaming cases.
I would vote for: proc hash1234(s: string): Hash = ... In Nim (e.g. in the stdlib), string represents not only text but also binary data.
Hash should be distinct array[N, byte] - byte is already defined as an alias to uint8. We should not worry about DSP architectures; it is very unlikely Nim will be used on one of them.
What's the rationale for having a distinct Hash type? Couldn't we simply get away with proc hash(s: string): string?
As far as representing binary data in Nim is concerned, I agree with @zielmicha: string should be used.
This byte vs uint8 vs char vs string/seq problem does not belong exclusively to the crypto/hash libraries. Graphics libraries suffer from the same problem: how to consistently represent pixel data for a graphics API.
Let's say we reach consensus here and adopt a common standard. The problem is still there: output from another library, e.g. a graphics library, that we feed into a crypto/hash library still needs to go through some sort of conversion, and this additional conversion step might not be acceptable under some circumstances.
Therefore I suggest we use Nim features such as generics or templates to create a flexible API capable of accepting various inputs and producing various outputs. I know this involves quite a lot of work and requires more elaborate tests, but at least it can prevent a future war on types.
@dom96 and @zielmicha, why reuse the string type for something that's not intuitively a string? Nim has a nice powerful type system, so there is no need to represent two different things with the same type. Differentiating them means that we could even do nice things like having a $ for byte data that outputs hexadecimal, and we can avoid passing a string to something that takes a byte array and vice versa. I would say we should go in the opposite direction and specify a bytes type alias in the standard library that is actually a seq[byte] or an openarray[byte], or even a concept of an indexable container of bytes, to really make it clear that it is the preferred way of representing byte data.
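A minimal sketch of what such an alias and an indexable-container concept could look like; the names bytes and ByteContainer are only illustrative, not an agreed standard:

type
  bytes* = seq[byte]

  ByteContainer* = concept c
    c.len is int
    c[0] is byte

Anything matching ByteContainer (seq[byte], array[N, byte], a custom buffer type) could then be accepted by hashing procs without committing to one concrete container.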
@PMunch Then you can't do hash123("hello world").
The underlying problem is that there is no definite distinction what is a binary string and what is a text string.
Python 3 has the same problem. Note how it needs separate stream types for text and binary IO.
I agree with PMunch, it's not C, we shouldn't use char * (string) for data blobs. I also like $ outputting hex data by default for openarray[byte].
@jangko not ideal but casting works until everyone migrates to the same standard
Note that @zah started a library just to deal with this issue when receiving data from external sources: https://github.com/status-im/nim-ranges.
@zielmicha
proc hash123(s: openarray[byte]): array[N, byte] =
  ...
proc hash123(s: string): array[N, byte] =
  ...
proc hash123_fromHex(s: string): array[N, byte] =
  ...

Also maybe it would be worth it to introduce a hex package in the stdlib, with:

type HexString = distinct string
@zielmicha basically what @mratsim said, but you could also have a converter from string to bytes to make strings work everywhere that bytes work, while still keeping bytes a separate type. @mratsim's example of what multiple hash functions would look like is shockingly similar to how Ada does it. The Bytes type there is declared in another part of the library, but that library implements all kinds of different crypto algorithms, something that Nim has in multiple different modules. Ada is known for its type safety and I think taking a page from their book on this is a good way to go.
The underlying problem is that there is no definite distinction what is a binary string and what is a text string.
Couldn't agree more, and this is what we now have the possibility to add to Nim. By having a bytes type we can easily make the distinction clear, both to the programmer, and to the reader.
@Araq's feedback on IRC: input type openArray[byte]; output type array[X, byte], where X depends on the concrete hashing algorithm.
Just last night, I was thinking of writing a Nim API to abstract the C interop in nimssl, which is a wrapper for OpenSSL. Right now, only the SHA and AES portions are wrapped.
Based on what is agreed here, the API can be adjusted accordingly. The underlying implementation returns an array of chars, which I cast and then convert to a hex string.
Mind adding https://nimble.directory/pkg/libsodium https://github.com/federicoceratto/nim-libsodium to the list?
Here is my proposal. A_ here stands for the concrete algorithm (SHA256, etc.).
type
  A_Digest* = distinct array[X, byte]
  A_Context* = object ## accumulator state
    # fields here

proc init*(a: var A_Context) = ...
proc update*(a: var A_Context; bytes: openArray[byte]) = ...
proc finish*(a: var A_Context): A_Digest = ...
proc toHex*(a: A_Digest; upperCase: bool = false): string = ...
This is the minimal protocol a hashing algorithm should support. The API introduces no overhead as far as I can see and it supports "streaming". On top of that, either via a concept or via dynamic binding, a dispatcher can be implemented, but first we also need a convenience template that is algorithm agnostic:
template hashBytes*(algorithm: typedesc; bytes: openArray[byte]; upperCase = false): string =
  var context: algorithm
  init(context)
  update(context, bytes)
  toHex(finish(context), upperCase)
And let's also have a toBytes view for string:
template toBytes*(x: string): seq[byte] =
  cast[seq[byte]](x)
Thankfully seq[byte] can be passed to openArray[byte] without a copy.
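A hypothetical usage sketch of this protocol, assuming some concrete Sha256Context implements init/update/finish/toHex as described above (the context name is illustrative):

let msg = "abc".toBytes              # seq[byte] view of the string
echo hashBytes(Sha256Context, msg)   # convenience: hex digest in one call

var ctx: Sha256Context               # streaming use
init(ctx)
update(ctx, msg)
echo toHex(finish(ctx))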
If we are going to define a minimal set of types/concepts in the standard lib, I think we should define a single Digest type with a parametric compile-time size.
type
  MDigest*[bits: static[int]] = object
    data*: array[bits div 8, uint8]
Currently, we have such a type in Nimcrypto: https://github.com/cheatfate/nimcrypto/blob/master/hash.nim#L6
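For illustration, instantiating such a parametric digest type at different sizes would look like this:

var d256: MDigest[256]   # 32 bytes of storage
var d512: MDigest[512]   # 64 bytes of storage
doAssert d256.data.len == 32
doAssert d512.data.len == 64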
Regarding the use of openarray in the signature of update: in our keccak lib, I've chosen to use a custom MemRange type that can be manually constructed from more input data types (not just from strings, seqs and arrays, which openarray requires):
https://github.com/status-im/nim-ranges/blob/master/ranges/memranges.nim
My long-term vision is that this MemRange type will become a converter concept and the user will be able to provide their own toMemRange proc for custom types. Perhaps openarray can gain some options for manual construction as well, which would eliminate the need for a separate type, but right now, in networking code in particular, it's crucial that we are able to construct a MemRange for a sub-slice of the bytes arriving from the network.
Please note that Nimcrypto already has a function similar to the hashBytes one proposed here:
https://github.com/cheatfate/nimcrypto/blob/master/hash.nim#L17
The difference is that it returns a fixed-size Digest struct, which can be then converted to string if the user needs a string representation.
The only reason why openarray[T] is not an option is its limitations. For example, openarray[T] cannot be used as an argument in an iterator/closure iterator procedure. Also, you are not able to create a new openarray[T] from data you have; you can only use it as an argument. That's why MemRange is the only usable solution.
Perhaps openarray can gain some options for manual construction as well, which would eliminate the need for a separate type, but right now, in networking code in particular, it's crucial that we are able to construct a MemRange for a sub-slice of the bytes arriving from the network.
We definitely need a way to construct an openarray from the underlying (pointer, length) pair. Probably yet another builtin for system.nim. It's true that my proposal depends on this missing piece.
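Until such a builtin exists, a copying workaround for adapting a raw (pointer, length) pair to the openArray-based API is easy enough to write (purely illustrative, and it defeats the zero-copy goal):

proc copyToBytes(p: pointer; len: int): seq[byte] =
  # copies `len` bytes starting at `p` into a fresh seq
  result = newSeq[byte](len)
  if len > 0:
    copyMem(addr result[0], p, len)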
The difference is that it returns a fixed-size Digest struct, which can be later converted to string if the user needs a string representation.
Well, that is easily done in my proposal too, but it seemed rarely useful. In the end the different algorithms return different object types, so if I need some common result representation it's often a string.
People want strings on their screen, but our code strongly prefers the tightly-packed bytes of the Digest structure (for storing them in memory and databases, for putting them in binary-encoded network messages and so on).
We definitely need a way to construct an openarray from the underlying (pointer, length) pair. Probably yet another builtin for system.nim. It's true that my proposal depends on this missing piece.
We also need a way to get sub-slices of openarrays and a way to keep them around in memory in async code (as @cheatfate mentioned). In our project, I've currently also defined a ByteRange object that cheats by storing a reference to a shallow sequence, but I hope to optimize this in the future by either implementing a native garbage-collected range type or by using the new destructors run-time to manage the memory manually.
Here is the ByteRange type:
https://github.com/status-im/nim-rlp/blob/master/rlp/types.nim#L7
But clearly, this problem is separate from the interface issue discussed here. With "manual" openarray construction it will be possible to adapt any range implementation to the proposed interface.
My sysrandom library also suffers from this kind of problem. Currently it can write the random bytes into a pointer, an array of static[int] size, or a seq.
Having an agreed-upon standard type would be very useful IMO. Node.js, for example, has the Buffer type for this kind of purpose.
Would a binarytypes or cryptotypes package that just defines the bare minimum help with standardization across packages?
type
  MDigest*[bits: static[int]] = object
    # Message digest result of any hash function
    data*: array[bits div 8, uint8]
type HexString = distinct string

proc `$`[bits: static[int]](d: MDigest[bits]): HexString =
  const hexChars = "0123456789abcdef"
  var s = newString(2 * (bits div 8))
  for i in 0 ..< bits div 8:
    s[2*i] = hexChars[int(d.data[i] shr 4 and 0xF)]
    s[2*i + 1] = hexChars[int(d.data[i] and 0xF)]
  result = HexString(s)
proc readHexChar(c: char): byte {.noSideEffect, inline.} =
  ## Converts a hex char to a byte
  case c
  of '0'..'9': result = byte(ord(c) - ord('0'))
  of 'a'..'f': result = byte(ord(c) - ord('a') + 10)
  of 'A'..'F': result = byte(ord(c) - ord('A') + 10)
  else:
    raise newException(ValueError, $c & " is not a hexadecimal character")
proc hexToMDigest*[bits: static[int]](hexStr: string): MDigest[bits] {.noSideEffect, noInit.} =
  ## Read a hex string and store it in a message digest
  var offset = 0
  if hexStr[0] == '0' and (hexStr[1] == 'x' or hexStr[1] == 'X'):
    inc(offset, 2) # Ignore 0x and 0X prefix
  assert hexStr.len - offset == 2 * (bits div 8)
  for i in 0 ..< bits div 8:
    result.data[i] = hexStr[offset + 2*i].readHexChar shl 4 or
                     hexStr[offset + 2*i + 1].readHexChar
proc hexToSeqByte*(hexStr: HexString): seq[byte] {.noSideEffect.} =
  ## Read a hex string and store it in a sequence of bytes in big-endian order
  let s = string(hexStr)
  var offset = 0
  if s[0] == '0' and (s[1] == 'x' or s[1] == 'X'):
    inc(offset, 2) # Ignore 0x and 0X prefix
  let N = (s.len - offset) div 2
  result = newSeq[byte](N)
  for i in 0 ..< N:
    result[i] = s[offset + 2*i].readHexChar shl 4 or
                s[offset + 2*i + 1].readHexChar
proc toBytes(s: string): seq[byte] {.inline.} =
  cast[seq[byte]](s)
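A quick usage sketch of the helpers above (the literal values are just examples):

doAssert "abc".toBytes == @[97'u8, 98, 99]
doAssert hexToSeqByte(HexString("0xdeadbeef")) == @[0xde'u8, 0xad, 0xbe, 0xef]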
MDigest can be called Buffer or MemRange, of course.
Actually, I really like the (stack) Buffer type idea of @euantorano; it can then be customized for the needs of graphics, crypto, networking and multi-precision maths.
@FedericoCeratto, @genotrance, @cheatfate I've updated the original post with libsodium (input/output: string), nimssl (input/output: ptr cuchar + len) and nimcrypto (input: ptr uint8 + len) (further overloads coming to nimcrypto)
corresponding C++ proposal for span / memrange: http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0122r1.pdf
the Buffer looks like it owns memory and makes decisions about where that memory lives (heap) - for input to hash functions, that's not really necessary - a view will do just fine. for output, hashes are fixed-length.
if doing crypto functions instead (en/decrypt), it might make sense to have the caller manage memory so as not to make decisions for them - that would mean an api that takes an openarray-like input and a c++-span-like output (or streams, though that can usually be built as a layer on top).
the same kind of api would be good for compression too, and a wide range of other use cases, as others have noted.
There is some discussion about views/memranges here: #5753
It was settled that openarray[T] will have this role (#5957).
Tagging @krux02 as he already requested C++ span/view/memrange/slice/openarray here https://github.com/nim-lang/Nim/issues/5437
In the wild: span, open arrays, slices ("Slices are a view into a block of memory represented as a pointer and a length").
@mratsim Thanks for the invite. I will now also add my two cents. I just started reading this thread, and yes I agree, a consistent hashing interface is important. I like the interface that Araq suggested. It looks a lot like the old hashing interface, but without the ugly hashing operator definitions. In my opinion the hashing operators are a bad example for operator overloading/definition.
I just don't agree with the suggested convenience template. It is called hashBytes, but the argument already says that it hashes bytes, twice: in the name of the argument and in its type. So that part is redundant, and it returns a string that holds a hex representation, which is not obvious from anything at all. So I would just call it hashToHexStr and make it a procedure. But it was probably just meant as an example and discussing its name is pointless.
The only case that I would like to throw into the pot to think about is to not make runtime branching on hashing algorithms easy. But with the proposed interface it should work nicely to implement something like this:
proc hash(hashKind: HashKind; bytes: openarray[byte]; dst: var openarray[byte]): string =
  case hashKind
  of ...
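For concreteness, a hedged variant of the sketch above with the case statement filled in; Sha256Context and Keccak256Context are assumed to implement the init/update/finish/toHex protocol proposed earlier, and HashKind is illustrative:

type HashKind = enum hkSha256, hkKeccak256

proc hashToHexStr(kind: HashKind; bytes: openArray[byte]): string =
  # runtime dispatch over hypothetical context types
  case kind
  of hkSha256:
    var ctx: Sha256Context
    init(ctx)
    update(ctx, bytes)
    result = toHex(finish(ctx))
  of hkKeccak256:
    var ctx: Keccak256Context
    init(ctx)
    update(ctx, bytes)
    result = toHex(finish(ctx))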
Looking back at some of the older messages, in reference to @Araq's reply, I'm incredibly surprised that this is valid Nim code:
template toBytes*(x: string): seq[byte] =
  cast[seq[byte]](x)
Is it really valid? Are sequences and strings really that similar?
My reasoning for preferring string is very simple: C compatibility. Isn't it the case that most C libraries return binary data in the form of a char *?
I was under the impression that converting this data into a seq[byte] would have a significant overhead, but it appears that is not the case. So yeah... let's adopt this approach and make working with binary data easy (it's currently quite a PITA; ideally it would be nice to have something like Python's struct module in the stdlib... but I'm getting offtopic here).
@krux02 will your case work at compile time? I would separate the HashKind and the string conversion.
proc hash[H: HashKind](bytes: openarray[byte]): HashKind =
  when H is: ...
@dom96, byte = char (from the C standard: a char and a byte are the smallest addressable unit of memory) and in 99% of cases char = uint8 (except the DSPs I mentioned).
Edit: In the past, openarray[char] used to match string input and caused problems. They are indeed the same.
Also, for C compatibility we have cstring and pointer.
I already said it in the chat, but I will repeat it here as well. When you abuse a string as binary data, be aware that you are in fact abusing the string type. Never ever expose this abuse in the public API. And to make sure you never expose this abuse in the public API, just don't use it internally either.
Preferring string for C compatibility is wrong on several levels. First of all, when you wrap a C library, provide a nice Nim wrapper. Please don't expose C types in the Nim API. If you do, you force everybody who uses that library to put the conversion at the usage site, and that is just not very nice. Besides, string is not any more or less compatible with C than any other type in Nim.
the name hash tends to be used for a non-cryptographic hash that's suitable for dictionaries - is there a need for some distinction there?
As a first step we released a byteutils package to deal with byte <-> hex conversion to and from seq[byte] and openarray[byte].
Also, regarding @arnetheduck's remark, we settled so far on the MDigest (Message Digest) type for crypto hashes, defined in nimcrypto here.
Basically all hashes provide the following interface:
- an MDigest type (array of byte)
- a HashType / HashKind (sha2, sha3, AES ...)
- a Context for incremental hashing
- HashType.digest(foo) for stateless/context-free hashing

type
  MDigest*[bits: static[int]] = object
    data*: array[bits div 8, byte]

  bchar* = byte | char
proc `$`*(digest: MDigest): string =
  result = ""
  var i = 0'u
  while i < uint(len(digest.data)):
    result &= hexChar(cast[byte](digest.data[i]))
    inc(i)
proc digest*(HashType: typedesc, data: ptr byte,
             ulen: uint): MDigest[HashType.bits] =
  mixin init, update, finish, clear
  var ctx: HashType
  ctx.init()
  ctx.update(data, ulen)
  result = ctx.finish()
  ctx.clear()

proc digest*[T](HashType: typedesc, data: openarray[T],
                ostart: int = 0, ofinish: int = -1): MDigest[HashType.bits] =
  mixin init, update, finish, clear
  var ctx: HashType
  let so = if ostart < 0: (len(data) + ostart) else: ostart
  let eo = if ofinish < 0: (len(data) + ofinish) else: ofinish
  let length = (eo - so + 1) * sizeof(T)
  ctx.init()
  if length <= 0:
    result = ctx.finish()
  else:
    ctx.update(cast[ptr byte](unsafeAddr data[so]), uint(length))
    result = ctx.finish()
  ctx.clear()
In HashType.bits, bits is a static[int], as shown below. Here is the definition of a HashType, for example sha3/keccak:
type
  KeccakKind* = enum
    Sha3, Keccak, Shake

  KeccakContext*[bits: static[int],
                 kind: static[KeccakKind]] = object
    q: array[25, uint64]
    pt: int
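A hedged usage sketch of this interface; the sha256 alias and the exact overloads are assumed to match nimcrypto, so names may differ slightly:

let msg = "hello world"
let d = sha256.digest(msg)     # stateless, via the openarray overload above
echo $d                        # hex representation

var ctx: sha256                # incremental hashing with the same context type
ctx.init()
ctx.update(cast[ptr byte](unsafeAddr msg[0]), uint(msg.len))
echo $ctx.finish()             # same hex string as above
ctx.clear()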
For the byte format it would be nice to have a corresponding stream in the streams module as well, so you would be able to easily read data in from, and write data out to, such a stream.
Yes, a corresponding stream type would be really useful.
I created a ByteStream stream type in this PR. It's basically a copy-paste of the StringStream, as they share a lot of semantics.
If I may add my two cents, a hash function should foremost support a stack-allocated result. If something like a seq[byte] result is convenient, that should be implemented in a wrapper. A general hash API should also support variable output length, though not necessarily at runtime. NIST has standardized the SHAKE functions for this purpose, and all SHA-3 variants should fit comfortably in the API.
I've had a look at the crypto packages in the stdlib (md5 and sha1). I don't see a way to make the change gradually with a {.deprecated.} pragma.
The way forward would be to deprecate those packages altogether and either redirect to Nimble packages or implement a revamped crypto stdlib with the desired API.
fwiw, https://github.com/status-im/nim-stew is the new place for the experimental APIs we produce (byteutils moved in there)
A cryptographer and security auditor just wrote an article about the security issues caused by using strings instead of byte arrays: https://littlemaninmyhead.wordpress.com/2021/09/15/if-you-copied-any-of-these-popular-stackoverflow-encryption-code-snippets-then-you-did-it-wrong/
As we're working with crypto at Status, we regularly get bitten by the different types in Nim libraries to represent binary data.
Context:
Crypto and hashing libraries in the wild use arrays or seqs of uint8, char, byte, object, uint32, cuchar, and also strings to represent binary hashes. I have noted the following issues:
A. Output hash type is inconsistent.
B. The 0.18 introduction of $ for arrays broke libraries.
C. Input type is inconsistent.
D. Conversion of the hash type to a string, hex or binary representation is inconsistent:
- $ on array[N, char] (N = 28, 32, 48, 64 only) returns a binary string; bin2hex is needed for the hex representation.
- $ on an array of bytes returns a string of uint8 values and requires toHex for the hex representation.
- $ on a pointer-based hash type just prints the pointer; it requires toHex for the hex representation.

Note that we are currently having this discussion internally, but since Araq mentioned his pain working with hashes following 0.18, I believe the whole Nim ecosystem would benefit.
Stakes
Nimbus (a client for Ethereum 2.0) will bring a lot of attention to the Nim crypto ecosystem and it's in our best interest to adopt healthy common standards. Beyond the internal representations, I think a common set of APIs is what matters.
Questions and considerations for the API:

1. "String" input: proc hash1234(s: string): Hash = ... What should s be, a hex string or binary data?

2. Variable size input. Which one(s) among:

proc hash1234(s: seq[uint8]): Hash = ...
proc hash1234(s: seq[byte]): Hash = ...
proc hash1234(s: seq[char]): Hash = ...
proc hash1234(s: string): Hash = ...

3. $ behaviour. Which one among:

4. Chaining: hash1234(hash1234(foo)) should be allowed without casting/conversion.

5. Hex output. Which one among: $ or toHex?
Opinions
For consistency, one type among byte/uint8/char should be used. At the same time, one type among seq[byte]/seq[uint8]/seq[char] and string should be used.
Note: these are only my opinions and do not represent the opinion of all Status devs.
Opinion 1: $ is very error-prone right now. From an ergonomic point of view, it makes sense to convert binary blobs/hashes to their hex representation.
Opinion 2: $ converting to the string repr of seq[uint8] can be avoided by using distinct types or wrapping Hash = array[N, byte|uint8] in an object.
types or wrapping Hash = array[N, byte|uint8] in an object.Opinion 3: The internal representation should be byte.
Rationale:
A counterpoint can be made that byte (or char) hides the actual size of the type. However, Windows and POSIX require 1 byte = 8 bits. The only systems with 1 byte != uint8 are DSP chips (they use 1 byte = 12, 16 or 24 bits); see here.
Opinion 4: Let's nail it! :fireworks:
Inviting: @jangko, @cheatfate, @Araq, @zah, @yglukhov to the discussion.