mratsim commented 6 years ago

As we're working with crypto at Status, we regularly get bitten by the different types in Nim libraries to represent binary data.

Context:

Crypto and hashing libraries in the wild uses arrays or seq of uint8, char, byte, object, uint32, cuchar and also strings to represent the binary hashes. I have noted the following issues:

A. Output hash type is inconsistent:

array[N, char]:
- nimSHA2
array[N, uint8]:
- md5
object-wrapped array[bits div 8, uint8] (number of bits is part of the type):
- keccak_tiny (Status)
- nimcrypto (Status)
array[N, byte]:
- Eth-keys (Status)
distinct array[N, uint32]:
- SHA1
string
- libsodium
ptr cuchar + len
- nimssl (bindings autogenerated on install so not visible on Github)

B. 0.18 introduction of $ for arrays broke libraries:

C. Input type is inconsistent:

a string (binary buffer)
- nimSHA2
- libsodium
a hex string or a binary buffer (any kind: string, arrays, sequences)
- keccak_tiny (Status) and its dependency memRange package
a hex string
- eth_keys (Status)
ptr cuchar + len
- nimssl (bindings autogenerated on install so not visible on Github)
ptr uint8 + len
- nimcrypto. Further overloads are coming.

D. $ Conversion of the hash type is inconsistent between string, hex or binary representation:

nimSHA2 $ on array[N, char] (N = 28, 32, 48, 64 only) returns a binary string
- every other size returns the string representation of the numbers
libsodium also returns a binary string and requires bin2hex for the hex representation.
keccak_tiny (Status) returns the hex representation
md5 returns the hex representation
nim-eth-keys (Status) uses the default $ for array of bytes (returns a string of uint8) and require toHex for the hex representation
nimssl doesn't have $(it's a pointer). It requires toHex for the hex representation.

Note that we are currently having this discussion internally, but since Araq mentionned his pain working with hashes following 0.18, I believe the whole Nim ecosystem would benefit.

Stakes

Nimbus (A client for Ethereum 2.0) will bring a lot of attention to Nim crypto ecosystem and it's in our best interest to adopt healthy common standards. Beyond the internal representations, I think a common set of APIs is what matters

Questions and considerations for API:

1. "String" input

proc hash1234(s: string): Hash = ..., what should s be, a hex string or binary data?

2. Variable size input

Which one(s) among

proc hash1234(s: seq[uint8]): Hash = ...
proc hash1234(s: seq[byte]): Hash = ...
proc hash1234(s: seq[char]): Hash = ...
proc hash1234(s: string): Hash = ...

3. $ behaviour

Which one among:

An error
The hex representation
The uint8/byte numbers
Convert to binary blob

4. Chaining

hash1234(hash1234(foo)) should be allowed without casting/conversion

5. Hex output

Which one among:

$
toHex

Opinions

For consistency one type between byte/uint8/char should be used. At the same time one type between seq[byte]/seq[uint8]/seq[char] and string should be used.

Note: Those are only my opinions and does not represent the opinion of all Status devs

Opinion 1: $ is very error prone right now. From an ergonomic point of view, it makes sense to convert binary blob/hashes to their hex representation.

Opinion 2: $ converting to the string repr of seq[uint8] can be avoided by using distinct types or wrapping Hash = array[N, byte|uint8] in an object.

Opinion 3: The internal representation should be byte.

Rationale:

byte is for data or binary blob that are read/written
uint8 is for numbers (add, multiply, divide, modulo ...)
char is for text that can be printed.

A counterpoint can be made that byte (or char) hides the actual size of the type. However Windows and POSIX require 1 byte = 8-bit. The only systems with 1 byte != uint8 are DSP chips (they use 1 byte = 12, 16, 24 bit), see here.

Opinion 4: Let's nail it! :fireworks:

Inviting: @jangko, @cheatfate, @Araq, @zah, @yglukhov to the discussion.

mratsim commented 6 years ago

Mentionned by @arnetheduck , C++17 introduced a distinct type called byte

enum class byte : unsigned char {} ; (since C++17)

std::byte is a distinct type that implements the concept of byte as specified in the C++ language definition.

Like char and unsigned char, it can be used to access raw memory occupied by other objects (object representation), but unlike those types, it is not a character type and is not an arithmetic type. A byte is only a collection of bits, and only bitwise logic operators are defined for it.

PMunch commented 6 years ago

Yes! This has been annoying me for quite some while now. Another thing that should be taken into consideration is reading of data from a stream and writing to one. Whatever solution this comes to, the corresponding base type should be added to the streams module as well so as to facilitate easy input -> en-/decrypt -> output.

Personally I agree with the rationale offered, and while the DSP argument is valid Nim already has things like int which can take on multiple sizes. Doing the same for byte makes sense. I could write a crypto procedure that takes byte8 and the compiler will check that code using byte actually has the right sized byte.

arnetheduck commented 6 years ago

Another consideration for the API is streaming - for this it's customary to have a state that goes through init, update (many times) and finalization..

there can then be convenience helpers that call these three functions for non-streaming cases

zielmicha commented 6 years ago

I would vote for: proc hash1234(s: string): Hash = .... string in Nim (e.g. in stdlib) represents not only text, but also binary data.

Hash should be distinct array[byte, N] - byte is already defined as an alias to uint8. We should not worry about DSP architectures, it is very unlikely Nim will be used on one of these.

dom96 commented 6 years ago

What's the rationale for having a distinct Hash type? Couldn't we simply get away with proc hash(s: string): string?

As far as representing binary data in Nim is concerned, I agree with @zielmicha, string should be used.

jangko commented 6 years ago

this byte vs uint8 vs char or string/seq problem is not exclusively belongs to crypto/hashes library. Graphics library also suffer from the same problem. How to consistently represent pixels data for graphics API.

let's say we already reach consensus here and adopt a common standard. but the problem is still there, the output from other library, e.g. output from graphics library we feed into crypto/hashes library still need go through some sort of conversion. this additional conversion step might not be acceptable under some circumstances.

therefore I suggest we could use Nim provided features such as generics or template to create a flexible API that capable of accepting various inputs and produce various outputs. I know this involved quite a lot of work and required more elaborate test, but at least it can prevent future war on types.

PMunch commented 6 years ago

@dom96 and @zielmicha, why reuse the string type for something that's not intuitively a string? Nim has a nice powerful type system, so no need to represent two different things with the same type. Differentiating them means that we could even do nice things like having a $ for byte data that outputs hexadecimal, and we can avoid passing a string to something that takes a byte array and vice-versa. I would say we should go the opposite direction and specify a bytes type alias in the standard library that is actually a seq[byte] or an openarray[byte] or even a concept of indexable container of bytes to really make it clear that it is the preferred way of representing byte data.

zielmicha commented 6 years ago

@PMunch Then you can't do hash123("hello world").

The underlying problem is that there is no definite distinction what is a binary string and what is a text string.

Python 3 has the same problem. Note how it needs separate stream types for text and binary IO.

mratsim commented 6 years ago

I agree with PMunch, it's not C, we shouldn't use char * (string) for data blob. I also like the $ outputting hex data by defaut for openarray[byte]

@jangko not ideal but casting works until everyone migrates to the same standard

Note that @zah started a library just to deal with this issue when receiving data from external sources: https://github.com/status-im/nim-ranges.

@zielmicha

proc hash123(s: openarray[byte]): array[N, byte] =
  ...

proc hash123(s: string): array[N, byte] =
  ...

proc hash123_fromHex(s: string): array[N, byte] =
  ...

Also maybe it would be worth it to introduce a hex package in the stdlib, with: type HexString = distinct string

PMunch commented 6 years ago

@zielmicha basically what @mratsim said, but you could also have a converter from string to bytes to make strings work everywhere that bytes work, but still keep bytes a separate type. mratsims example of what multiple hash functions would look like looks shockingly similar to how Ada does it. The Bytes type there is declared in another part of the library, but that library implements all kinds of different crypto algorithms, something that Nim has in multiple different modules. Ada is known for it's type safety and I think taking a page from their book on this is a good way to go.

The underlying problem is that there is no definite distinction what is a binary string and what is a text string.

Couldn't agree more, and this is what we now have the possibility to add to Nim. By having a bytes type we can easily make the distinction clear, both to the programmer, and to the reader.

mratsim commented 6 years ago

@Araq feedback on IRC

input type: openArray[byte] output type: array[X, byte] where X depends on the concrete hashing algorithm

genotrance commented 6 years ago

Just last night, I was thinking of writing a Nim API to abstract the C interop in nimssl which is a wrapper for Openssl. Right now, only the SHA and AES portions are wrapped.

Based on what is agreed here, accordingly the API can be adjusted. The underlying implementation returns an array of chars which I accordingly cast and then convert to a hex string.

FedericoCeratto commented 6 years ago

Mind adding https://nimble.directory/pkg/libsodium https://github.com/federicoceratto/nim-libsodium to the list?

Araq commented 6 years ago

Here is my proposal. A_ here stands for the concrete algorithm (SHA264, etc.).


type
  A_Digest* = distinct array[X, byte]
  A_Context* = object ## accumulator state
    # fields here

proc init*(a: var A_Context) = ...
proc update*(a: var A_Context; bytes: openArray[byte]) = ...
proc finish*(a: var A_Context): A_Digest = ...
proc toHex*(a: A_Digest; upperCase: bool = false): string = ...

This is the minimal protocol a hashing algorithm should support. The API introduces no overhead as far as I can see and it supports "streaming". On top of that either via a concept or via dynamic binding a dispatcher can be implemented, but first we also need a convenience template that is algorithm agnostic:

template hashBytes*(algorithm: typedesc; bytes: openArray[byte]; upperCase = false): string =
  var context: algorithm
  init(context)
  update(context, bytes)
  toHex(finish(context), upperCase)

And let's also have a toBytes view for string:

template toBytes*(x: string): seq[byte] = 
  cast[seq[byte]](x)

Thankfully seq[byte] can be passed to openArray[byte] without a copy.

zah commented 6 years ago

If we are going to define a minimal set of types/concepts in the standard lib, I think we should define a single Digest type with a parametric compile-time size.

type
  MDigest*[bits: static[int]] = object
    data*: array[bits div 8, uint8]

Currently, we have such a type in Nimcrypto: https://github.com/cheatfate/nimcrypto/blob/master/hash.nim#L6

Regarding the use of openarray in the signature of update, in our keccak lib, I've chosen to use a custom MemRange type that can be manually constructed from more input data types (not just from strings, seq and arrays, which are required by openarray): https://github.com/status-im/nim-ranges/blob/master/ranges/memranges.nim

My long-term vision is that this MemRange type will become a converter concept and the user will be able to provide their own toMemRange proc for custom types. Perhaps openarray can gain some options for manual construction as well, which will eliminate the need for a separate type, but right now in networking code in particular it's crucial that we able to construct a MemRange for a sub-slice of the bytes arrived from the network.

zah commented 6 years ago

Please note that Nimcrypto already has a similar function to the hashBytes one proposed here:

https://github.com/cheatfate/nimcrypto/blob/master/hash.nim#L17

The difference is that it returns a fixed-size Digest struct, which can be then converted to string if the user needs a string representation.

cheatfate commented 6 years ago

The only reason why openarray[T] is not an option, is it limitations. For example openarray[T] could not be used as argument in iterator/closure iterator procedure. Also you are not able to create new openarray[T] from some data you have, you can only use it as argument. So its why MemRange is the only usable solution.

Araq commented 6 years ago

Perhaps openarray can gain some options for manual construction as well, which will eliminate the need for a separate type, but right now in networking code in particular it's crucial that we able to construct a MemRange for a sub-slice of the bytes arrived from the network.

We definitely need a way to construct an openarray from the underlying (pointer, length) pair. Probably yet another builtin for system.nim. It's true that my proposal depends on this missing piece.

Araq commented 6 years ago

The difference is that it returns a fixed-size Digest struct, which can be later converted to string if the user needs a string representation.

Well that is easily done in my proposal too but it seemed rarely useful. In the end the different algorithms return different object types so if I need some common result representation it's often a string.

zah commented 6 years ago

People want strings on their screen, but our code strongly prefers the tightly-packed bytes of the Digest structure (for storing them in memory and databases, for putting them in binary-encoded network messages and so on).

zah commented 6 years ago

We definitely need a way to construct an openarray from the underlying (pointer, length) pair. Probably yet another builtin for system.nim. It's true that my proposal depends on this missing piece.

We also need a way to get sub-slices of the openarrays and a way to keep them around in memory in async code (as @cheatfate mentioned). In our project, I've currently also defined a ByteRange object that cheats by storing a reference to a shallow sequence, but I hope to optimize this in the future by either implementing a native garbage collected range type or by using the new destructors run-time to manage the memory manually.

Here is the ByteRange type: https://github.com/status-im/nim-rlp/blob/master/rlp/types.nim#L7

But clearly, this problem is separate from the interface issue discussed here. With "manual" openarray construction it will be possible to adapt any range implementation to the proposed interface.

euantorano commented 6 years ago

My sysrandom library also suffers from this kind of problem. Currently it can write the random bytes into a pointer, array of static[int] size or a seq Having an agreed upon standard type would be very useful IMO. Node.js for example have the Buffer type for this kind of purpose.

mratsim commented 6 years ago

Would a binarytypes or cryptotypes package that just defines the bare minimum help for standardization across packages?

type
  MDigest*[bits: static[int]] = object
    # Message digest result of any hash function
    data*: array[bits div 8, uint8]

type HexString = distinct string

proc `$`[bits: static[int]](d: MDigest[bits]): HexString
  const hexChars = "0123456789abcdef"

  result = newString(2*N div 8)
  for i in 0 ..< N:
    result[2*i] = hexChars[int ba[i] shr 4 and 0xF]
    result[2*i+1] = hexChars[int ba[i] and 0xF]

proc readHexChar(c: char): byte {.noSideEffect, inline.}=
  ## Converts an hex char to a byte
  case c
  of '0'..'9': result = byte(ord(c) - ord('0'))
  of 'a'..'f': result = byte(ord(c) - ord('a') + 10)
  of 'A'..'F': result = byte(ord(c) - ord('A') + 10)
  else:
    raise newException(ValueError, $c & "is not a hexademical character")

proc hexToMDigest*[bits: static[int]](hexStr: string): MDigest[bits] {.noSideEffect, noInit.}=
  ## Read an hex string and store it in a message Digest
  var i = 0
  if hexStr[i] == '0' and (hexStr[i+1] == 'x' or hexStr[i+1] == 'X'):
    inc(i, 2) # Ignore 0x and 0X prefix

  assert hexStr.len - i == 2 * (N div 8)

  while i < N:
    result[i] = hexStr[2*i].readHexChar shl 4 or hexStr[2*i+1].readHexChar
    inc(i)

proc hexToSeqByte*(hexStr: HexString): seq[byte] {.noSideEffect.}=
  ## Read an hex string and store it in a sequence of bytes in Big-Endian order
  var i = 0
  if hexStr[i] == '0' and (hexStr[i+1] == 'x' or hexStr[i+1] == 'X'):
    inc(i, 2) # Ignore 0x and 0X prefix

  let N = (hexStr.len - i) div 2

  result = newSeq[byte](N)
  while i < N:
    result[i] = hexStr[2*i].readHexChar shl 4 or hexStr[2*i+1].readHexChar
    inc(i)

proc toBytes(s: string): seq[byte] {.inline.}=
  cast[seq[byte]](s)

MDigest can be called Buffer or MemRange of course

mratsim commented 6 years ago

Actually I really like the (stack) Buffer type idea of @euantorano, it can then be customized for the needs of graphics, crypto, networking, multi-precision maths.

mratsim commented 6 years ago

@FedericoCeratto, @genotrance, @cheatfate I've updated the original post with libsodium (input/output: string), nimssl (input/output: ptr cuchar + len) and nimcrypto (input: ptr uint8 + len) (further overloads coming to nimcrypto)

data-man commented 6 years ago

I like this RFC, thanks! :-)

Not all hash algorithms (especially non-crypto) can be incremental but can have a seed parameter. My suggestion:

proc init*(a: var A_Context, seed: openArray[byte])

or/and

proc setSeed*(a: var A_Context, seed: openArray[byte])

For thought: cryptopp botan botan (D port)

arnetheduck commented 6 years ago

corresponding C++ proposal for span / memrange: http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0122r1.pdf

the Buffer looks like it owns memory and makes decisions about where that memory is at (heap) - for input to hash functions, that's not really necessary - a view will do just fine. for output, hashes are fixed-length.

if doing crypto functions instead (en/decrypt), it might make sense to have caller manage memory so as not to make decisions for them - that would mean an api that takes an openarray-like input and a c++-span-like output (or streams, though that can usually be built as a layer on top).

the same kind of api would be good for compression too, and a wide range of other use cases as others have noted.

zielmicha commented 6 years ago

There is some discussion about views/memranges here: #5753

It was settled that openarray[T] will have this role (#5957).

mratsim commented 6 years ago

Tagging @krux02 as he already requested C++ span/view/memrange/slice/openarray here https://github.com/nim-lang/Nim/issues/5437

In the wild:

C++ call that span,
Pascal/Oberon/Modula Open Array
Rust call that slice Slices are a view into a block of memory represented as a pointer and a length.

krux02 commented 6 years ago

@mratsim Thanks for the invite. I will now also add my two cents. I just started reading this thread, and yes I agree, a consistent hashing interface is important. I like the interface that Araq suggested. It looks a lot like the old hashing interface, but without the ugly hashing operator definitions. In my opinion the hashing operators are a bad example for operator overloading/definition.

I just don't agree with the suggested convenience template. It is called hashBytes but the argument already says that it hashes bytes, twice in the name of the argument and the type. So that part is redundant, but it returns a string that holds the hex representation of the input bytes, and that is not obvious from anything at all. So I would just call it hashToHexStr, and make it a procedure. But it was probably just meant as an example and discussing it's name is pointless.

The only case that I would like to throw into the pot to think about is to not make runtime branching on hashing algorithms easy. But with the proposed interface it should work nicely to implement something like this:

proc hash(hashKind: HashKind; bytes: openarray[byte]; dst: var openarray[byte]): string =
  case hashKind
  of ...

dom96 commented 6 years ago

Looking back at some of the older messages, in reference to @Araq's reply, I'm incredibly surprised that this is valid Nim code:

template toBytes*(x: string): seq[byte] = 
  cast[seq[byte]](x)

Is it really valid? Are sequences and strings really that similar?

My reasoning for preferring string is very simple: C compatibility. Isn't it the case that most C libraries return binary data in the form a char *?

I was under the impression that converting this data into a seq[byte] would have a significant overhead, but it appears that is not the case. So yeah... let's adopt this approach and make working with binary data easy (it's currently quite a PITA, ideally it would be nice to have something like Python's struct module in the stdlib... but I'm getting offtopic here).

mratsim commented 6 years ago

@krux02 will your case work at compile time? I would separate the HaskKind and the string conversion.

proc hash[H: HashKind](bytes: openarray[byte]): HashKind =
  when H is: ...

@dom96, byte = char (from C standard: a char and a byte are the smallest addressable unit of memory) and in 99% of the case char = uint8 (except the DSP i mentionned).

Edit: In the past openarray[char] used to match with string input and caused problem. They are indeed the same.

Also for C compatibility we have cstring and pointer.

krux02 commented 6 years ago

I already said it in the chat, but I will repeat it here as well. When you abuse a string as binary data. Be aware that you are in fact abusing the sting type. Never ever expose this abuse in the public API. And to make sure you never expose this abuse in the public API, just also don't use it internally.

And to prefere string for C compatibility, is wrong in several layers. First of all, when you warp a C library, provide a nice Nim wrapper. Please don't expose c types in the Nim API. If you do so you force everybody who uses that library to put the conversion at the usage side. And that is just not very nice. And then, string is not any less or more compatible with C than any other type in Nim.

arnetheduck commented 6 years ago

the name hash tends to be used for a non-cryptographic hash that's suitable for dictionaries - is there need for some distinguishment there?

mratsim commented 6 years ago

As a first step we released a byteutils package to deal with byte <-> hex conversion to and from seq[byte] and openarray[byte].

Also regarding @arnetheduck remark, we settled so far on the MDigest (Message Digest) type for crypto hashes defined in nimcrypto here.

Basically all hashes provide the following interface:

MDigest type (array of byte)
HashType / HashKind (sha2, sha3, AES ...)
Context for incremental hashing
- init
- update
- finish
HashType.digest(foo) for stateless/context-free hashing

type
  MDigest*[bits: static[int]] = object
    data*: array[bits div 8, byte]

  bchar* = byte | char

proc `$`*(digest: MDigest): string =
  result = ""
  var i = 0'u
  while i < uint(len(digest.data)):
    result &= hexChar(cast[byte](digest.data[i]))
    inc(i)

proc digest*(HashType: typedesc, data: ptr byte,
             ulen: uint): MDigest[HashType.bits] =
  mixin init, update, finish, clear
  var ctx: HashType
  ctx.init()
  ctx.update(data, ulen)
  result = ctx.finish()
  ctx.clear()

proc digest*[T](HashType: typedesc, data: openarray[T],
                ostart: int = 0, ofinish: int = -1): MDigest[HashType.bits] =
  mixin init, update, finish, clear
  var ctx: HashType
  let so = if ostart < 0: (len(data) + ostart) else: ostart
  let eo = if ofinish < 0: (len(data) + ofinish) else: ofinish
  let length = (eo - so + 1) * sizeof(T)
  ctx.init()
  if length <= 0:
    result = ctx.finish()
  else:
    ctx.update(cast[ptr byte](unsafeAddr data[so]), uint(length))
    result = ctx.finish()
  ctx.clear()

In the HashType.bits bits is a static[int] as shown below.

The definition of a HashType, for example sha3/keccak:

type
  KeccakKind* = enum
    Sha3, Keccak, Shake

  KeccakContext*[bits: static[int],
                 kind: static[KeccakKind]] = object
    q: array[25, uint64]
    pt: int

PMunch commented 6 years ago

For the byte format it would be nice to have a corresponding stream in the streams module as well. So you would be able to easily read data in from, and write data out to such a stream.

euantorano commented 6 years ago

Yes, a corresponding stream type would be really useful.

On 3 Apr 2018, at 08:18, PMunch notifications@github.com wrote:

For the byte format it would be nice to have a corresponding stream in the streams module as well. So you would be able to easily read data in from, and write data out to such a stream.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

PMunch commented 6 years ago

I created a ByteStream stream type in this PR. It's basically a copy-paste of the StringStream as they share a lot of semantics.

ehmry commented 6 years ago

If I may add my two cents, a hash function should support a stack result foremost. If something like a seq[byte] result is convenient, that should be implemented in a wrapper. A general hash API should also support variable output length, though not necessarily at runtime. The NIST has standardized the SHAKE functions for this purpose, and all SHA-3 variants should fit comfortably in the API.

mratsim commented 5 years ago

I've had a look into the crypto packages in stdlib (md5 and sha1). I don't see away to gradually make the change with a {.deprecated.} pragma.

The way forward would be to deprecate those packages altogether and either redirect to nimble packages or implement a revamped crypto stdlib with the desired API.

arnetheduck commented 5 years ago

fwiw, https://github.com/status-im/nim-stew is the new place for experimental API's that we produce (byteutils moved in there)

mratsim commented 3 years ago

A cryptographer and security auditor just made an article about the security issues caused by using strings instead of byte arrays: https://littlemaninmyhead.wordpress.com/2021/09/15/if-you-copied-any-of-these-popular-stackoverflow-encryption-code-snippets-then-you-did-it-wrong/

nim-lang / RFCs

RFC: Standardize API of all crypto/hashes packages #32

Context:

Stakes

Questions and considerations for API:

Opinions