jedisct1 / libhydrogen

A lightweight, secure, easy-to-use crypto library suitable for constrained environments.
https://libhydrogen.org

Replace timer-based RNG seed with an ADC-reference-based one #116

Open DeeFuse opened 2 years ago

DeeFuse commented 2 years ago

Remove "Arduino.h" dependency Fixes #115

lonjil commented 2 years ago

I don't think that function actually generates 8 bits of entropy per call, so it should probably do more than 32 calls. I tried 256 calls and it was still much, much faster than the clock-based code, so a large number of calls definitely wouldn't be problematic. It might also be a good idea to skip the ad-hoc shifting and masking code used in that function and feed everything directly into the hashing function.

DeeFuse commented 2 years ago

Unfortunately I have no AVR board to test the entropy on. I can imagine that the result of the AD conversion will fluctuate in a range 0 + X < n < 1023 - Y, so the higher bits might not contribute much entropy; I think the masking and shifting should compensate for that. Instead of calling hydro_random_rbyte 256 times (and masking the result to one bit), I'd rather increase the number of nConversions inside this function. @jedisct1 What's your opinion on the entropy quality? Some insight into the entropy quality can be seen at the bottom of this page.

lonjil commented 2 years ago

When I tried it, I didn't mask hydro_random_rbyte down. I just let it feed the output directly to hydro_hash_update 256 times by changing ebits += 8 to ebits += 1.
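
Roughly, the collection loop then looks like this (a simplified sketch: the context string is just a placeholder, and hydro_random_rbyte is the ADC helper from this PR, not part of the public API):

#include <hydrogen.h>
#include <stdint.h>

extern uint8_t hydro_random_rbyte(void); /* the ADC helper discussed in this PR */

static void
seed_sketch(void)
{
    uint8_t          r;
    uint16_t         ebits = 0;
    hydro_hash_state st;

    hydro_hash_init(&st, "seedctx_", NULL); /* placeholder 8-byte context */
    while (ebits < 256) {
        r = hydro_random_rbyte();           /* raw ADC byte, no shifting or masking */
        hydro_hash_update(&st, &r, sizeof r);
        ebits += 1;                         /* previously ebits += 8 */
    }
    /* ...finalize into the RNG state the way hydro_random_init does... */
}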

DeeFuse commented 2 years ago

I had a look into the hydro_hash_update function; it seems that the input is XORed with the state and then fed into gimli_core_u8 every gimli_RATE (16) bytes. I could see a benefit in creating a configuration macro like LIBHYDROGEN_RNG_ENTROPY_BITS instead of the hardcoded 256 bits in the hydro_random_init functions. That would allow selecting the "quality" of the RNG initialisation depending on your needs (better quality vs faster initialisation).
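
Roughly, the absorb path behaves like this simplified sketch (not the exact libhydrogen code; the gimli_core_u8 signature is the one from the portable implementation and the tag value is assumed to be 0 here):

#include <stddef.h>
#include <stdint.h>

extern void gimli_core_u8(uint8_t state_u8[48], uint8_t tag); /* Gimli permutation, gimli_BLOCKBYTES = 48 */

static void
absorb_sketch(uint8_t state[48], size_t *pos, const uint8_t *in, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        state[*pos] ^= in[i];        /* XOR input into the rate part of the state */
        if (++(*pos) == 16) {        /* gimli_RATE */
            gimli_core_u8(state, 0); /* permute the full state once the rate block is full */
            *pos = 0;
        }
    }
}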

jedisct1 commented 2 years ago

I agree that we should feed everything to the hash function instead of shifting and masking.

How many unbiased bits can we safely assume here?

With ebits += 1, the initialization procedure completes in 400ms on the Uno, which should be acceptable for most applications.

If you feel like it's not, ebits += 2 would remain safe with a single unbiased bit.

jedisct1 commented 2 years ago

> That would allow selecting the "quality" of the RNG initialisation depending on your needs (better quality vs faster initialisation)

Having an unsafe option is not a good idea.

DeeFuse commented 2 years ago

I can't give a definite answer on the number of unbiased bits from the code alone. I'll see if I can find an AVR Arduino in my stash to generate a larger number of random numbers, so we have a dataset to work with.

lonjil commented 2 years ago

The Uno's serial USB connection is unfortunately very slow, but I have a few hundred KiB of output data so far. I'm testing with nConversions=20, nConversions=100, and nConversions=1, and in a bit I'll change the code to remove the shifting and looping completely and try that.

DeeFuse commented 2 years ago

A basic test could be to simply compute the average of all random uint8_t outputs of the generator function and observe whether it converges towards 127.5. Another option would be to count the occurrences of each random value. The conversion loop (nConversion) has to be >= 2, otherwise the ADC trickery with the high/low pulling of the reference and the resulting random sample interval would not work. IMHO the shift / mask can be omitted, but not the conversion loops.
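
For example, a quick host-side check along those lines could be as simple as this (throwaway analysis code, not part of the library):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    unsigned long long count[256] = { 0 }, total = 0, sum = 0;
    int                c;

    while ((c = getchar()) != EOF) { /* read the captured random bytes from stdin */
        count[(uint8_t) c]++;
        sum += (uint8_t) c;
        total++;
    }
    if (total == 0) {
        return 1;
    }
    printf("mean = %.3f (127.5 = unbiased)\n", (double) sum / (double) total);
    for (int i = 0; i < 256; i++) {  /* per-value histogram */
        if (count[i] != 0) {
            printf("%3d: %llu\n", i, count[i]);
        }
    }
    return 0;
}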

lonjil commented 2 years ago

That is true. Right now I'm testing it with a global instead, so that it still pulls up and down on each call without looping internally. Also, it seems the main slowdown was actually random_rbyte, not the serial interface, so I have a very large amount of data for N=1 and for what I described in the previous sentence.

lonjil commented 2 years ago

Doing some more testing, any prescaler option other than 64 seems to produce fewer unique readings from the ADC. Also, regardless of settings, at least half of all readings are 0.

DeeFuse commented 2 years ago

I'd say the 0 readings are expected: every second iteration the ADC input gets grounded so it can recharge. We could optimize the code to ground and recharge in the same iteration.

lonjil commented 2 years ago

I generated 17MiB with this code

uint8_t
random_rbyte()
{
    uint8_t rnd = 0;

    ADMUX = 0b01001111; // Low (0V)
    ADMUX = 0b01001110; // High (1.1V)
    ADCSRA |= 1 << ADSC; // Start a conversion

    // ADSC is cleared when the conversion finishes
    while ((ADCSRA >> ADSC) & 1) {
        // wait for the conversion to complete
    }

    rnd = ADCL; // do not swap this sequence. Low has to be fetched first.
    rnd ^= ADCH; // the value is always between 0-3

    return rnd;
}

and got the following byte value distribution

      2 01000101
      4 01001001
    428 01001010
    295 01001011
      8 01010100
     40 01010101
      5 01010110
    328 01010111
    239 01011000
      1 01011101
      2 01011110
     11 01011111
     58 01100000
      4 01100001
     15 01100011
   1684 01100100
   2217 01100101
      9 01100110
      7 01101001
      9 01101010
      5 01101100
     68 01101101
    317 01101110
     68 01101111
    129 01110000
   3413 01110001
   2605 01110010
     10 01110011
      4 10010011
      5 10010100
     13 10011001
      9 10011010
     11 10011011
     10 10011100
      2 10100111
      5 10101000
     87 10101001
    647 10101010
  28555 10101011
 358204 10101100
1229554 10101101
1625192 10101110
 634598 10101111
 959201 10110000
1987364 10110001
1220261 10110010
 933451 10110011
1656613 10110100
2056008 10110101
1719886 10110110
1588002 10110111
 986552 10111000
 669690 10111001
 120899 10111010
   9067 10111011
  12170 10111100
   7023 10111101
  13495 10111110
  10785 10111111
  16736 11000000
   7771 11000001
   8332 11000010
   1009 11000011
    282 11000100
    198 11000101
     64 11000110
     59 11000111
     52 11001000
     44 11001001
      7 11001010
      2 11001011

jedisct1 commented 2 years ago

How frequently do you get the same value twice in a row? What distribution do you get with only the 2 or 4 last bits? And with a Von Neumann extractor?

lonjil commented 2 years ago

What kind of tools are good for this sort of analysis? I've just been stringing commands together in bash, but that's starting to feel quite inadequate. Looking at every individual pair of bytes, 158261 have the same byte twice in a row. Here's the actual file I generated https://0x0.st/-tyR.txt (it's raw binary data, not sure why 0x0 added the .txt) if you want to do your own analysis on it.

lonjil commented 2 years ago

Last 4 bits:

 976124 0000
1998552 0001
1231198 0010
 934489 0011
1658592 0100
2058465 0101
1719964 0110
1588391 0111
 986848 1000
 669845 1001
 121999 1010
  37930 1011
 370389 1100
1236646 1101
1639006 1110
 645462 1111

Last 2 bits:

3991953 00
5963508 01
4712167 10
3206272 11

DeeFuse commented 2 years ago

I doubt that a single instruction is enough to discharge the internal capacitor; it doesn't leave a sufficiently long charge time and range for a good distribution:

ADMUX = 0b01001111; // Low (0V)
ADMUX = 0b01001110; // High (1.1V)

lonjil commented 2 years ago

I changed it so that the input is set to low (0V) right at the end of the function, and set to high (1.1V) right before sampling. This seems to increase the number of unique readings.

lonjil commented 2 years ago

I generated 1.6GiB. It contains 100 distinct byte values. I'm currently writing a tool in Zig to extract and analyze it.

lonjil commented 2 years ago

I ran a Von Neumann extractor on just the lowest bit of each byte, and the output was pretty even according to ent:

Entropy = 7.984414 bits per byte.

Optimum compression would reduce the size
of this 31211520 byte file by 0 percent.

Chi square distribution for 31211520 samples is 673722.13, and randomly
would exceed this value less than 0.01 percent of the times.

Arithmetic mean value of data bytes is 127.5149 (127.5 = random).
Monte Carlo value for Pi is 3.187330063 (error 1.46 percent).
Serial correlation coefficient is 0.001175 (totally uncorrelated = 0.0).

It also passes a few dieharder tests, but the size (30M) is not big enough for most of the tests. Doing an extraction on every bit of every byte gave quite a crappy result. Next I'm going to separately test each bit position in the same manner.

I'm currently running another Arduino sketch that generates 4096 samples and then resets, to see if there are any right-after-boot correlations between runs.
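
For reference, the extraction step on bit 0 is essentially this (host-side sketch that reads the raw capture from stdin and writes packed extracted bits to stdout):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    int     c1, c2;
    uint8_t out = 0, nbits = 0;

    while ((c1 = getchar()) != EOF && (c2 = getchar()) != EOF) {
        uint8_t b1 = (uint8_t) c1 & 1, b2 = (uint8_t) c2 & 1; /* bit 0 of each byte */
        if (b1 == b2) {
            continue;                      /* Von Neumann: drop 00 and 11 pairs */
        }
        out = (uint8_t) ((out << 1) | b1); /* 01 pair -> 0, 10 pair -> 1 */
        if (++nbits == 8) {
            putchar(out);                  /* pack 8 extracted bits per output byte */
            out   = 0;
            nbits = 0;
        }
    }
    return 0;
}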

lonjil commented 2 years ago

I have now done Von Neumann extraction on each bit position. Here are the sizes of each output file:

Bytes    MB   File             Output size / Input size
31211520  30M von_neumann_bit0 0.14922
23101440  23M von_neumann_bit1 0.11045
18657280  18M von_neumann_bit2 0.08920
18440192  18M von_neumann_bit3 0.08816
 1937408 1.9M von_neumann_bit4 0.00926
  962560 940K von_neumann_bit5 0.00460
  372736 364K von_neumann_bit6 0.00178
  323584 316K von_neumann_bit7 0.00155

According to ent [1] each of these files is high entropy. Combined with the decrease in file sizes, I interpret this as meaning that while the high bits are very biased, they are still quite decorrelated. This doesn't mean that they're independent, but I think it's a good sign. I think an iterated Von Neumann extractor might be useful to get more examinable data out of the raw data, but I'm not sure.

[1] https://paste.sr.ht/~lonjil/b6814c2139c2f55c9d37abece27aab44bc568823

jedisct1 commented 2 years ago

This is great!

How long does it take to extract 256 bits from the lower bit?

lonjil commented 2 years ago

I'm on my phone so my numbers are inexact, but I believe it took around 1707 input bits for Von Neumann to produce 256 bits.

lonjil commented 2 years ago

As a point of reference regarding performance, this hydro_random_init [1] implementation takes ~517 milliseconds to run hydro_init(). That is with 2048 samples, which I believe means 131 (?) or so invocations of the Gimli function. With 1024 samples it takes ~268 milliseconds; 4096 samples ends up at ~1014 milliseconds.

[1] https://paste.sr.ht/~lonjil/4a2cd787141af107fb5ac3b7eb3914b9d2034eef

DeeFuse commented 2 years ago

I see that you removed the ADC conversion-time "randomisation". With that in place, the randomness of the readings could be increased.

lonjil commented 2 years ago

I'm not sure it would. Changing the prescaler to 2 makes it not random at all, and changing it to 128 makes it much slower while also reducing the entropy per reading (at a glance anyway, I didn't do a rigorous test). I think the other possible selections besides 64 all produced less entropy as well, so any extra randomness from an unpredictable ADC clock might just be lost to a more predictable sample output in some cases.

Also, the clock prescaler needs to be set to a different value depending on the chip clock rate. An ATmega328P at 8 MHz should probably use 32, one at 4 MHz should probably use 16, and so on, and one at 20 MHz might do best at 128. The random selection would make that aspect harder to reason about, and not possible at all on some configurations.
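
For illustration, picking the prescaler from F_CPU could look roughly like this (the mapping is an assumption based on the values above, not something that has been validated):

#include <avr/io.h>

static void
adc_set_prescaler(void)
{
    ADCSRA &= (uint8_t) ~((1 << ADPS2) | (1 << ADPS1) | (1 << ADPS0)); /* clear ADPS2..0 */
#if F_CPU >= 20000000UL
    ADCSRA |= (1 << ADPS2) | (1 << ADPS1) | (1 << ADPS0); /* /128 */
#elif F_CPU >= 16000000UL
    ADCSRA |= (1 << ADPS2) | (1 << ADPS1);                /* /64 */
#elif F_CPU >= 8000000UL
    ADCSRA |= (1 << ADPS2) | (1 << ADPS0);                /* /32 */
#else
    ADCSRA |= (1 << ADPS2);                               /* /16 */
#endif
}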

lonjil commented 2 years ago

I ran an implementation of the NIST SP 800-90B non-IID min-entropy estimator on 2 million samples, and it gave 0.315568 bits per sample.

DeeFuse commented 2 years ago

I've tested the code with an ATmega2561 in a commercial appliance I had at hand. The following problems emerged:

  1. The V_BG ADMUX settings are different (compared to the ATmega328).
  2. External (strong) AREF pull-ups result in identical readings on every call (with @lonjil's code as well as with the StackOverflow suggestion).

lonjil commented 2 years ago

Since different AVR micros differ so strongly on so many points, it may be necessary to have bespoke routines for every model :/

jedisct1 commented 2 years ago

This is sad :(

And we won't be able to test on every AVR model.

lonjil commented 2 years ago

I guess the code will have to have preprocessor directives for the target chip, and an error instructing people to open an issue for unimplemented models.
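
Something along these lines, for instance (the per-chip header names are made up; the device macros are the standard avr-libc defines):

#if defined(__AVR_ATmega328P__)
#  include "hydro_random_adc_atmega328p.h" /* hypothetical per-chip seeding routine */
#elif defined(__AVR_ATmega2561__)
#  include "hydro_random_adc_atmega2561.h" /* hypothetical per-chip seeding routine */
#else
#  error "No RNG seeding routine for this AVR model yet; please open an issue."
#endif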

Also, I wonder if there is a published in-depth description of the ADC hardware anywhere, so that someone with the appropriate physics knowledge could figure out a lower bound on the min-entropy, rather than the upper bound we get from sample testing.

DeeFuse commented 2 years ago

I think we won't be able to determine a definite min-entropy bound for that type of source. After running the tests on my boards, I think this approach does not fit the library's goal of being "...hard-to-misuse..." when it is possible to accidentally "disable" the random data source with a common hardware design. TL;DR: I think we should not rely on external circumstances / hardware designs for a source of random data.

I'm currently testing another (WDT + timer based) RNG which would be independent of external / user-influenceable circumstances. The randomness here comes from the "low accuracy" of the internal 128 kHz RC oscillator for the WDT.

Entropy = 7.957624 bits per byte.

Optimum compression would reduce the size
of this 4096 byte file by 0 percent.

Chi square distribution for 4096 samples is 239.62, and randomly
would exceed this value 74.71 percent of the times.

Arithmetic mean value of data bytes is 127.8352 (127.5 = random).
Monte Carlo value for Pi is 3.131964809 (error 0.31 percent).
Serial correlation coefficient is 0.031658 (totally uncorrelated = 0.0).

lonjil commented 2 years ago

@DeeFuse would you mind posting the code for that?

DeeFuse commented 2 years ago

@lonjil You can find the code I've used here (cleaned up). My test hardware was an ATmega2561 with F_CPU = 14745600UL.
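
The core of the approach is roughly the following (a sketch with assumed register choices for an ATmega328P/2561-class part, not the exact gist code):

#include <avr/interrupt.h>
#include <avr/io.h>
#include <avr/wdt.h>
#include <stdint.h>

static volatile uint8_t jitter_sample;
static volatile uint8_t jitter_ready;

ISR(WDT_vect)
{
    jitter_sample = (uint8_t) TCNT1;   /* low byte of the crystal-driven timer */
    jitter_ready  = 1;
}

static void
jitter_rng_setup(void)
{
    TCCR1A = 0;
    TCCR1B = (1 << CS10);              /* Timer1 free-running at F_CPU */
    cli();
    wdt_reset();
    WDTCSR = (1 << WDCE) | (1 << WDE); /* timed sequence to unlock WDT config */
    WDTCSR = (1 << WDIE);              /* interrupt-only mode, shortest (~16 ms) period */
    sei();
}

The WDT runs from the imprecise internal 128 kHz RC oscillator while Timer1 runs from the main clock, so the captured counter value drifts from interrupt to interrupt; the low bits of each sample are then mixed into the hash state.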

jedisct1 commented 2 years ago

Old PR and discussion, but still relevant :)

What should we do?

DeeFuse commented 2 years ago

I could clean up my gist with the updated code and push it to my PR. I'm no security / crypto expert. How do you feel about the implementation of randomness used in my code?

jedisct1 commented 2 years ago

The implementation looks fine, but I don't have the hardware to test it on.

DeeFuse commented 2 years ago

If you want to have a look into the randomness of this implementation, I can send you a bin file for analysis.

justind000 commented 2 years ago

> The implementation looks fine, but I don't have the hardware to test it on.

I have a lot of different boards available and can help out if needed.

DeeFuse commented 2 years ago

I created a 10k-number sequence as well as 10 shorter dumps taken right after booting the MCU: random.zip

I've done some reading about the AVR hardware differences which need to be accounted for to make this foolproof to use. After implementing some #ifdefs to catch those hardware differences, I'll feel comfortable pushing / merging the code into libhydrogen.

@justind000 I've updated my gist with the latest code. Feel free to run it on your AVRs. The random numbers are printed to USART1 @ 115200 baud.

DeeFuse commented 2 years ago

Shall we back up the state of the used peripherals (WDT / TIM1) and restore it after the random number generation?

jedisct1 commented 2 years ago

> Shall we back up the state of the used peripherals (WDT / TIM1) and restore it after the random number generation?

That sounds like a good idea. Applications may not expect them to be changed.
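
A sketch of what that could look like (register names assumed for an ATmega328P/2561-class part; not a tested implementation):

#include <avr/interrupt.h>
#include <avr/io.h>
#include <avr/wdt.h>
#include <stdint.h>

typedef struct {
    uint8_t  tccr1a, tccr1b, timsk1, wdtcsr;
    uint16_t tcnt1;
} seed_periph_state;

static void
seed_periph_save(seed_periph_state *s)
{
    s->tccr1a = TCCR1A;
    s->tccr1b = TCCR1B;
    s->timsk1 = TIMSK1;
    s->tcnt1  = TCNT1;
    s->wdtcsr = WDTCSR;
}

static void
seed_periph_restore(const seed_periph_state *s)
{
    TCCR1A = s->tccr1a;
    TCCR1B = s->tccr1b;
    TIMSK1 = s->timsk1;
    TCNT1  = s->tcnt1;
    cli();
    wdt_reset();
    WDTCSR = (1 << WDCE) | (1 << WDE); /* timed sequence to allow rewriting WDE */
    WDTCSR = s->wdtcsr;
    sei();
}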

DeeFuse commented 2 years ago

I have a question about coding style: I can calculate the required timer parameters at compile time or at runtime. The main difference is that the code, when compiled with -O1 or higher, will be identical, but at -O0 there is a huge difference. The drawback of the preprocessor calculation is readability. Which solution shall I include? test1 - preprocessor / test2 - runtime (or optimized): https://godbolt.org/z/z37E6ox6s

jedisct1 commented 2 years ago

Runtime.

DeeFuse commented 2 years ago

I'm thinking about disabling the function 'hydro_random_reseed' when building on an AVR unless the user explicitly enables it. If the user calls this function while running their own code, it could cause a lot of problems due to the reconfiguration of TIM1 and the ~4 second execution time.
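
For example, the guard could look like this (the macro name is just a placeholder, not an existing option):

#if !defined(__AVR__) || defined(LIBHYDROGEN_ENABLE_AVR_RESEED)
void
hydro_random_reseed(void)
{
    /* ...reconfigure TIM1, gather WDT jitter, mix it into the RNG state... */
}
#endif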