n-y-z-o / nyzoVerifier

Verifier for the Nyzo cryptocurrency
The Unlicense
73 stars 42 forks source link

Debug traces and Fix Sentinel Bug #39

Open EggPool opened 3 years ago

EggPool commented 3 years ago

This is my sentinel Fix I intended to clean up before proper release.

Since 608 came out, I'm pushing this to give feedback on what I found, but I will clean up more before asking for a possible merge. I also wanted to keep track of the various debug traces I added, that could prove useful for others in a similar context.

I'll edit further on giving more info on sensible points and discoveries.

Note that I'm not a java dev, so I'm not pretending my adjustments are optimum nor state of the art code. The fix however is proved to work on real world sentinels that were otherwise regularly freezing.

EggPool commented 3 years ago

A quick overview of the changes:

.gitignore : just to avoid uploading my ide settings

BalanceListManager.java

Left several traces commented out, for my reference, will be cleaned up.

Block.java

BlockManager.java

Moreover, this call was failing silently.
The refactor:

Message.java

Just a personal preference, log to std instead of err

Version.java

Self explicit.

SentinelController.java

Orange "uncertain" status covered 2 different causes. Made them explicit.

1/2, next one will cover Sentinel.java

EggPool commented 3 years ago

Changelog 2/2

Sentinel.java

Latest comment will be a tl;dr of the core sentinel fix

EggPool commented 3 years ago

What happened to failing sentinels since many versions:

EggPool commented 3 years ago

Last note:

I believe the latest 608 sentinel version fixes the bug in an indirect way: the code now requires the block signature to be valid before freezing the block.
Since this involves rebuilding the block buffer - hence tx - , this should catch the wrong data and not freeze them.

I do thing however most of my tweaks here are valid and could be merged, even if some are redundant.
Most added logs are not critical, however:

The reason why the block data is invalid - sometimes, on some sentinels - still needs to be dug, there may be a deeper hidden issue.

n-y-z-o commented 3 years ago

There's a lot to digest here. Your work is impressive, and you have a much deeper understanding of this issue than we do.

One observation that might be worth mentioning at this point: it appears that certain sentinels would have this problem frequently, while others did not experience it at all. If this was the case, it would suggest that the problem was a state issue on the sentinel. This means that it is likely that the blocks were sent correctly, but the missing transactions were stripped out by the sentinels upon block decoding.

EggPool commented 3 years ago

Thanks! I was lucky to have access to fast-failing sentinels, that helped hunt part of the issue.

While I have no clue where precisely this happens, I'm also inclined to think the issue is on the sentinel side of the pipe. I had setups with several sentinels tracking the same verifiers, or some overlapping sets.

When logging for wrong blocks, I first thought this could be because of layer-2 safety, and rogue nodes sending wrong data on purpose. But logs showed the wrong blocks came from my own in-cycle. I saw some logs where 3 consecutive in-cycle did "send" the same wrong block. Querying the verifiers from the python client just afterward did give the full correct block, but the sentinel saw 3 wrong blocks in a row. (3 blocks, very close - no other logs in between) No other sentinel, tracking the same verifiers, did freeze. Also, 606 versions were less stable than 595 for instance.

In my setup, I have some sentinels that never got stuck, and some very frequently (overlapping set). For my last tests, I used a fast-failing sentinel on 595 (failed once every 2 days on average) and updated it to 606. It was then failing from 20 min to 2 hours max.

So yes, I think we can safely assume it's not a verifier issue, but closer to the sentinel. Maybe some edge case related to managed verifiers threads and a race condition, network latency or cpu load. I'm thinking thread because It seems that several blocks received almost same time always fail together, while same block just slightly after are ok.

Since the balance hash check was effective, I aimed to release that fix first of all, and go on digging further for the root cause, with my very limited java experience. My setup, while not freezing anymore, still logs wrong blocks, so I can provide or add more traces from it if needed.

EggPool commented 3 years ago

Can confirm this is within validTransactions() Added more traces:

    private static List<Transaction> validTransactions(List<Transaction> transactions, long startTimestamp) {

        List<Transaction> validTransactions = new ArrayList<>();
        Set<ByteBuffer> signatures = new HashSet<>();
        long endTimestamp = startTimestamp + Block.blockDuration;
        for (Transaction transaction : transactions) {
            if ((transaction.getType() == Transaction.typeCoinGeneration || transaction.signatureIsValid()) &&
                    transaction.getTimestamp() >= startTimestamp && transaction.getTimestamp() < endTimestamp) {

                ByteBuffer signature = ByteBuffer.wrap(transaction.getType() == Transaction.typeCoinGeneration ?
                        new byte[FieldByteSize.hash] : transaction.getSignature());
                if (!signatures.contains(signature)) {
                    signatures.add(signature);
                    validTransactions.add(transaction);
                } else {
                    System.out.println("X:TX mismatch:dup " + transaction);
                }
            } else {
                System.out.println("X:TX mismatch:ts=" + startTimestamp+ ", "+ transaction);
                if (!transaction.signatureIsValid()) {
                    System.out.println("X:TX mismatch: Invalid sig");
                }
                if (transaction.getTimestamp() < startTimestamp) {
                    System.out.println("X:TX mismatch: ts <");
                }
                if (transaction.getTimestamp() >= endTimestamp) {
                    System.out.println("X:TX mismatch: ts >=");
                }
            }
        }

        return validTransactions;
    }

Logs filtered out |grep mismatch:

X:TX mismatch:ts=1620030664000, [Transaction:type=2,timestamp=1620030664000,sender=9dad...eea4,receiver=0000...0000,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch:ts=1620030664000, [Transaction:type=2,timestamp=1620030664000,sender=9dad...eea4,receiver=0000...0000,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch:ts=1620030664000, [Transaction:type=2,timestamp=1620030664000,sender=9dad...eea4,receiver=0000...0000,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch:ts=1620030664000, [Transaction:type=2,timestamp=1620030664000,sender=9dad...eea4,receiver=0000...0000,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:!!BLHash mismatch 5002...25c3 vs block [Block: v=2, height=11890552 txcount=1, hash=2192...0b2b, id=9dad...eea4, blh=5394...0d39]
X:!!BLHash mismatch 5002...25c3 vs block [Block: v=2, height=11890552 txcount=1, hash=2192...0b2b, id=9dad...eea4, blh=5394...0d39]
X:!!BLHash mismatch 5002...25c3 vs block [Block: v=2, height=11890552 txcount=1, hash=2192...0b2b, id=9dad...eea4, blh=5394...0d39]
X:!!BLHash mismatch 5002...25c3 vs block [Block: v=2, height=11890552 txcount=1, hash=2192...0b2b, id=9dad...eea4, blh=5394...0d39]
EggPool commented 3 years ago

Added more verbose tx logging (signature as well as data) and log in SignatureUtil.java signatureIsValid()

Next event is

X:TX mismatch:ts=1620035564000, [Transaction:type=2,timestamp=1620035564000,sender=faa6...744c,receiver=cb13...f7a3,data=6d65...6572,sig=a34a...cb07,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch:ts=1620035564000, [Transaction:type=2,timestamp=1620035564000,sender=faa6...744c,receiver=cb13...f7a3,data=6d65...6572,sig=a34a...cb07,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:!!BLHash mismatch bcc0...af3f vs block [Block: v=2, height=11891252 txcount=1, hash=e1d4...2215, id=faa6...744c, blh=7c56...3ffe]
X:!!BLHash mismatch bcc0...af3f vs block [Block: v=2, height=11891252 txcount=1, hash=e1d4...2215, id=faa6...744c, blh=7c56...3ffe]

The fail is not because of the exception in the method (would have been logged).
Means it's in the signature.verifyOneShot() call. Logging the signature object when failing shows "Signature object: NONEwithEdDSA

Tx logs show the tx has proper sig and data, so would point to the "signature" object itself (edDSAEngine) being wrongly initialized, or being corrupted by another thread before entering the synchronized portion.

Trying now to synchronize the whole method instead of just the current block. May not be best thing to do for production, but will validate/invalidate a thread related issue.

EggPool commented 3 years ago

I hope the flood is not an issue.
Since this PR is open, as well feed it live so everyone can follow and benefit from logs.

No more luck with synchronized signatureIsValid() method.

X:TX mismatch:ts=1620039764000, [Transaction:type=2,timestamp=1620039764000,sender=d025...da72,receiver=58a9...056a,data=6d65...6572,sig=fee2...f003,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch:ts=1620039764000, [Transaction:type=2,timestamp=1620039764000,sender=d025...da72,receiver=58a9...056a,data=6d65...6572,sig=fee2...f003,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch:ts=1620039764000, [Transaction:type=2,timestamp=1620039764000,sender=d025...da72,receiver=58a9...056a,data=6d65...6572,sig=fee2...f003,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch:ts=1620039764000, [Transaction:type=2,timestamp=1620039764000,sender=d025...da72,receiver=58a9...056a,data=6d65...6572,sig=fee2...f003,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch:ts=1620039764000, [Transaction:type=2,timestamp=1620039764000,sender=d025...da72,receiver=58a9...056a,data=6d65...6572,sig=fee2...f003,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:!!BLHash mismatch 106e...1d46 vs block [Block: v=2, height=11891852 txcount=1, hash=c6c9...b149, id=d025...da72, blh=1392...b2db]
X:!!BLHash mismatch 106e...1d46 vs block [Block: v=2, height=11891852 txcount=1, hash=c6c9...b149, id=d025...da72, blh=1392...b2db]
X:!!BLHash mismatch 106e...1d46 vs block [Block: v=2, height=11891852 txcount=1, hash=c6c9...b149, id=d025...da72, blh=1392...b2db]
X:!!BLHash mismatch 106e...1d46 vs block [Block: v=2, height=11891852 txcount=1, hash=c6c9...b149, id=d025...da72, blh=1392...b2db]
X:!!BLHash mismatch 106e...1d46 vs block [Block: v=2, height=11891852 txcount=1, hash=c6c9...b149, id=d025...da72, blh=1392...b2db]

An alternative cause could then be a corrupted identifierToSignatureMap (however now synchronized as well since adds/removes are in the synchronized method). I'll need to dig how the edDSAEngine works to move on.

I still think threading has something to do with it, since that's a difference between verifiers (sig checks are sequential) and sentinel (one thread per managed verifier, // sig checks)

EggPool commented 3 years ago

From what I can see so far, failing then successful sig verifications get the same signedbytes and same signaturebytes as input, as well as same verifier array.

So this would point to the "signature" edDSAEngine object, outside of Nyzo codebase.

EggPool commented 3 years ago

Going on, got rid of the identifierToSignatureMap to make sure, creating a new edDSAEngine for every sig check in the synchronized method.

Still invalid sigs.
I'll need to further dig signedBytes buffer and compare with final txs to make sure if this a crypto engine or data issue.

n-y-z-o commented 3 years ago

Your debugging here is really impressive. It makes a lot of sense that the failure is in transaction signature validation, though I hadn't even considered it.

Also, your removal of identifierToSignatureMap and creation of a new object for each signature was an excellent, thoughtful next step.

I agree that this does point to a possible thread contention issue in the EdDSA library. The fact that we didn't see this problem almost at all further suggests this possibility, as we have never managed more than 10 verifiers per sentinel.

Nyzo currently uses 0.2.0 of the EdDSA library (compile('net.i2p.crypto:eddsa:0.2.0') in build.gradle). We specified a fixed version instead of allowing upgrade to the latest version for stability and security purposes. However, the latest version is 0.3.0, and the Important changes of this version includes: The library has been extensively profiled for contention issues in a multi-threaded environment. The only remaining potential contention is in EdDSANamedCurveTable.defineCurve(), which will be rarely called (https://github.com/str4d/ed25519-java).

It's quite possible that this problem is manifesting all over the code, but other instances all have protections that simply discard data and cause a silent failure. As transaction volume scales up in the system, a problem of false negatives in signature validation could be disabling for the system, so this is certainly worth tracking down.

n-y-z-o commented 3 years ago

Here's a script I wrote to test whether multithreaded signature verification, on its own, was a problem. So far, my tests have not revealed any false-invalid signatures. Appropriate import statements are needed for this, but it should use only classes available in the Nyzo Java codebase.

public class SignatureThreadingTest {

    private static final int numberOfKeys = 100;
    private static final int dataBufferLength = 512;
    private static final int numberOfThreads = 100;
    private static final int maximumIterationsPerThread = 10000;

    public static void main(String[] args) {

        // Make the data buffers. Sign each one with a different key.
        List<byte[]> identifiers = new ArrayList<>();
        List<byte[]> dataBuffers = new ArrayList<>();
        List<byte[]> signatures = new ArrayList<>();
        Random random = new Random();
        for (int i = 0; i < numberOfKeys; i++) {
            byte[] key = KeyUtil.generateSeed();
            byte[] dataBuffer = new byte[dataBufferLength];
            random.nextBytes(dataBuffer);
            byte[] signature = SignatureUtil.signBytes(dataBuffer, key);

            identifiers.add(KeyUtil.identifierForSeed(key));
            dataBuffers.add(dataBuffer);
            signatures.add(signature);
        }
        LogUtil.println("generated keys and signatures");

        // Start threads for signing. Each thread will continue until a bad signature is found or the maximum number of
        // iterations is reached.
        AtomicInteger totalIterations = new AtomicInteger(0);
        AtomicBoolean badSignatureFound = new AtomicBoolean(false);
        AtomicInteger numberOfThreadsProcessing = new AtomicInteger(numberOfThreads);
        for (int i = 0; i < numberOfThreads; i++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int i = 0; i < maximumIterationsPerThread && !badSignatureFound.get(); i++) {
                        // Verify a random signature.
                        int index = random.nextInt(numberOfKeys);
                        boolean signatureIsValid = SignatureUtil.signatureIsValid(signatures.get(index),
                                dataBuffers.get(index), identifiers.get(index));
                        if (!signatureIsValid) {
                            badSignatureFound.set(true);
                        }

                        // Increment total iterations. Display progress periodically.
                        int currentIterations = totalIterations.incrementAndGet();
                        if (currentIterations % 1000 == 0) {
                            LogUtil.println("iterations: " + currentIterations);
                        }
                    }

                    numberOfThreadsProcessing.decrementAndGet();
                }
            }).start();
        }

        // Wait for all threads to complete.
        while (numberOfThreadsProcessing.get() > 0) {
            ThreadUtil.sleep(1000L);
        }

        // Show the total number of iterations and whether a bad signature was found.
        System.out.println("total iterations: " + totalIterations);
        System.out.println("bad signature found: " + badSignatureFound.get());
    }
}
EggPool commented 3 years ago

Thanks!

On my side, I'll try to log full data of failed signatures (data+sig) from sentinel, then compare with the same transactions fetched for a node, to make sure the buffer to sign is not corrupted. So far I only checked with a partial hex view, not full bytes.

May take some days given my current work load.