lbryio / lbrycrd

The blockchain that provides the digital content namespace for the LBRY protocol

Implement a hard fork to normalize claim names #159

Closed: lbrynaut closed this 5 years ago

lbrynaut commented 6 years ago

This is a complete rewrite of #102 and replaces it: it is much more complete and handles situations that the original does not.

This PR contains a lot of changes, and I expect a detailed review will take some time to ensure correctness. I also expect the reproducible_build script will need further modification (it likely won't work as-is due to the added ICU dependency).

@kaykurokawa The "claimtriebranching_hardfork_disktest" unit test does not seem to work with this PR; if you have a second to see what's going on there, it would help. It came up after things were working, when I rebased to latest, so it's commented out for now.

EDIT: @kaykurokawa This last comment is no longer true, so can be ignored now.

lbrynaut commented 6 years ago

Addresses #65

WIP regarding travis/dependency addition.

Updated.

lbrynaut commented 6 years ago

Updated.

kaykurokawa commented 6 years ago

I opened a separate issue to discuss how character normalization should be done: https://github.com/lbryio/lbrycrd/issues/204 .

It seems the method in this PR may not be optimal, but we can discuss that separately since I think it will not affect the bulk of the changes here.

BrannonKing commented 6 years ago

My first attempt at compiling this branch gave me this error:

../libtool: line 7485: cd: auto/lib: No such file or directory
libtool:   error: cannot determine absolute directory name of 'auto/lib'

I am using system-installed boost (with the OpenSSL 1.1 PR), as I haven't been able to make the reproducible_build.sh work for me since July (on Bionic). After getting the above error, I validated that "auto" was the path for ICU, googled a bit, and read through configure.ac. I decided to add this to the configure command:

--with-icu=`pkg-config icu-i18n --variable=prefix`

That did alleviate the issue. It seems that configure.ac is not correctly falling back to pkgconfig in the situation where --with-icu is omitted.

lbrynaut commented 6 years ago

> That did alleviate the issue. It seems that configure.ac is not correctly falling back to pkgconfig in the situation where --with-icu is omitted.

Hrm, odd. It should be a separate case in the configure. I was focused on getting the reproducible builds going because we don't want shared linkage for boost or icu (at least for the distributed binaries), as it's error-prone and system-dependent.

lbrynaut commented 6 years ago

@BrannonKing reported a performance issue, and it's true: this is slower than it needs to be. Correctness first, then performance, is how I've always operated ;-)

Anyway, I've already got a partial fix, but will be updating when I think it's ready.

BrannonKing commented 6 years ago

My experience with this branch thus far (with fork height at 1M):

  1. I originally ran it with my existing mainnet data (including txindex). I was told that I needed to reindex.
  2. I let the reindex run over the weekend; it was soon obvious that it needed a lot of time. It took 50+ hours.
  3. I stopped the server so I could set the fork block height, recompiled, and reran. I was again told that I needed to reindex.
  4. I then fired up a lbrycrdd build from a different branch. It crashed with this error: main.cpp:2124: bool DisconnectBlock(const CBlock&, CValidationState&, const CBlockIndex*, CCoinsViewCache&, CClaimTrieCache&, bool*): Assertion 'pindex->GetBlockHash() == trieCache.getBestBlock()' failed.
lbrynaut commented 6 years ago

> My experience with this branch thus far (with fork height at 1M):

@BrannonKing I dusted off the cobwebs; please test and find more (fork-related) issues.

BrannonKing commented 6 years ago

The reindex issue is resolved. However, I was unable to run past the fork height. I modified the fork height to occur within the next few minutes, but it never advanced past the block one before my fork height. I waited about 20 minutes. This is on mainnet. I'm seeing a lot of this error:

EXCEPTION: 15dbwrapper_error       
Database I/O error       
lbrycrd in ProcessMessages()
kaykurokawa commented 6 years ago

> The reindex issue is resolved.

Yes, same for me. The previous PR told me I needed to reindex when starting with mainnet data.

> I modified the fork height to occur within the next few minutes, but it never advanced past the block one before my fork height. I waited about 20 minutes. This is on mainnet.

Brannon, that sounds like expected behavior, because the fork happened on your codebase but not on mainnet.

BrannonKing commented 6 years ago

If I try to run on mainnet with a fork height that is already past, I crash with this error:

EXCEPTION: 15dbwrapper_error       
Database I/O error       
lbrycrd in AppInit() 
kaykurokawa commented 6 years ago

I think we should discuss what happens to bytes that are invalid UTF-8. Because UTF-8 is a variable-width encoding, not all byte sequences are valid UTF-8. It seems the normalization function we have now will parse invalid bytes without throwing an error. I'm not sure exactly what the logic is, but if I pass an invalid byte like 0xFF to the normalization function, it returns 0xFF back because it hits the line below:

return (normalized.size() >= name.size()) ? normalized : name;

The variable "normalized" was actually empty when I inspected it. I created this unit test gist for testing purposes; you can check it out: https://gist.github.com/kaykurokawa/2c35843eb09da2bc8f31ffac46de4099

So two questions. First, should we attempt to normalize invalid bytes at all, or should we reject them from ever entering the claimtrie? Second, if we are normalizing invalid bytes, is the current method correct (is it possible that invalid bytes could be accidentally normalized into valid bytes)?
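
For concreteness, here is a minimal reconstruction of the behavior described above (`normalizeWithFallback` is an illustrative name for the isolated logic, not a function from this PR):

```cpp
#include <cassert>
#include <string>

// The quoted fallback line, isolated: if normalization produced fewer
// bytes than the input (e.g. it came back empty), return the raw input.
std::string normalizeWithFallback(const std::string& name,
                                  const std::string& normalized)
{
    return (normalized.size() >= name.size()) ? normalized : name;
}

int main()
{
    // 0xFF can never appear in well-formed UTF-8, yet it passes through:
    // normalization yields an empty string, so the raw byte is returned.
    const std::string invalid("\xFF", 1);
    assert(normalizeWithFallback(invalid, std::string()) == invalid);
    return 0;
}
```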

BrannonKing commented 6 years ago

If our standard is UTF-8, we should reject anything that is not valid UTF-8. That's my vote.
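
A minimal sketch of such a gate, assuming ICU (which this PR already pulls in); the function name and the `u_strFromUTF8` preflighting approach are illustrative choices, not code from the PR:

```cpp
#include <string>
#include <unicode/ustring.h>
#include <unicode/utypes.h>

// Reject claim names that are not well-formed UTF-8 (sketch only).
bool isValidUTF8(const std::string& name)
{
    UErrorCode status = U_ZERO_ERROR;
    int32_t destLen = 0;
    // Preflight: with a null destination buffer the call only scans the
    // input. Ill-formed UTF-8 sets U_INVALID_CHAR_FOUND; a clean scan sets
    // U_BUFFER_OVERFLOW_ERROR, which is expected when preflighting.
    u_strFromUTF8(nullptr, 0, &destLen, name.data(),
                  static_cast<int32_t>(name.size()), &status);
    return status != U_INVALID_CHAR_FOUND;
}
```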

BrannonKing commented 6 years ago

I tried to test this in a simple fashion this morning with no luck:

  1. Remove ~/.lbrycrd/regtest
  2. Compile with regtest normalization fork height at 20.
  3. Run the daemon: ./src/lbrycrdd -server -txindex -regtest -walletbroadcast=0
  4. Generate 30 blocks: ./lbrycrd/src/lbrycrd-cli -regtest generate 30
  5. Get this error:
    error code: -1
    error message:
    Database I/O error
lbrynaut commented 6 years ago

> I tried to test this in a simple fashion this morning with no luck:
>
> 1. Remove `~/.lbrycrd/regtest`
> 2. Compile with regtest normalization fork height at 20.
> 3. Run the daemon: `./src/lbrycrdd -server -txindex -regtest -walletbroadcast=0`
> 4. Generate 30 blocks: `./lbrycrd/src/lbrycrd-cli -regtest generate 30`
> 5. Get this error:
>    error code: -1
>    error message:
>    Database I/O error

@BrannonKing Thanks, give it a go again, this is corrected.

lbrynaut commented 6 years ago

> So two questions. First, should we attempt to normalize invalid bytes at all, or should we reject them from ever entering the claimtrie? Second, if we are normalizing invalid bytes, is the current method correct (is it possible that invalid bytes could be accidentally normalized into valid bytes)?

@kaykurokawa Good questions. There are a few issues that need addressing:

1) Normalization can fail
2) Lower-casing can fail

(Generically, we call both steps normalization in the code.)

The general approach I was going for is that if either step fails, we fall back to our current behavior (which is to just treat the bytes as bytes in the claimtrie). Maybe that was a bad assumption about what we need, or maybe it's not working as intended (although it appears to me that it is).

Glad we're finally doing this kind of review, though; I did expect this process to take some time, and we're getting to the hard parts. Open to suggestions. I'm not sure rejecting all non-UTF-8 input is the correct solution, as it places a larger burden on the applications using lbrycrd, which may not care about this feature.

Either way, I think we can all agree that consistency matters most: the same byte or code-point sequences must be stored consistently in the claimtrie.
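
For concreteness, here is roughly the shape of that two-step-plus-fallback approach (a sketch under assumed names, using ICU's Normalizer2; the PR's actual normalization form and locale handling may differ):

```cpp
#include <string>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Sketch: normalize and lower-case a claim name; on any failure, fall
// back to treating the bytes as bytes (the pre-fork behavior).
std::string normalizeClaimName(const std::string& name)
{
    UErrorCode status = U_ZERO_ERROR;
    // NFD is a placeholder; the actual form is under discussion in #204.
    const icu::Normalizer2* norm = icu::Normalizer2::getNFDInstance(status);
    if (U_FAILURE(status))
        return name; // step 1 (normalization) unavailable

    icu::UnicodeString uname = icu::UnicodeString::fromUTF8(name);
    icu::UnicodeString unorm = norm->normalize(uname, status);
    if (U_FAILURE(status))
        return name; // step 1 failed

    unorm.toLower(); // step 2 (lower-casing); locale handling elided

    std::string normalized;
    unorm.toUTF8String(normalized);

    // If the result lost bytes (e.g. invalid UTF-8 did not survive the
    // conversion), keep the original byte sequence.
    return (normalized.size() >= name.size()) ? normalized : name;
}
```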

BrannonKing commented 6 years ago

I'm getting close to having a solution that doesn't require keeping two tries in memory. As part of that work, I've been narrowing in on two rules:

  1. You cannot normalize a string inside a method that takes a partial key; you run the risk of losing data.
  2. You cannot normalize inside a claimtrie method called from the claimtriecache; they aren't guaranteed to be on the same block height. (We could possibly make this work by carefully choosing where to toggle the claimtrie's normalization handling on and off, but I haven't seen a need for it yet; see the sketch below.)
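
To make rule 2 concrete, here is one possible shape (class and member names are illustrative, not the PR's): the cache owns the fork decision, so the trie below it only ever sees already-adjusted full keys and never normalizes anything itself.

```cpp
#include <string>

std::string normalizeClaimName(const std::string& name); // e.g. the sketch above

// Illustrative only: the cache knows the height it operates at, so it can
// decide whether the normalization fork is active; the underlying trie
// treats keys as opaque bytes and never re-normalizes partial keys.
class ClaimTrieCacheSketch
{
public:
    ClaimTrieCacheSketch(int height, int normalizationForkHeight)
        : nHeight(height), nForkHeight(normalizationForkHeight) {}

    // Every public entry point adjusts the full name exactly once.
    std::string adjustedName(const std::string& name) const
    {
        return nHeight < nForkHeight ? name : normalizeClaimName(name);
    }

private:
    int nHeight;      // height this cache is operating at
    int nForkHeight;  // activation height of the normalization fork
};
```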
BrannonKing commented 6 years ago

@bvbfan, excellent analysis. With our work on #44, it appears that we achieved independence from methods taking a name in rpc/claimtrie.cpp. I think that is critical to your suggestion that we don't do string normalization in the CClaimTrie (instead, we do it all in the "cache"). I really like the suggestion! I think that puts a high priority on getting #44 merged and rebasing our normalization on top of it.

BrannonKing commented 5 years ago

I did finally get a green on the Linux compilation again. Some notes on my issues with the reproducible_build.sh:

  1. The ICU_PREFIX was being exported to the child makefiles but wasn't always set to a real value when relying on pkg-config, which was causing issues. I changed the child makefiles to rely wholly upon ICU_LIBS and ICU_CPPFLAGS, which should always be set.
  2. I could not get ICU v55 to compile statically, so I upped it to v57. Our current version of Boost (v59) would not compile with newer versions of ICU.
  3. The script sets "-e", which makes it exit whenever any command returns nonzero. It also sets pipefail, which makes a pipeline keep the worst exit status from any of its commands. The flag handling in the wait method was always returning 1 on my machine, causing the whole script to exit. I ditched the flag usage there in favor of a subshell.
  4. The parameters for the Boost b2 command weren't being grouped appropriately, so I put them in quotes. This made it so that I could not make that call from the "background" function; I'm hoping someone else has a solution for that. The background function seems to run it fine, it just doesn't find the ICU deps when called from there, and I could not figure out why. I did add a grep check to ensure that the ICU deps are found.
  5. Because I'm now compiling Boost without the background minute message, TravisCI times out at 10 minutes when compiling Boost afresh. I don't have a solution for this yet; it wants output every 10 minutes, and it's ridiculous that the timeout is not configurable. Instead, we push the output to a log file for each dependency build. I'm not sure that's actually a desirable feature. The other side effect of that is that we hide any error reported by our script behind a post-dump of 200 or 1000 lines of the log file.
  6. The TravisCI caching keeps the build folder around. Our script was checking whether the dependency's parent folder existed there; if it did, it would not run make on that dependency. Hence, if you fail on a dependency, it won't get rebuilt because of the TravisCI cache. Not only that, but build/boost was always there -- even if I deleted the cached file from TravisCI. I had to change the script to always build dependencies, which is fast for all except openssl, which thinks it's cool to rewrite a whole bunch of source files with every configure call.
  7. The OSX build now fails with this error: "Too many levels of symbolic links" while compiling openssl a second time. I don't have a solution for that yet.
  8. Statically linking ICU requires -fPIC; I had to add that to the ICU build.
  9. The "clone" flag that didn't do anything is now gone.
  10. The reproducible_build script originally used pkg-config to find ICU, but that could return a system installation. It's now hard-coded to the copy pulled by the script.

The advantage of Docker is that you don't have to recompile the dependencies every time; we can stop wasting time on that. I feel strongly that we need to move in that direction.

BrannonKing commented 5 years ago

#235 supersedes this PR.