maidsafe / safe_network

Autonomi combines the spare capacity of everyday devices to form a new, autonomous, data and communications layer of the Internet
http://autonomi.com

bug (regression): updating a register can fail on local network with stable-2024.08.2.3 but still works using stable.2024-07-25 #2077

Open happybeing opened 2 months ago

happybeing commented 2 months ago

I have a bunch of scripts which I use to test my application (awe) on a local network. In brief, these create a local network and then upload a series of websites, some just a single version, but two involve uploading a series of about four versions.

When uploading multiple versions the following sequence is repeated to load and update the register:

(The above all happens in awe_website_versions.rs).

The scripts and code to do the above have been run tens if not >100 times without ever seeing the following error, which is happening when I try to update the second multi-version website, but not the first!

When the error occurs (and it occurs at the same point in repeated runs of these scripts with a new local network every time), the write_merging_branches_online() function fails with:

Failed to add XorName to register: Network(GetRecordError(RecordDoesNotMatch(c7ec9c(754072be1f575e7b94b97f21556067218383dd627d570314e1357d910f9592e9))))

If I use stable-2024.08.2.3 the above happens every time. If I use stable.2024-07-25 this has never happened.

Below are the safe_network crate versions I'm building against in each case:

# Generated using: awe-dependencies --branch stable-2024.08.2.3
sn_cli = { version = "0.94.1" }
sn_client = { version = "0.109.1" }
sn_peers_acquisition = { version = "0.4.2" }
sn_registers = { version = "0.3.17" }
sn_transfers = { version = "0.18.10" }
sn_protocol = { version = "0.17.7" }

# Generated using: awe-dependencies --branch stable.2024-07-25
# sn_cli = { version = "0.94.0" }
# sn_client = { version = "0.109.0" }
# sn_peers_acquisition = { version = "0.4.1" }
# sn_registers = { version = "0.3.16" }
# sn_transfers = { version = "0.18.9" }
# sn_protocol = { version = "0.17.6" }

What is strange to me is:

For all five Registers, including the three other Registers, I create the register and immediately write two values. This always succeeds. The error only occurs in one of the two Registers to which I subsequently try to write a third value, and it is always the same one.

maqi commented 2 months ago

2024.08.2.3 contains a breaking change in how the client gets a Register from the network, which is not supported by the current PROD-01. You will have to wait until all nodes have been updated for it to be supported.

Closing this issue for now as it is not relevant.

happybeing commented 2 months ago

Thanks Qi, I understand but it is still an issue until it's fixed. Keeping it open allows others to find it.

Right now I understand very few people will be hitting this, but I think the point is important and it's not a good idea to close an issue unless there's another place where someone else having the same problem will be able to find out why.

It's also relevant in that it highlights another problem: a few weeks before launch, releases are going out with issues of this kind.

I also expect you are under pressure to close issues as soon as possible because of the desire to see the end according to the plan. If so that's a mistake imo.

Thanks for your work @maqi, it is reassuring that you are involved in these very tricky areas. 🙏

loziniak commented 2 months ago

@happybeing, do you have this problem when the client and local nodes are built from the same version? As I understand it, @maqi's answer suggests that the issue comes from incompatible versions.

happybeing commented 2 months ago

It's not incompatible versions, it is a breaking change in that the register crate is ahead of the node crate since the recent update.

maqi commented 2 months ago

Hi, @happybeing

First, I was in a rush this morning, so I judged the issue and commented with only a partial understanding of your original question.

I have now re-opened the issue, as it may well show a new problem.

Meanwhile, it would help to confirm the issue by launching a new local testnet with all nodes upgraded to 2024.08.2.3, and running the awe app built against the same 2024.08.2.3, to see if the problem is reproducible. Thank you very much.

maqi commented 2 months ago

also @happybeing ,

When you hit that error of Failed to add XorName to register: Network(GetRecordError , would it be possible for you to collect: 1. the update history of the local register you have; 2. the result of a Client::get_register(), showing its update history as well?

The update history is the tree-structured diagram that you showed at https://github.com/maidsafe/safe_network/issues/2030#issuecomment-2299264585

Thank you very much

happybeing commented 2 months ago

I'm confused Qi.

If this is just a matter of waiting until the node catches up with the register crate, what's the purpose of your requests?

maqi commented 2 months ago

Sorry, @happybeing,

my original judgement of the issue was incorrect, which might have given you the wrong impression that this is an issue of mismatched versions between the nodes and the client.

As you mentioned, you used a local testnet. First, on a local testnet the client and nodes should always be using the same version. Second, 2024.08.2.3 should not contain any breaking change, even if the client uses this version while the nodes remain on the old version.

Hence I suggested restarting your local testnet to make sure the client and nodes are using the same version, 2024.08.2.3.

happybeing commented 2 months ago

Thanks Qi.

I don't need to redo anything because I know the client and testnet were both built to the same version.

I don't know if I will be able to assist you further as I'm stepping back for a while at least.

It was great to work with you briefly.

maqi commented 2 months ago

Hi, @happybeing,

Thanks for the clarification. If the client and testnet are always built with the same version, then I will check the other possibilities.

Really appreciate your contributions, and I also feel great to work with you as well. :)

happybeing commented 1 month ago

If the client and testnet are always built with the same version, then I will check the other possibilities.

I generate the crate versions from the safe_network crate using a script that takes the relevant tag, checks it out and generates the deps for my app's Cargo.toml from the safe_network Cargo.lock. The output of that is included in the OP.
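For readers who want to do something similar, here is a minimal sketch of that step. It assumes the Cargo.lock uses the usual `[[package]]` blocks with `name = "..."` and `version = "..."` lines; the function name and the list of wanted crates are illustrative and this is not the actual awe-dependencies script.

```rust
/// Build `name = { version = "x.y.z" }` lines for the wanted crates
/// from the text of a Cargo.lock file.
/// (Hypothetical helper, shown only to illustrate the idea.)
fn deps_from_lockfile(lock: &str, wanted: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    // Name of the package block currently being read, if any.
    let mut name: Option<String> = None;
    for line in lock.lines() {
        let line = line.trim();
        if line == "[[package]]" {
            // A new package block starts: forget the previous name.
            name = None;
        } else if let Some(v) = line.strip_prefix("name = ") {
            name = Some(v.trim_matches('"').to_string());
        } else if let Some(v) = line.strip_prefix("version = ") {
            if let Some(n) = name.as_deref() {
                if wanted.contains(&n) {
                    out.push(format!(
                        "{} = {{ version = \"{}\" }}",
                        n,
                        v.trim_matches('"')
                    ));
                }
            }
        }
    }
    out
}
```

Called with the lock file text and a list such as `&["sn_cli", "sn_client"]`, it returns lines in the same shape as those listed in the OP.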

Good luck.

happybeing commented 1 month ago

Here's a note to confirm that I have been running my tests successfully against stable.2024-07-25 many times since filing this issue.

I have just tried today's new release stable-2024.09.1.3 and can confirm that the issue described in the OP remains, which confirms that there has been a regression since stable.2024-07-25.

Below is the extract from my testnet-full script log. testnet-full starts a local testnet and then uses awe to upload and subsequently update several websites. Uploading creates a register and writes two entries to it. Updating a website retrieves the register and attempts to write another entry to it, and it is at this point that the error occurs - but not for every attempt to update a website.

As described in the OP, the error does not occur every time a website is updated, but appears to happen at the same point in the test script every time (which I find surprising and may be a useful clue).

Updating versions register 07a0da3efbd66d05582c98e20e2ba092c051cb88305e4ed6b5623c30d67a4f80aff1389d71cddeae8af667f08ebaa4ba91ab123b2d7c7d6647a03c9213df6850346adbf387e477fcdcbe382695c4af11
VersionsRegister::sync() - this can take a while...
VersionsRegister::sync() - ...done.
VersionsRegister::sync() - this can take a while...
VersionsRegister::sync() - ...done.
Failed to update website version: Failed to add XorName to register: Network(GetRecordError(RecordDoesNotMatch(2c1f97(8883a1665e21c08d50611c7629fb8acc4ae35a8b29bf599d9c9da5b1d8cb1cf1))))

Location:
    src/awe_website_versions.rs:410:28

maqi commented 1 month ago

Hi, @happybeing,

thx for the info supplied, really helpful.

I think here is why you are hitting this error:

This explains why the RecordDoesNotMatch error does not occur every time, but appears to happen at the same point in the test script: the chance of a mismatch increases as more ops are undertaken.

It would be very helpful, and much appreciated, if you could tweak the line at https://github.com/maidsafe/safe_network/blob/main/sn_client/src/register.rs#L845 to be

        let verification_cfg = GetRecordCfg {
            get_quorum: Quorum::One,
            retry_strategy: Some(RetryStrategy::Quick),
            target_record: None,
            expected_holders,
        };

You only need to rebuild your awe with this tweaked code. The local testnet can remain untouched.
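To illustrate why setting target_record to None can make this error go away, here is a simplified model. It assumes that, when target_record is set, the verification read only succeeds if the fetched record is identical to it. The types and the `verify` function below are hypothetical stand-ins, not the real sn_client or sn_networking code.

```rust
/// Hypothetical, simplified error type; not the real network error.
#[derive(Debug, PartialEq)]
enum GetError {
    RecordDoesNotMatch,
    NotFound,
}

/// Hypothetical stand-in for the verification config.
struct VerifyCfg<'a> {
    /// When set, the fetched bytes must equal this exactly.
    target_record: Option<&'a [u8]>,
}

/// Check a fetched record against the config.
fn verify(fetched: Option<&[u8]>, cfg: &VerifyCfg) -> Result<(), GetError> {
    let got = fetched.ok_or(GetError::NotFound)?;
    match cfg.target_record {
        // An exact target was requested and the holder returned something else.
        Some(expected) if expected != got => Err(GetError::RecordDoesNotMatch),
        // No target, or the target matched: accept the record.
        _ => Ok(()),
    }
}
```

In this model a holder whose copy differs from the client's local copy in any way fails the check when a target is given, and passes when target_record is None. The flip side is that passing no target checks less, so it is best seen as a diagnostic setting.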

happybeing commented 1 month ago

Thanks @maqi. That solved the issue on a local testnet, so I'm building a new release of awe with this change in the local safe_network/sn_client. Thank you! 👏

maqi commented 1 month ago

I should be thanking you for helping us verify and pin down this issue. Really appreciated.

maqi commented 1 month ago

Here is PR https://github.com/maidsafe/safe_network/pull/2103 trying to address this as a formal fix.

happybeing commented 1 month ago

@maqi I'm still seeing two problems related to updating registers, similar to the issue in the OP (problem 1), and another previous issue where I see different versions of a register at different times (problem 2). The behaviour has changed though in both cases.

I am still using the 'manual' patch which you suggested above to build my client. That patch appeared to fix this issue on a local network, but I am now testing against the public network with my client built using stable-2024.09.1.3 (with the patch mentioned).

Problem 1. Registers still not updating. The first issue is that registers are still failing to reflect changes although I am not getting the error described in the OP. I've seen this happen twice with different registers, each time created to store the awe-some-sites website. What happens is that the register is created, two entries are written and merged immediately, leaving a total of two entries. Not long after I wrote a third entry but it was not reflected when I accessed the register which continued to show only two entries for at least ten minutes. I left this and came back hours later to find that the register was now showing 3 entries and this remained the case each time I accessed it.

I wasn't sure how long it took to reflect the change so I set up a query command to check the register size every five minutes and wrote a fourth entry. After 24 hours the register is still showing only 3 entries. I tried adding another entry later the same day - again without error - and today it is still showing only 3 entries.
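For anyone wanting to reproduce that kind of check, a minimal sketch of a polling loop follows. It assumes a hypothetical closure `read_size` standing in for whatever returns the register's current entry count (for example by parsing the size line of awe inspect-register output); none of these names come from awe or sn_client.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Poll `read_size` until it reports at least `expected` entries,
/// giving up after `max_polls` attempts.
/// Returns the number of polls it took, or None if the count was never reached.
fn wait_for_entries<F>(
    mut read_size: F,
    expected: usize,
    interval: Duration,
    max_polls: u32,
) -> Option<u32>
where
    F: FnMut() -> usize,
{
    for poll in 1..=max_polls {
        if read_size() >= expected {
            return Some(poll);
        }
        // Only wait if another poll is still to come.
        if poll < max_polls {
            sleep(interval);
        }
    }
    None
}
```

With `interval` set to `Duration::from_secs(300)` this would mirror the five-minute check described above.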

The API indicates that the entries are written successfully every time, unlike the situation in the OP where an error is reported and the update does not happen.

All the above operations involved running my client on a VPS (creation, writing entries and then querying to see the number of entries every 5 minutes).

Problem 2. Register returns different numbers of entries. I've only seen this happen once so it is much less frequent than previously. While testing problem 1, I occasionally tried accessing the same register from my laptop (over mobile broadband) and once it returned the register but with only two entries.

You can see the status of the register live on the network yourself using awe inspect-register. The output below includes the 'audit' option which displays the register structure and shows that it only contains 3 nodes, one for each of the three entries:

$ awe inspect-register -ramd --include-files -e 1: a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc
Autonomi client initialising...
Connecting to the network using 25 peers
register    : a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc
owned by    : PublicKey(1759..b88c)
permissions : Writers({PublicKey(1759..b88c)})
app reg type: 5ebbbc..
size        : 3
audit       :
   current state is merged, 1 value:
   5fb227c3d914852bb7731a55f5feb1e4854ce007b1201e617597d6fb080362b0
entries 1 to 2:
entry 1 - fetching metadata at d05b3f5e8c085c8f3046d5859b627fb7de5c773c26cffcfcbc75364971121a90
DEBUG get_website_metadata_from_network() at d05b3f5e8c085c8f3046d5859b627fb7de5c773c26cffcfcbc75364971121a90
DEBUG autonomi_get_file()
DEBUG calling files_download.download_from()
DEBUG Ok() return
Retrieved 141 bytes
published  : 2024-09-16 11:35:16.862655237 UTC
directories: 1
files      : 1
total bytes: 2175
1510a27adde292bf39953e1d181fdf0253b238cf09f94013b7e0c4ada8c0d50d 2024-09-16 11:33:35 "/index.html" 2175 bytes
entry 2 - fetching metadata at 5fb227c3d914852bb7731a55f5feb1e4854ce007b1201e617597d6fb080362b0
DEBUG get_website_metadata_from_network() at 5fb227c3d914852bb7731a55f5feb1e4854ce007b1201e617597d6fb080362b0
DEBUG autonomi_get_file()
DEBUG calling files_download.download_from()
DEBUG Ok() return
Retrieved 143 bytes
published  : 2024-09-17 11:55:57.466132193 UTC
directories: 1
files      : 1
total bytes: 2319
e8ffe587101cfdbccbfa9736952aae933a5a33c726598ade34e609399eae7aa7 2024-09-17 11:55:29 "/index.html" 2319 bytes
======================
Root (Latest) Node(s):
[ 0] Node("0"..) Entry(5fb227c3d914852bb7731a55f5feb1e4854ce007b1201e617597d6fb080362b0)
======================
Register Structure:
(In general, earlier nodes are more indented)
[ 0] Node("0"..) Entry(5fb227c3d914852bb7731a55f5feb1e4854ce007b1201e617597d6fb080362b0)
  [ 1] Node("1"..) Entry(d05b3f5e8c085c8f3046d5859b627fb7de5c773c26cffcfcbc75364971121a90)
    [ 2] Node("2"..) Entry(5ebbbc4f061702c875b6cacb76e537eb482713c458b9d83c2f1e86ea9e0d0d0f)
======================

Here is the output of awe when it successfully updates that register, writing a new value of e823ee142c6dbb216c58e6d3c66847fead81a6381e6cc732d26e1bdf41e91047, which is nevertheless not present in the 'audit' output immediately above.

You can see that it is the correct register (a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba9900) queried above, and that the update is successful.

That update was done at 5pm Tuesday but the register output above shows it is still not being reflected by the network at 12:40 Wednesday.

Reading /home/safe/src/safe-browser/awe-sites/awe-some-sites-src/sites-community.txt
set LIST_NAME=Community Pioneer Websites
Reading /home/safe/src/safe-browser/awe-sites/awe-some-sites-src/sites-test.txt
set LIST_NAME=Test Websites
Inserting links and saving to /home/safe/src/safe-browser/awe-sites/awe-some-sites/content/index.html

=======================================================================================
upload_site(/home/safe/src/safe-browser/awe-sites/awe-some-sites, content)
---------------------------------------------------------------------------------------
Found register : a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc
in register file: /home/safe/src/safe-browser/aweb-addresses/public-network/awe-some-sites/register-address.txt

Updating on Autonomi public from: /home/safe/src/safe-browser/awe-sites/awe-some-sites/content
Autonomi client initialising...
Connecting to the network using 25 peers
Uploading website from: "/home/safe/src/safe-browser/awe-sites/awe-some-sites/content"
Files upload attempted previously, verifying 4 chunks
4 chunks were uploaded in the past but failed to verify. Will attempt to upload them again...
"/home/safe/src/safe-browser/awe-sites/awe-some-sites/content" will be made public and linkable
Splitting and uploading "/home/safe/src/safe-browser/awe-sites/awe-some-sites/content" into 4 chunks
**************************************
*          Uploaded Files            *
**************************************
Uploaded "index.html" to address a17af7a1f1004f9996dc3b5c78fc5607e428f7a5d61a21b3510870595a586245
Among 4 chunks, found 0 already existed in network, uploaded the leftover 4 chunks in 1 minutes 22 seconds
**************************************
*          Payment Details           *
**************************************
Made payment of NanoTokens(4) for 4 chunks
Made payment of NanoTokens(4) for royalties fees
New wallet balance: 0.000000058
web publish completed files: [("/home/safe/src/safe-browser/awe-sites/awe-some-sites/content/index.html", "index.html", ChunkAddress(a17af7))
]
WEBSITE CONTENT UPLOADED:
a17af7a1f1004f9996dc3b5c78fc5607e428f7a5d61a21b3510870595a586245 "/home/safe/src/safe-browser/awe-sites/awe-some-sites/content/index.html"
DEBUG publish_website_metadata() website_root '/home/safe/src/safe-browser/awe-sites/awe-some-sites/content'
Adding '/home/safe/src/safe-browser/awe-sites/awe-some-sites/content/index.html' as '/index.html'
wallet_dir: "/home/safe/.local/share/safe/client"
Paid 0.000000001+0.000000001 to store Website metadata, now uploading...
WEBSITE METADATA UPLOADED:
awm://e823ee142c6dbb216c58e6d3c66847fead81a6381e6cc732d26e1bdf41e91047
Updating versions register a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc
VersionsRegister::sync() - this can take a while...
VersionsRegister::sync() - ...done.
VersionsRegister::sync() - this can take a while...
VersionsRegister::sync() - ...done.
website_metadata added to register: e823ee142c6dbb216c58e6d3c66847fead81a6381e6cc732d26e1bdf41e91047
VersionsRegister::sync() - this can take a while...
VersionsRegister::sync() - ...done.

WEBSITE UPDATED (version 2). All versions available at XOR-URL:
awv://a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc

NOTE:
- To update this website, use 'awe update' as follows:

   awe update --update-xor a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc --website-root /home/safe/src/safe-browser/awe-sites/awe-some-sites/content

- To browse the website use 'awe awv://<XOR-ADDRESS>' as follows:

   awe awv://a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc

- For help use 'awe --help'

Metadata address added to:
  /home/safe/src/safe-browser/aweb-addresses/public-network/awe-some-sites/site-addresses.txt
Files addresses added to:
  /home/safe/src/safe-browser/aweb-addresses/public-network/awe-some-sites/file-addresses.txt
awe-update-builtins
url: 5ebbbc4f061702c875b6cacb76e537eb482713c458b9d83c2f1e86ea9e0d0d0f
Generating back-end /home/safe/src/safe-browser/awe/src-tauri/src/generated_rs/builtins_public.rs:
Clearing back-end /home/safe/src/safe-browser/awe/src-tauri/src/generated_rs/builtins_local.rs:
awe-update-builtins: SKIPPING GIT OPERATIONS - please copy /home/safe/src/safe-browser/awe/src-tauri/src/generated_rs/builtins_public.rs using vps-sync to-laptop and commit manually
Generating front-end /home/safe/src/safe-browser/awe/src/generated/builtins-public.js:
url: awv://a223f580ce058a3334028fbd3f2497502aae85e9ea703a61bd39d96f772d7b599759117f6e621ed5273acb1e2920ff56f812446a3bda96a2bcfd3eba99004a0a4017bc2147cbe3d2c882c1308afa00cc
Updated builtins at:
/home/safe/src/safe-browser/awe/src-tauri/src/generated_rs/builtins_public.rs
/home/safe/src/safe-browser/awe/src/generated/builtins-public.js

happybeing commented 1 month ago

^^ @loziniak

Have you used registers on the latest public network?

loziniak commented 1 month ago

No, I'm struggling with local still... :-P

maqi commented 1 month ago

Hi, @happybeing ,

thx for the further update and the detailed info provided.

I am still using the 'manual' patch which you suggested above to build my client

I'd suggest you use the above-mentioned PR 2103 in place of the manual patch. The previously suggested patch might mute some errors that should be raised to your awe app (it was just for quickly diagnosing and pinning down the issue, not meant for long-term use :) ). Because of that, it might give you the wrong impression that your update succeeded when it actually failed somewhere along the flow.

Register returns different numbers of entries

Does that number of entries refer to the update history? MerkleReg (used by Register internally) uses a DAG structure, which can generate different nodes (entries) and branches depending on the update path. As I understand it, as long as the root value (i.e. the one you get from the .read() function) matches the final expected value, you don't need to care about the update history (i.e. the number of entries), as it is expected to vary.
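A toy model may make that point about different update paths concrete. The sketch below is a hypothetical, heavily simplified DAG register written for illustration only; it is not the real MerkleReg and its method names are assumptions.

```rust
use std::collections::BTreeSet;

/// Toy DAG register: each node stores a value and the indices of its parents.
/// Hypothetical model for illustration only; not the real MerkleReg.
#[derive(Default)]
struct ToyReg {
    nodes: Vec<(String, Vec<usize>)>,
}

impl ToyReg {
    /// Append a value on top of the given parents and return the new node's index.
    fn write(&mut self, value: &str, parents: &[usize]) -> usize {
        self.nodes.push((value.to_string(), parents.to_vec()));
        self.nodes.len() - 1
    }

    /// Current value(s): nodes that no other node names as a parent.
    fn read(&self) -> BTreeSet<&str> {
        let superseded: BTreeSet<usize> = self
            .nodes
            .iter()
            .flat_map(|(_, parents)| parents.iter().copied())
            .collect();
        self.nodes
            .iter()
            .enumerate()
            .filter(|(i, _)| !superseded.contains(i))
            .map(|(_, (value, _))| value.as_str())
            .collect()
    }

    /// Size of the update history: every node ever written.
    fn history_len(&self) -> usize {
        self.nodes.len()
    }
}
```

If one replica writes v1, v2, v3 in a straight line, while another ends up with two concurrent v2 nodes that v3 then merges, both return the same single value from `read()` but report history lengths of 3 and 4 respectively. Whether that difference matters depends on whether the application reads only the latest value or walks the history.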

happybeing commented 1 month ago

As I understand, as long as the root value (i.e. the one you got from .read() function) matches the final expected value, you don't need to care about the update history (i.e. number of entries) ? as it is supposed to be vary ?

That may be the case for some uses but not in all cases. If you want a version history (e.g. versioned web, versioned file-system) then you need to access the history, not just the final merged value.

MerkleReg (used by Register internally) uses a DAG structure internally, which could generate different nodes (entries) / branches due to different update path.

I don't think that is the explanation in my use case but can't be sure. However, perhaps the issue is related to your first point - my code still using the patch. I was awaiting the merged PR before doing any more so we can see if this error goes away once that is in stable and the network is once again reset/updated.

The use cases for Registers, their API and the network implementation are still undefined, yet 'launch' is supposedly one month away. 🤷

happybeing commented 1 month ago

On a local testnet built against safe_network stable-2024.10.1.2 I am still getting the following error for some attempts to update a Register:

Failed to update website version: Failed to add XorName to register: Network(GetRecordError(RecordDoesNotMatch(0cb113(9f0430935c26b71d82f284a8bccdd9e28b6155bec7ff3c5c6d07b10fb38f4561))))

I attach the full log of my local test output which includes setting up the local testnet and then running awe to upload several websites. This includes both creating and in some cases updating the website. Thu 3 Oct 16:10:04 BST 2024-awelog-upload-local.gz

RolandSherwin commented 1 week ago

Hey @happybeing, does this PR https://github.com/maidsafe/safe_network/pull/2270 fix the issue for you?

happybeing commented 1 week ago

Thanks for the heads-up.

I don't have much time for testing and am not sure if my app will have been broken by recent EVM/API changes so may take a while to test but I hope to do so once it is in stable.