holochain / holochain-rust

DEPRECATED. The Holochain framework implemented in Rust with a redux-style internal state model.
GNU General Public License v3.0

UnresolvedDependencies error should not actually be an 'err' #1452

Closed · Connoropolous closed this issue 5 years ago

Connoropolous commented 5 years ago

While running nix-shell --run hc-app-spec-cluster-test with the node count bumped up to 8, we start seeing some erroneous behaviours.

It seems to be affecting both workflow/hold_entry and workflow/hold_link.

Some nodes successfully Hold the entry or link, while others don't. The ones that don't show:

[C2] 2019-05-29 20:14:10 ThreadId(1):test-1: err/workflow/hold_entry: QmUFxhsmL1krv59jL9rRQBKj8Jn311Tpz95KjR7zKEariU is NOT valid! UnresolvedDependencies([HashString("HcSCitO5WWuzmbyf7sfBZp7Oq5Y58Wkq7wvMA4gP4r364ideWxQkpHjko4ndrni"), HashString("QmWY98zemN4Mi9A9SgmC8vQufaQUf5a13Xe81PfWDvvnn9")])

...

[C3] 2019-05-29 20:14:10 ThreadId(1):test-1: err/workflow/hold_entry: QmUFxhsmL1krv59jL9rRQBKj8Jn311Tpz95KjR7zKEariU is NOT valid! UnresolvedDependencies([HashString("HcSCitO5WWuzmbyf7sfBZp7Oq5Y58Wkq7wvMA4gP4r364ideWxQkpHjko4ndrni"), HashString("QmWY98zemN4Mi9A9SgmC8vQufaQUf5a13Xe81PfWDvvnn9")])

...

[C4] 2019-05-29 20:14:10 ThreadId(1):test-1: debug/workflow/hold_link: invalid! UnresolvedDependencies([HashString("HcSCitO5WWuzmbyf7sfBZp7Oq5Y58Wkq7wvMA4gP4r364ideWxQkpHjko4ndrni"), HashString("QmWY98zemN4Mi9A9SgmC8vQufaQUf5a13Xe81PfWDvvnn9")])
2019-05-29 20:14:10 ThreadId(1):test-1: err/workflow/hold_entry: QmUFxhsmL1krv59jL9rRQBKj8Jn311Tpz95KjR7zKEariU is NOT valid! UnresolvedDependencies([HashString("HcSCitO5WWuzmbyf7sfBZp7Oq5Y58Wkq7wvMA4gP4r364ideWxQkpHjko4ndrni"), HashString("QmWY98zemN4Mi9A9SgmC8vQufaQUf5a13Xe81PfWDvvnn9")])

...

[C7] 2019-05-29 20:14:10 ThreadId(1):test-1: err/workflow/hold_entry: QmUFxhsmL1krv59jL9rRQBKj8Jn311Tpz95KjR7zKEariU is NOT valid! UnresolvedDependencies([HashString("HcSCitO5WWuzmbyf7sfBZp7Oq5Y58Wkq7wvMA4gP4r364ideWxQkpHjko4ndrni"), HashString("QmWY98zemN4Mi9A9SgmC8vQufaQUf5a13Xe81PfWDvvnn9")])

My test expects 16 successful Hold actions, but we only see 12, corresponding to the 4 that failed.

Reproduce

Change the following line in /holonix/app-spec-cluster/src/test.nix:

( EMULATION_HOLOCHAIN_BIN_PATH=./.cargo/bin/holochain node ./app_spec/cluster_test/index.js 2 )

to

( EMULATION_HOLOCHAIN_BIN_PATH=./.cargo/bin/holochain node ./app_spec/cluster_test/index.js 8 true )

Also change the 5-second timeout in /app_spec/cluster_test/index.js from 5000 to 20000 to give the test more time to complete and show the results.

Then run nix-shell --run hc-app-spec-cluster-test

Connoropolous commented 5 years ago

Tagging @ddd-mtl @dymayday @struktured @lucksus @AshantiMutinta

Can we please prioritize at least looking into this issue? I am not able to say where exactly the issue lands: core, n3h, or both.

Connoropolous commented 5 years ago

Even @maackle and @thedavidmeister might have insight, given the note about "UnresolvedDependencies", which I believe was a topic of discussion between you.

jmday commented 5 years ago

Thanks @Connoropolous. I've created a cousin issue in lib3h at

https://github.com/holochain/lib3h/issues/83

and requested that interaction on this issue be done here.

Connoropolous commented 5 years ago

At this time the issue relates to n3h, not lib3h, but all good!

ddd-mtl commented 5 years ago

Can we all, as a process, have the Core team investigate issues first before duplicating them in lib3h? I have no idea what debug/workflow/hold_link: invalid! UnresolvedDependencies means, as it's a Core thing. So maybe it's a Core-only issue? Once the Core team establishes that the problem is related to networking, I'd be glad to help. Doing it the other way around just feels counter-productive and brings noise to lib3h's issue board.

jmday commented 5 years ago

Thanks for the clarification regarding n3h applicability, @Connoropolous. The networking team will be placing n3h tasks within the lib3h issues/kanban that it's using, so we can manage both n3h and lib3h issues in the same place.

jmday commented 5 years ago

Thanks @ddd-mtl for indicating you don't have knowledge of the UnresolvedDependencies message.

Since getting n3h working for the needs of HC, Holo, hApps, Growth, and Community is on the critical path, AND since Connor wasn't able to determine whether core, n3h, or both are related to this error, it seemed appropriate to bring visibility to both teams in their chosen tools.

Your comment of "maybe it's a Core-only issue?" suggests that uncertainty remains about whether n3h, or the n3h + core integration, has some impact on this error. Where there are remaining uncertainties like this on a critical-path item, it might be best if both teams retain visible awareness of the bug until it is resolved, as both may be able to contribute important and timely insight.

lucksus commented 5 years ago

I see different behavior when trying to reproduce. First run was successful with "all nodes holding all entries..".

The 2nd and 3rd runs don't exit, and I see lots of:


[C1] 2019-05-30 13:10:50 ThreadId(1):test-1: debug/scheduled_jobs: tick

[C4] 2019-05-30 13:10:50 ThreadId(1):test-1: debug/scheduled_jobs: tick

[C3] 2019-05-30 13:10:50 ThreadId(1):test-1: debug/scheduled_jobs: tick

I do not see "unresolved dependencies", which BTW should not be an error, since it can be recovered from by retrying the validation, which is what happens.

But instead I see this several times:

[C1] warn/consistency: Publishing address that was not previously committed

I suppose this is an n3h log, @ddd-mtl, @neonphog?

lucksus commented 5 years ago

Ok, that is clearly a different problem and might be related to the persisted state of the conductor and n3h getting out of sync. I was entering nix-shell and then rerunning the cluster test, which reproducibly leads to errors on the 2nd attempt and onwards.

If I enter a clean nix-shell or run it with nix-shell --run hc-app-spec-cluster-test, it works every time without error. Here is my log of the consistently working run: cluster-8.log

lucksus commented 5 years ago

I got bold and tried with 16 nodes, which fails: cluster-16.log

I see this at the end, which seems to be the cause of the test locking up:

!C13! (@realmode@) [e] (SCiO) #### CHECK FAILED for (agent) "HcSCjwQ8aBP449S78mBAg6rZXj4fkg6538KQci3jUqjqsqtv4NU96xWdZM69Kya" + (DNA) "QmU9Q5dtg17gmUEayBwAfr4rT7L9R4PcSGUS4rVjF8L8dP" (sender: HcSCIYCUTbYt5bs34t83JEsFUKgN4zufv9HftjBtPOuohtycU3wZaKjH3oddq9a)
(@realmode@) [e] (SCiO) #### CHECK FAILED for (agent) "HcSCJRp65yqoHfscf3IZzFpnXBrabxccfcsne6nsO3ybog4xiJVG8FOScatuhrz" + (DNA) "QmU9Q5dtg17gmUEayBwAfr4rT7L9R4PcSGUS4rVjF8L8dP" (sender: HcSCIYCUTbYt5bs34t83JEsFUKgN4zufv9HftjBtPOuohtycU3wZaKjH3oddq9a)

And I see a lot of these in the beginning:

!C15! (@realmode@) [w] received gossipNewTrack from unknown peer (ScIR)
(@realmode@) [w] Received peerHoldRequest for untracked DNA QmU9Q5dtg17gmUEayBwAfr4rT7L9R4PcSGUS4rVjF8L8dP
(@realmode@) [w] received gossipNewTrack from unknown peer (ScJ5)
(@realmode@) [w] Received peerHoldRequest for untracked DNA QmU9Q5dtg17gmUEayBwAfr4rT7L9R4PcSGUS4rVjF8L8dP
(@realmode@) [w] received gossipNewTrack from unknown peer (SCJb)
(@realmode@) [w] Received peerHoldRequest for untracked DNA QmU9Q5dtg17gmUEayBwAfr4rT7L9R4PcSGUS4rVjF8L8dP
(@realmode@) [w] received gossipNewTrack from unknown peer (SCIK)
(@realmode@) [w] Received peerHoldRequest for untracked DNA QmU9Q5dtg17gmUEayBwAfr4rT7L9R4PcSGUS4rVjF8L8dP
(@realmode@) [w] received gossipNewTrack from unknown peer (SCJy)
(@realmode@) [w] Received peerHoldRequest for untracked DNA QmU9Q5dtg17gmUEayBwAfr4rT7L9R4PcSGUS4rVjF8L8dP
(@realmode@) [w] received gossipNewTrack from unknown peer (SCJV)
(@realmode@) [w] Received peerHoldRequest for untracked DNA QmU9Q5dtg17gmUEayBwAfr4rT7L9R4PcSGUS4rVjF8L8dP
(@realmode@) [w] received gossipNewTrack from unknown peer (SCIe)
(@realmode@) [w] Received peerHoldRequest for untracked DNA QmU9Q5dtg17gmUEayBwAfr4rT7L9R4PcSGUS4rVjF8L8dP
(@realmode@) [w] received gossipNewTrack from unknown peer (SCJa)
(@realmode@) [w] Received peerHoldRequest for untracked DNA QmU9Q5dtg17gmUEayBwAfr4rT7L9R4PcSGUS4rVjF8L8dP
(@realmode@) [w] received gossipNewTrack from unknown peer (SCI5)
(@realmode@) [w] Received peerHoldRequest for untracked DNA QmU9Q5dtg17gmUEayBwAfr4rT7L9R4PcSGUS4rVjF8L8dP
(@realmode@) [i] (ScIU) REGISTERED AGENT HcSCiVF8SMcmaybncycPTK8vQXEn9ij69FagokBopnagqzikPFkcWCZhfcistbz QmU9Q5dtg17gmUEayBwAfr4rT7L9R4PcSGUS4rVjF8L8dP hc://HcScIUgN9RS7aa9eioEbJ64s76jQ6VYc7yWH3oVu6sNavqc9ZRVkVE7T48nurnr
(@realmode@) [w] received gossipNewTrack from unknown peer (Sciu)

Connoropolous commented 5 years ago

@lucksus interesting stuff! As for "[C1] warn/consistency: Publishing address that was not previously committed": that actually comes from the recent Consistency Signals PR #1431.

Connoropolous commented 5 years ago

Let me see if it is a matter of timing, and whether it retries and succeeds.

Connoropolous commented 5 years ago

I am guessing, @lucksus, that you are running on your beast machine? I've been very curious to know what would happen if you tried on that machine.

Connoropolous commented 5 years ago

By upping the amount of allowed time, I see that it succeeds in validating and holding on subsequent retries, and with 8 nodes I get the 16 Hold actions I expect. Yay!

Now to try to determine the specifics of the other error messages and warnings that are showing up.
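
For anyone following along, the shape of the behaviour I'm describing is roughly this (purely illustrative Rust; none of these names come from core): validation keeps being re-attempted while dependencies are missing, so allowing more time gives the retries room to succeed.

```rust
// Illustrative only: a retry loop that keeps attempting a Hold while its
// dependencies are unresolved, up to an overall deadline.
use std::thread::sleep;
use std::time::{Duration, Instant};

enum HoldOutcome {
    Held,
    UnresolvedDependencies(Vec<String>),
}

// Stand-in for the real hold_entry/hold_link workflow (hypothetical).
fn try_hold(_address: &str) -> HoldOutcome {
    HoldOutcome::UnresolvedDependencies(vec!["some-missing-dependency-hash".to_string()])
}

fn hold_with_retries(address: &str, deadline: Duration) -> bool {
    let start = Instant::now();
    loop {
        match try_hold(address) {
            HoldOutcome::Held => return true,
            HoldOutcome::UnresolvedDependencies(missing) => {
                if start.elapsed() >= deadline {
                    eprintln!("gave up on {}: still missing {:?}", address, missing);
                    return false;
                }
                // The missing entries may still arrive via gossip; wait and retry.
                sleep(Duration::from_millis(500));
            }
        }
    }
}

fn main() {
    // A short deadline gives up early; a longer one lets gossip catch up.
    hold_with_retries("some-entry-address", Duration::from_secs(2));
}
```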

ddd-mtl commented 5 years ago

The CHECK FAILED, unknown peer, and untracked DNA warnings are basically timing issues, unless you have rogue peers in your test.

lucksus commented 5 years ago

Yes, I'm running on an AMD Ryzen Threadripper 2950X 16-Core Processor (32-thread), 64GB RAM, NVMe SSD beast :)

Connoropolous commented 5 years ago

I am reopening this because I actually think this Err message shouldn't be an 'err' and should be reported as a warning instead. I think the err message needs to change... it's currently:

"[C7] 2019-05-29 20:14:10 ThreadId(1):test-1: err/workflow/hold_entry: QmUFxhsmL1krv59jL9rRQBKj8Jn311Tpz95KjR7zKEariU is NOT valid! UnresolvedDependencies([HashString("HcSCitO5WWuzmbyf7sfBZp7Oq5Y58Wkq7wvMA4gP4r364ideWxQkpHjko4ndrni"), HashString("QmWY98zemN4Mi9A9SgmC8vQufaQUf5a13Xe81PfWDvvnn9")])"

But I don't think an UnresolvedDependencies err should be reported as "NOT valid!". I believe this should be a warning instead.

Yeah, that is here, and is really misleading: https://github.com/holochain/holochain-rust/blob/c01fc4a3d70a45f5259accd6b90c7c3e50e40817/core/src/workflows/hold_entry.rs#L68-L83
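
To make the point concrete, here's a rough sketch of the kind of distinction I mean. This is not the actual code at that link; the enum, function, and plain println! logging are made up for illustration:

```rust
// Illustrative only, not the holochain-rust hold_entry workflow.
// The idea: treat UnresolvedDependencies as a recoverable, retryable state
// (warning or debug) and reserve "err ... NOT valid!" for entries that
// genuinely failed validation.
enum ValidationError {
    // Required entries are not in the local store yet; retrying may succeed.
    UnresolvedDependencies(Vec<String>),
    // The entry really is invalid.
    Fail(String),
}

fn log_hold_result(address: &str, result: &Result<(), ValidationError>) {
    match result {
        Ok(_) => println!("debug/workflow/hold_entry: {} held", address),
        Err(ValidationError::UnresolvedDependencies(deps)) => println!(
            "warn/workflow/hold_entry: {} has unresolved dependencies, will retry: {:?}",
            address, deps
        ),
        Err(ValidationError::Fail(reason)) => println!(
            "err/workflow/hold_entry: {} is NOT valid! {}",
            address, reason
        ),
    }
}

fn main() {
    let pending: Result<(), ValidationError> = Err(ValidationError::UnresolvedDependencies(
        vec!["QmWY98zemN4Mi9A9SgmC8vQufaQUf5a13Xe81PfWDvvnn9".to_string()],
    ));
    log_hold_result("QmUFxhsmL1krv59jL9rRQBKj8Jn311Tpz95KjR7zKEariU", &pending);
}
```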

Connoropolous commented 5 years ago

@maackle @lucksus @thedavidmeister @AshantiMutinta could y'all or some of you discuss what's appropriate to do here?

lucksus commented 5 years ago

Yes, totally agree. I would even say it should be a "debug/.." message and we could think about adding a warning if something couldn't be validated after 3 minutes.
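
Something roughly like this, as a sketch only (the names and the 3-minute threshold are made up here, not anything from the codebase):

```rust
// Illustrative sketch of that idea: log retries at debug level and only
// escalate to a warning once an entry has been pending longer than a threshold.
use std::time::{Duration, Instant};

// Assumed threshold per the suggestion above; not a real project constant.
const PENDING_WARN_THRESHOLD: Duration = Duration::from_secs(3 * 60);

struct PendingValidation {
    address: String,
    first_seen: Instant,
}

fn log_unresolved(pending: &PendingValidation, missing: &[String]) {
    if pending.first_seen.elapsed() >= PENDING_WARN_THRESHOLD {
        // Still unresolved after the threshold: now it's worth a warning.
        println!(
            "warn/workflow/hold_entry: {} still unresolved after {:?}: {:?}",
            pending.address, PENDING_WARN_THRESHOLD, missing
        );
    } else {
        // Expected transient state while gossip catches up: keep it at debug.
        println!(
            "debug/workflow/hold_entry: {} waiting on dependencies, will retry: {:?}",
            pending.address, missing
        );
    }
}

fn main() {
    let pending = PendingValidation {
        address: "QmUFxhsmL1krv59jL9rRQBKj8Jn311Tpz95KjR7zKEariU".to_string(),
        first_seen: Instant::now(),
    };
    log_unresolved(&pending, &["QmWY98zemN4Mi9A9SgmC8vQufaQUf5a13Xe81PfWDvvnn9".to_string()]);
}
```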