helium / miner

Miner for the helium blockchain
Apache License 2.0

unable to locate POC ID - failed to dial and others errors #1353

Open Biscottin0 opened 2 years ago

Biscottin0 commented 2 years ago

Hi, I have a Pisces P100 and in the dashboard I find many errors of different types.

2022-01-10 19:47:38.183 6 [info] <0.6412.0>@miner_onion_server:decrypt:{356,25} unable to locate POC ID <<"RgKVg_jxVZzjWwT46G9zP7L12FpyBd7oIzMDR4_jrvQ">>, dropping

2022-01-10 19:57:14.488 [error] <0.1586.0>@blockchain_state_channels_client:handle_info:{205,5} failed to dial "/p2p/11w77YQLhgUt8HUJrMtntGGr97RyXmot1ofs5Ct2ELTmbFoYsQa": timeout dropping 5 packets

2022-01-11 18:21:53.111 6 [error] <0.1483.2>@miner_onion_server:send_witness:{207,5} failed to send witness, max retry

2022-01-11 18:21:53.111 6 [info] <0.1483.2>@miner_onion_server:send_witness:{246,37} re-sending witness at RSSI: -116, Frequency: 867.3, SNR: -21.5
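To see which errors dominate, log lines like the ones above can be grouped by level, module and function. The following is a minimal sketch, not a Helium tool: the line layout in the regular expression is an assumption inferred only from the entries quoted in this issue, and `parse_line` and `count_by_source` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the lines quoted in this issue:
# "<date> <time> [optional number] [level] <pid>@module:function:{line,col} message"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})"
    r"(?: \d+)? "
    r"\[(?P<level>\w+)\] "
    r"<(?P<pid>[\d.]+)>@(?P<module>\w+):(?P<function>\w+):"
    r"\{(?P<line>\d+),(?P<col>\d+)\} "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_by_source(lines):
    """Count parsed entries per (level, module, function); unparsed lines are skipped."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry:
            counts[(entry["level"], entry["module"], entry["function"])] += 1
    return counts
```

Feeding it the lines of a saved log file and calling `most_common()` on the result would show whether the "failed to dial" or the witness errors are the more frequent ones.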

Did any of you manage to solve these problems, or would you know what to do? Also, it seems that my hotspot works fine as soon as I restart and clean the blockchain, but after about 20h it stops showing any activity. Only if I clean the blockchain and re-sync does it work decently. Thanks a lot to those who can give me a hand!

serbyxp commented 2 years ago

I don't think anyone has solved anything. I have had a bunch of different issues, and when I checked my logs again this morning, that issue showed up today as well. There is something very wrong on the p2p network, but I believe that, based on the numbers, the dev team sees nothing wrong: say 10,000 people have the issue out of 500,000. I think they are working more on the light hotspot rollout, since the documentation says it is going to be out before the end of Q1 ... that will solve the issue, since the hotspots are getting off the p2p network. So I think they are focusing on that rollout and not on fixing what is broken. I could be wrong, but there are thousands of hotspots affected by this. --- this is all speculation ---

I even bought different hardware thinking I had broken hardware (concentrator card); that didn't help either... no clear answer about what is going wrong from anyone on the Helium end of things.

From what I have gathered over the past month with these issues, which fluctuate from release to release: on the last release before 2022, I was syncing fine and sending beacons that weren't received by anyone, with no witnessing; all my rewards were challenger rewards and beacons with 0 witnesses. Post 2022, my miner has been syncing for days now. I even redid it 2 days ago (fresh install, fresh OS, fresh blockchain) and it has been stuck in the "syncing" state, but I am receiving some witness rewards and challenger rewards, though I haven't sent a beacon out in about a week. The sync status from the API on Explorer shows last updated 6 days ago; why that is, I'm not sure. There is also an issue with the snapshot handler, which I think is behind my syncing problem: it asks for a saved-snap and it throws some 112 error in erlangiot file. Some have said this is code on the validators' side, which I cannot confirm. But the combination of all these errors has flatlined a lot of miners.

What I don't understand, or maybe it's normal operation of the network: I'm not synced to the chain, yet I'm still challenging and still earning witness rewards. So not being "synced" still lets me receive witness rewards, but I don't think it lets me be a challengee, only a challenger.

Besides the issue that you described, the failure to send the witness, which I can confirm is not just you, there is a PoC witness receipt that gets sent and is "received" but is not posted as a valid witness. I believe those receipts don't get posted simply because of the cap on the total number of witnesses, i.e. the random lottery over the max number of witnesses allowed to get rewards. But then again, other users with hotspots claim that is not the issue, because there aren't more than the max (16 or 18, whatever the max number is) on the posted blockchain receipt. No one has come out and officially said what the issue is. But it has been an ongoing issue for the past 3+ months.

I noticed that the ARM version is the one suffering the most. I see they put out a new AMD (amd64) version yesterday, which I downloaded and which finished syncing overnight. I am going to try to move my ECC chip, along with the concentrator card, onto my AMD-based laptop and see if that resolves it.

Csubi76 commented 2 years ago

> I noticed that the ARM version is the one suffering the most. I see they put out a new AMD (amd64) version yesterday, which I downloaded and which finished syncing overnight. I am going to try to move my ECC chip, along with the concentrator card, onto my AMD-based laptop and see if that resolves it.

How do you run the AMD version? I would like to try it in a virtual machine too.

serbyxp commented 2 years ago

I gave up on it since the latest update. But I was running it on a laptop, following the same basic setup instructions from the Helium docs, with Docker. I had to rig up a socket in Python so the I2C bus could go over USB, and that won’t work on an out-of-the-box hotspot firmware.

*Though while it’s syncing with no ECC key, it will sync fine on the unapproved swarm key that is generated as a file; you just won’t be able to do anything with that. It will connect to peers but not do any PoC or any of that.

I’m not a coder, so when I say rig up I mean rig up. I think what got it working for me might be the remote GPIO option in raspi-config, but I couldn’t say LOL

Another option I had in mind, which I tried first, was connecting the ECC chip to an ESP32 I2C bus and using MicroPython over the USB bus of the ESP32's CHxx chip, but I got tired of switching the thing back and forth.

You also need to change the container option for i2c-1:i2c-1 to whatever the COM port is, /dev/ttyUSBxx:i2c-1. It’s not for the faint of heart, and I tried a lot of things until it worked. But as of the new release I just re-provisioned the raspi64 to a fresh bullseye64 and ran it like normal. I have noticed more witnesses, but I haven’t dug any more into it. I wasted a month on that; back on AutoCAD doing things I know how to do…
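For anyone trying the same device remapping, here is a rough sketch of how such a mapping could be expressed as a `docker run` argument list. It only illustrates the `--device host:container` syntax. The image name, the `/dev/ttyUSB0` path and the helper names are assumptions for illustration, and mapping a serial port does not by itself make it behave like an I2C bus; that still needs the kind of bridge described above.

```python
def device_flag(host_device, container_device="/dev/i2c-1"):
    """Build a docker `--device` argument mapping a host device into the container."""
    return ["--device", f"{host_device}:{container_device}"]


def docker_run_args(image, host_device):
    """Assemble a detached `docker run` command line with the device mapping applied."""
    return ["docker", "run", "-d", *device_flag(host_device), image]


if __name__ == "__main__":
    # Hypothetical image name and host device, purely for illustration.
    print(" ".join(docker_run_args("example/miner:latest", "/dev/ttyUSB0")))
```

Building the command as a list keeps it ready to pass to `subprocess.run` without shell quoting problems.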

My reason for trying all the hardware options, and not just learning the VM option, was the light hotspot implementation that is in the “works” for future updates. I wanted to have all my code / hardware ready for deployment.

All logs posted here for review have been from arm64, running the basic Helium documentation steps, and concern arm64-related issues. I just want that to be clear.

serbyxp commented 2 years ago

@Csubi76 Are you having an Erlang error, like "gen_server:call/2 line 239" or something similar? My original reason for jumping over to the AMD version, to see if it was throwing any errors, was that someone mentioned an Erlang issue with libp2p. I don't know anything about that; as I mentioned, I'm not a coder. But I did a quick Google search on that gen_server:call/2 line 239 error and found something on a RabbitMQ forum where a person mentioned that their server was not running Erlang/OTP 21; they changed their server to Erlang 21 and it fixed the issue for them. I don't remember where I posted that in this forum, but I believe the people I am having the dial errors with are running a different version of Erlang on their machines. I'm going to look deeper into all my failed-to-dial error addresses and see if they are validators running the AMD version in the cloud. I remember mentioning in that post that whoever compiled the AMD version should compile the ARM version, because of this. Call me crazy, but I think I figured this out weeks ago.

But then, if we switch to the AMD version, aren't we going to start having the dial errors with the opposite group of people??

@madninja Would it be hard to code or compile the image with both Erlangs?? Like with some sort of "IF gen_server:call/2 line 239, THEN run Erlang 21 OTP". I dunno how that would work, but someone that does can fix that in a second.

I'm starting to research this. If anyone has a better source or explanation, please let me know before I go down this rabbit hole... I'm starting here...

  1. https://linux.die.net/man/3/gen_server
  2. https://www.erlang.org/doc/reference_manual/code_loading.html

serbyxp commented 2 years ago

@Csubi76 I did a dual miner test yesterday with my setup at like 3-4am Eastern Standard Time. I’m not sure if it was a coincidence or what, but after 2 weeks my miner actually synced to the Helium API. I did it as a test with my laptop. So it looks like it interacts with the API better on AMD.

@Csubi76 I am trying to learn Docker buildx, but I can’t find good documentation (I don’t know Docker very well). I read in the build instructions in the main repo here that the AMD version needs an AVX-capable processor or something like that. I don’t know if you got it to work in a VM?? I looked to see if the Pi 4 is AVX capable; I found something in the TensorFlow documentation that made me think the Pi 4 is AVX capable, while most old miners like the original Nebra had a Pi 3, which I did not see listed as AVX capable. Something like ARMv7 vs ARMv8 being different. *I am using a Pi 4 4GB, the first revision though, with ARM64, so I should be able to get it to work in time. Any resource or documentation on usage of the --platform option or a VM would save me some time, so I can tackle the issue faster than learning all new stuff from 0.
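Rather than relying on documentation, the CPU features of a Linux machine can be read from `/proc/cpuinfo`. A minimal sketch of such a check follows; `cpu_flags` and `has_avx` are hypothetical helper names. Note that AVX is an x86 instruction-set extension, so an ARM board such as a Pi would be expected to list `Features` like `asimd` instead and report no AVX.

```python
def cpu_flags(cpuinfo_text):
    """Collect CPU feature names from /proc/cpuinfo-style text.

    x86 kernels list them under "flags", ARM kernels under "Features".
    """
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("flags", "features"):
            flags.update(value.split())
    return flags


def has_avx(cpuinfo_text):
    """Return True if the feature list contains the exact flag 'avx'."""
    return "avx" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    # On a Linux host this reads the real CPU description.
    with open("/proc/cpuinfo") as handle:
        print("AVX available:", has_avx(handle.read()))
```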

Any luck? Because I was only testing and did not want to exceed a “testing” time period, I stopped the AMD version running on the laptop. I’m now trying to figure out (learn from scratch) Erlang. I have gone through all of the files in the main Helium miner repo and made some notes. I don’t think there is anything wrong with the main Helium repo, and the code looks fine. The issue is not there.

Today, if I find some focus time, I am going to go through the external repos, or I guess what they call dependencies, and look at the differences between the two types.

@madninja I’m sure you are aware of this (you wrote the instructions, so it seems), but since it’s a coding issue I just wanted to mention you here, so you can see the different miner hardware types and maybe make a correlation between issues on different hardware vs the software dependencies.

If you haven’t noticed, I am good at hardware and installing / modifying drivers, as per “the manuals and documentation”. I can read and understand code in any language, but I’m not proficient at writing it. In short, I’m a hardware guy, not a software guy. Thank you.

Csubi76 commented 2 years ago

@serbyxp Sorry for the late reply.

Yes, I have such error messages too. (regularly)

2022-01-24 20:16:00.821 [error] <0.1655.0>@blockchain_state_channels_client:handle_info:{205,5} failed to dial "/p2p/11w77YQLhgUt8HUJrMtntGGr97RyXmot1ofs5Ct2ELTmbFoYsQa": timeout dropping 2 packets

Unfortunately, I was unable to set up the virtual machine. (Sometimes I feel this is a HELIUM limitation due to the many hotspots.)

Kaktusman2022 commented 2 years ago

Same failure here, end of April 2022:

"/p2p/11w77YQLhgUt8HUJrMtntGGr97RyXmot1ofs5Ct2ELTmbFoYsQa": timeout dropping 2 packets

2022-04-30 10:54:36.084 [error] <0.1943.0>@blockchain_state_channels_client:handle_info:{205,5} failed to dial "/p2p/11w77YQLhgUt8HUJrMtntGGr97RyXmot1ofs5Ct2ELTmbFoYsQa": timeout dropping 2 packets

2022-04-30 10:55:20.771 [error] <0.1943.0>@blockchain_state_channels_client:handle_info:{205,5} failed to dial "/p2p/11w77YQLhgUt8HUJrMtntGGr97RyXmot1ofs5Ct2ELTmbFoYsQa": timeout dropping 2 packets

2022-04-30 10:56:26.039 [error] <0.1943.0>@blockchain_state_channels_client:handle_info:{205,5} failed to dial "/p2p/11w77YQLhgUt8HUJrMtntGGr97RyXmot1ofs5Ct2ELTmbFoYsQa": timeout dropping 2 packets

2022-04-30 10:57:15.449 [error] <0.1943.0>@blockchain_state_channels_client:handle_info:{205,5} failed to dial "/p2p/11w77YQLhgUt8HUJrMtntGGr97RyXmot1ofs5Ct2ELTmbFoYsQa": timeout dropping 2 packets

2022-04-30 10:57:15.449 [error] <0.1943.0>@blockchain_state_channels_client:handle_info:{205,5} failed to dial "/p2p/11w77YQLhgUt8HUJrMtntGGr97RyXmot1ofs5Ct2ELTmbFoYsQa": timeout dropping 2 packets
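When log output is pasted as one run of text like this, it can help to split it back into entries and total the dropped packets per peer. The sketch below is illustrative only: it assumes entries start with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp and that the failure message has exactly the wording quoted in this thread; `split_entries` and `dropped_per_peer` are hypothetical names.

```python
import re
from collections import defaultdict

# Zero-width lookahead: split just before each timestamp, keeping the timestamp.
TIMESTAMP = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")

# Message wording assumed from the entries quoted in this thread.
DIAL_FAILURE = re.compile(
    r'failed to dial "(?P<peer>/p2p/[^"]+)": timeout dropping (?P<n>\d+) packets'
)


def split_entries(blob):
    """Split run-together log text into one string per entry."""
    return [part.strip() for part in TIMESTAMP.split(blob) if part.strip()]


def dropped_per_peer(blob):
    """Sum the dropped packet counts of 'failed to dial' entries per peer address."""
    totals = defaultdict(int)
    for entry in split_entries(blob):
        match = DIAL_FAILURE.search(entry)
        if match:
            totals[match.group("peer")] += int(match.group("n"))
    return dict(totals)
```

A leading fragment without the "failed to dial" text, like the first one in this comment, is split off as its own entry and simply not counted.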

serbyxp commented 2 years ago

As far as the failed to dial errors go, that’s the network in general, not specific to “Helium” itself, but down to the fact that there are so many variables involved: relayed hotspots, different countries, ISPs. After months, I think it has a lot to do with the peer nodes. I raised my peer count to something ridiculous like 150 inbound, 30 outbound, and the refresh interval to like 45000 or so; I’m not in front of it now. I also added a bunch of the seed node peers, all in the sys.config file. Then at least my peer book -c (count) went over 400k; without that, my peer book count would stay somewhere around 340k max. Doing that, together with the “peer connect” script, works a lot better for the failed to dials. But whenever they figure out the light hotspot thing we should be good to go! - “whenever” -

I suggest, if any of the devs read this, that light hotspots and/or validators do some sort of “tunneling” network (VPN or proxy), so each light hotspot can connect through the “tunnel” for gRPC to have a “feed”, or something like “websockets”. But we have to wait and see how gRPC actually works in a non-testnet environment (with 700k hotspots), not the handful they run on testnet, most likely on the same AWS network servers etc…

I did a DNS “dig” / “nslookup” on seed.helium.io (or whatever it is) and I got green on all the servers, so it isn’t that so much…
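The same kind of lookup can be scripted. This is a minimal sketch using only the Python standard library; the hostname is taken from the comment above and may not be the exact seed name, and `resolve_all` is a hypothetical helper. It only lists the addresses a name resolves to and does not test whether those servers accept connections.

```python
import socket


def resolve_all(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated addresses that a hostname resolves to.

    The resolver is injectable so the function can be exercised without network access.
    """
    infos = resolver(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Hostname as quoted in the thread; treat it as an assumption.
    for address in resolve_all("seed.helium.io"):
        print(address)
```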

RepRapid commented 2 years ago

I still don’t understand the point of validators. For the past 8+ months they have been pretty pointless if you ask me; they get rewarded for nothing… imho… besides getting rewards for investing in Helium. I would figure a better system or use for them would be to actually validate things… i.e.…

challenger -> beaconer -> witnesses -> receipts ->> challenger / validators <<- compare receipts / lottery ->> write to blockchain.

That way not only the challenger gets the receipts but the validator gets the receipts too; then they compare what they got before the lottery is applied and the result is written… that would be “validation” that the challenger and validator got the same receipts. Because a validator can filter out receipts, the same way the deny list works, or the challenger can, or both. But at least that would help with “self-witnessing spoofers” and “packet stuffing” gamers etc.

That way it’s not 1 person in control of everyone’s receipts. In the scenario above, all the validators can accept the witness receipts, not just 1 challenger or 1 validator having full authority over the PoC handling process. Same as how any other crypto works: you can’t write to the blockchain unless 51+% agree that all those receipts are “valid”, or that all the “transactions” that took place in that PoC were accounted for.
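The comparison step proposed above could look roughly like this. It is a hypothetical sketch of the suggestion only, not how Helium's PoC actually works: each party reports the set of receipt IDs it received, and a receipt is kept only if more than half of the reporters saw it. The names `agreed_receipts`, `reports` and `threshold` are made up for illustration.

```python
from collections import Counter


def agreed_receipts(reports, threshold=0.5):
    """Return receipt IDs reported by more than `threshold` of all reporters.

    `reports` maps a reporter name (challenger or validator) to the set of
    receipt IDs that reporter says it received.
    """
    if not reports:
        return set()
    counts = Counter()
    for receipts in reports.values():
        counts.update(set(receipts))  # count each receipt at most once per reporter
    needed = threshold * len(reports)
    return {receipt for receipt, seen in counts.items() if seen > needed}
```

With the default threshold, a receipt dropped or filtered by a single party still survives as long as a majority of the others saw it, which is the point of not leaving one challenger in sole control.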

Because it’s not just failed to dial that occurs on witnesses… the network is actually losing value, because “technically” its value is that it costs DC to transmit over the radios. So if a challenger creates a challenge and sends it over the network to a challengee, and the challengee sends a beacon over the radio, and a bunch of witnesses send their receipts to the challenger successfully, and the challenger* fails to send the final record of all the receipts, then that “beacon” that got transmitted over the air was “free” and not accounted for. Thus the value of DC does not have a direct correlation to transmission.

In a restaurant, your food cost has many variables. But say a “steak” costs them $10 from the supplier and they sell it for $25, and $5 goes to other “food cost”, aka paying the cook and the gas bill. If the waiter drops the $10 steak on the floor, it needs to be accounted for, because now the food cost went from ($10 + $5) to ($10 + $10 + $5), thus your profits went out the window. The value is still the same, but you wasted 1 extra steak to make that transaction. Helium is the manager in that scenario. The cook did his job; he shouldn’t be penalized or go unpaid because the waiter dropped the plate. The waiter might get penalized because he didn’t handle the plate properly, and the manager or company could just dock his pay for the cost, or not reward him. After so many fails, the manager is not going to let that person continue handling that type of transaction, because it’s costing the company’s bottom line.

I’ve sent plenty of successful receipts that the challenger then failed to post to the blockchain. Which would be all fine and dandy if there were no limit to the number of beacons being sent over the air. But if there is a maximum of 3 per day (more accurately 1.5 with epoch at 975), then every witness that is successfully sent should be accounted for, even if you don’t win the lottery pool. So a hotspot can go 2 days witnessing and sending receipts, successfully or unsuccessfully, doing its “intended function” or job. But some other cog down the line is not letting it get rewarded for it… eventually the “value” of PoC is not worth the “work”, and the total value of the network will decline. Because now you’ve got 700k hotspots that have the “tool” to do the work on their own… or form their own network. Got to keep the cogs out before the value goes so far down. The 5G rollout is just a way to prolong the inevitable if the network doesn’t correct itself.