uaktags / explorer

An open source block explorer
BSD 3-Clause "New" or "Revised" License

Testing for cluster #14

Open uaktags opened 5 years ago

uaktags commented 5 years ago

While looking at the idea in #13, I started wondering: if we have a separate sync.js application, would it be faster to communicate via CLI vs. TCP/RPC? Keep the sync and the daemon local to each other, with only one outbound connection to the MongoDB.

To test, I'm going to try a timed reindex of VULC for 10 minutes, 3x each: 3x with RPC to the localhost daemon, 3x with RPC to a remote server on a LAN, and 3x with the local CLI. The goal is to see whether we get any performance improvement.

uaktags commented 5 years ago

So... exec'ing the CLI from the filesystem is both a bad idea and slow from Node. 1000 blocks took well over 4 minutes, since a new process had to spin up just to read the stdout.

So I went with removing the main explorer out of the loop altogether, which relates to #13: using settings.wallet and the bitcoin-core module rather than sending everything over to http://explorer.com/api/method, which translates into 3 additional calls versus our 1.

1000 blocks the old way takes 34 seconds; 1000 blocks the new way takes 26 seconds, which is roughly 24% less time. Going for 10k blocks right now on VULC as a test.

uaktags commented 5 years ago

{ tx_count: 19231, address_count: 527, seconds: 585.792 } <= old method.

{ tx_count: 19231, address_count: 527, seconds: 444.587 } <= new method

So there's definitely an improvement to be had. The initial speed boost didn't quite hold up, unfortunately, but 24% is still noticeably faster. It's especially compelling when you consider that this could be happening on a separate box from the webserver where the explorer is running.

uaktags commented 5 years ago

Could this be made any faster with simple clustering? Test is going now.

uaktags commented 5 years ago

{ tx_count: 19222, address_count: 527, seconds: 78.978 } <= cluster method. Well... something's wrong with my start/stops, because we're missing 9 transactions, but that... is fast.

uaktags commented 5 years ago

{ tx_count: 19227, address_count: 527, seconds: 78.628 }

Hmm... I'm at a loss on where these last ones are going. I had found that my code was missing the last index (so if I wanted to count 1000, it would only go to 999), but that's not the case anymore. More tests needed.

uaktags commented 5 years ago

I'm still working this out; it appears the run is dying somewhere in

function save_tx(txid, cb) {

in database.js. Sticking with a small count like 100, in VULC I should get the following:

{ tx_count: 105, address_count: 111, seconds: 3.031 } <= new method
{ tx_count: 105, address_count: 111, seconds: 3.362 } <= old method

Instead, with clustering, I get:

{ tx_count: 102, address_count: 107, seconds: 2.116 }

However, when I console.log everything up to around line 160 of database.js and count the console outputs, I get 105. So the loss is somewhere after line 160, which narrows it down to lines 160-189, where newTx.save occurs (or doesn't occur).

uaktags commented 5 years ago

Figured it out. 500 records:

{ tx_count: 513, address_count: 139, seconds: 28.934 } old
{ tx_count: 513, address_count: 139, seconds: 12.432 } new
{ tx_count: 513, address_count: 139, seconds: 4.896 } cluster

{ tx_count: 513, address_count: 139, seconds: 16.897 } old
{ tx_count: 513, address_count: 139, seconds: 12.829 } new
{ tx_count: 513, address_count: 139, seconds: 4.715 } cluster

{ tx_count: 513, address_count: 139, seconds: 16.934 } old
{ tx_count: 513, address_count: 139, seconds: 12.754 } new
{ tx_count: 513, address_count: 139, seconds: 4.862 } cluster

uaktags commented 5 years ago

Bitcoin now:

{ tx_count: 508, address_count: 509, seconds: 83.864 } old
{ tx_count: 508, address_count: 509, seconds: 23.67 } new
{ tx_count: 508, address_count: 509, seconds: 4.437 } clustering

Moving on to 5000 blocks:

{ tx_count: 5063, address_count: 5048, seconds: 14.558 } clustering
{ tx_count: 5063, address_count: 5048, seconds: 459.504 } new

For obvious reasons, I'm not about to run that over the old method. It's pretty clear that clustering has its place here.

vrfebet commented 5 years ago

Hi. The term "clustering" is unknown to me. I'm trying to synchronize a coin with your version of the explorer, but thousands of transactions per block make synchronization slower than block creation. So I'd like to know what needs to be done to enable clustering. I'm running the explorer on a quad-core processor, but it only uses 18-25% CPU... I've read the whole thread several times and I can't understand it.

vrfebet commented 5 years ago

(screenshot: syn_exp)

vrfebet commented 5 years ago

So I figured it out and got it running. The difference is really noticeable; we'll see what the result will be... Thanks for the nice work. (P.S. I rewrote these files from the "cluster-sync" branch: benchmark-cluster.js, explorer.js, sync.js.)

vrfebet commented 5 years ago

(screenshot: cluster)

uaktags commented 5 years ago

So my use of the term clustering refers to two parts: one which could be considered just multi-threading, and another which will hopefully become the idea of using multiple servers as "sync servers" that work together to speed up the sync operation.

NodeJS is single-threaded, so only one process runs "node scripts/sync.js" and only one process ID ever does all the work. To mitigate this, there's a Node module called "cluster" which essentially spins up new "node" processes; the elegance of the module is that each process knows its role (master vs. worker). We already leverage clustering in the original explorer to keep the front-end Express application running on port 3001: multiple Node.js processes all run the same code and share the port, so multiple processes handle the incoming traffic and no single thread/process can break the website experience.

Our goal is/was to leverage this same idea, but for syncing. Syncing is slow because it's single-threaded, calls out to the explorer's Express app, and only then gets routed to the wallet. To speed this up we need to do two things: one, eliminate the extra call back to the explorer and communicate with the wallet directly; and two, run multiple workers so more blocks are handled simultaneously.

That new design by itself isn't clustering in the sense most people use the word (multiple computers handling a single job). Instead, you could say I added multiple simultaneous worker processes.

The follow-up design will probably be a new fork of Iquidus as a new project. The new design will have a front-end application ("the explorer") and a backend. The backend will simply be things like the sync scripts, which you can run on multiple computers to handle your syncing. This will also let you run only the backends and swap the NodeJS front-end for something else (I've created a PHP Laravel front-end that I like a lot, for example).

Hope that better explains the intention and goal. Hope it's working great for you.

vrfebet commented 5 years ago

Thank you for the exhaustive answer. I'm slowly starting to understand it. The use of multiple PCs also interests me, and I now understand that this is about CPU clustering. I'm currently watching how synchronization proceeds in different parts of the blockchain. Is there any way to see the overall sync percentage (i.e., to follow how fast the procedure is)? ... It probably doesn't belong here, but the Top 100 balance page is empty for me.

uaktags commented 5 years ago

"Is there any way to find a complete percentage of sync?" Unfortunately, not yet. Block-wise, it could be as simple as a setInterval that outputs the current total count of blocks handled vs. the block height of the chain, but I can't think of a way to estimate remaining time or number of transactions. The only thing I have so far is a count at the end, similar to how the benchmark works.
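That setInterval idea could be sketched roughly like this; formatProgress, blocksDone, and daemonHeight are hypothetical names, not existing code:

```javascript
// Format "blocks handled / chain height" as a percentage string.
function formatProgress(synced, chainHeight) {
  const pct = chainHeight > 0 ? (synced / chainHeight) * 100 : 0;
  return synced + '/' + chainHeight + ' blocks (' + pct.toFixed(2) + '%)';
}

// In sync.js this might be wired up roughly as:
// const timer = setInterval(function () {
//   console.log(formatProgress(blocksDone, daemonHeight));
// }, 30000);
// clearInterval(timer); // once the sync completes
```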

The top100 won't show until it finishes completely.

vrfebet commented 5 years ago

Yes, I understand the problem of reading the current status. For now I'm solving it by primitive manual counting of blocks per worker per time slot; at least it gives an overview. Thanks for the info on the Top 100.

vrfebet commented 5 years ago

Hi, this situation happened today and synchronization stopped. I don't know where to look for information (a log) about the cause. (The sync then started running normally again — it just stopped.) (screenshot: worker4died)

vrfebet commented 5 years ago

So here I am again... Synchronization stopped (see image) with the message "update Stats Completed", but it is not complete and cannot be restarted even with the "check" parameter. Each time "sync.js index update" was restarted, synchronization started closer to the end of the blockchain and many blocks were skipped along the way; starting with the "check" parameter didn't help either. The explorer shows 611828 as the last block, but the actual last block is 612421. (screenshot: update-NOT-completed_02)

vrfebet commented 5 years ago

So far, completely rebooting the server seems to have helped. After that I started "index update" as normal, and it looks like it's filling in the empty blocks. (I should say that when I couldn't start the synchronization any other way, I ran "index check" from 1.7.2 (the non-cluster version) to try to catch up; I don't know if it had any effect. Then it all crashed (links below), and after the server restart everything started fine.) http://coalmine.eu/eli/check_error.png http://coalmine.eu/eli/check_error_01.png

uaktags commented 5 years ago

So, a number of issues need to be addressed: https://github.com/uaktags/explorer/issues/14#issuecomment-493743951 shows an image where "workers needed" went negative. So something's wrong with that tracking and interaction.

https://github.com/uaktags/explorer/issues/14#issuecomment-493789240 I suppose you're saying that the explorer's Stats shows Last as 611828, but the latest height on the explorer's main screen ("Last Transactions") shows 612421? Or do you mean that Stats shows Last as 611828, but the actual latest block height from the wallet/daemon is 612421? This clarification is important, because Stats.last just means the last block height that was synchronized, not the highest block height on the chain. If instead Stats.last equals the highest block showing on the main explorer page, then it appears UpdateStats failed before the last error while updating the blocks, which goes back to the issue above with the workers going negative and the error message shown.

sync.js index check is actually a pretty dumb feature, IMO. It simply does the same thing as index reindex, just without clearing all records first. Iquidus never built any block/transaction tracking feature that would help find missing blocks/transactions. In theory, we could add a tracking measure such as "block X has Y transactions" while also storing the relation between block height, block hash, and the transactions we keep. For obvious reasons this would, one, grow our database a good amount, and two, Mongo isn't a relational DB. But it may be worth looking into.

The last two images appear unrelated to the sync script itself; instead they show a crash of Node.js and the explorer's cluster module for the web server (:3001). The JS stack trace shows you're out of memory, and the other image of the IPC channel shows that the cluster failed to launch a new forked process of the explorer (which is why I started removing the explorer calls from the sync script and going directly to the wallet instead).

vrfebet commented 5 years ago

To your question: the explorer's statistics show Last as 611828, but the latest block height from the wallet/daemon is 612421 (getblockcount). How does the explorer detect that a sync has finished (technically)? Is there any audit of whether all blocks and transactions are in the DB?

From my experience, "sync.js index check" works by adding missing blocks to the DB; I never realized it doesn't also check the transactions within each block!

As I wrote, the situation stabilized after the reboot. I don't know whether the explorer sets some flag that "sync.js index check" was interrupted, but when I started "sync.js index update" after the restart, every worker started searching the blockchain from the beginning (in 2000-block chunks) and adding missing blocks. So I want to ask: how does a worker decide where to start syncing after a reboot? Why didn't the workers search the blockchain from the start after the previous restarts?

uaktags commented 5 years ago

So the workers start from where Stats.last shows, and each worker is given 2000 blocks. Stats.last is the last block hit by the Stats.updateOne function in sync.js: https://github.com/uaktags/explorer/blob/68c499691435fe219f2d87bfb55175a1539523e4/scripts/sync.js#L252

My thinking is one of two things: either this update at the last worker isn't accurate (the last worker to finish may not hold the highest blocks), or when your crash/reboot happened, this line never fully ran, so it never updated to the last block.

vrfebet commented 5 years ago

After npm and the daemon crashed (I don't know why — the daemon log ends mid-date, and I don't know where to look for the npm/explorer log), the sync ended somewhere I can't pinpoint, but I have a screenshot from six hours before the reboot. (screenshot: last_01)

Here you can see how the synchronization does not resume where it was interrupted. All four workers sync approximately 160 blocks per hour, even if the values can't be perfectly balanced. Compare the previous block heights with the heights where they resumed. (screenshot: new_01)

This is probably related to the crash: the in-flight blocks were never properly saved, so gaps of missing blocks are created in the database. How do I backfill the missing blocks/transactions? It also means that when blocks are skipped, it's no longer possible to calculate the increments per time slot.

Would it be possible to log the number of processed blocks for all workers (as one combined number) to log.txt?

I'm missing the logs for the explorer and npm entirely (I'm not running "forever" at the moment — I know it keeps a log).

uaktags commented 5 years ago

The issue with the block before the crash and the first block after the new start falls into one of the two scenarios I described already, if you're using the clustered version.

Basically, you crashed before the last worker got a chance to run the Stats.updateOne({coin: coin}, { line. So your "last" isn't from the truly last worker running prior to the crash, but from the last worker that managed to call Stats.updateOne().

I don't have a way to handle the missing blocks/transactions yet. That would need to be designed, coded, and tested (I welcome any contributions via pull requests). I've suggested a method that may work, but I haven't coded any of it, and I don't expect to until next week (currently traveling in Germany).

It would definitely be possible to log the processed blocks to a text file, but I'd take the approach of only logging the following lines:
https://github.com/uaktags/explorer/blob/68c499691435fe219f2d87bfb55175a1539523e4/scripts/sync.js#L238
https://github.com/uaktags/explorer/blob/68c499691435fe219f2d87bfb55175a1539523e4/scripts/sync.js#L216
https://github.com/uaktags/explorer/blob/68c499691435fe219f2d87bfb55175a1539523e4/scripts/sync.js#L295

This way you can see when a Worker starts, when one finishes, and what Blocks they were responsible for.

However, a hard part of dealing with workers is that they start and stop out of order. For example:

Worker 1 has blocks 1-2000
Worker 2 has blocks 2001-4000

Worker 2 may finish before Worker 1 and then start on blocks 4001-6000. If the system crashes before Worker 1 completes, Stats.last will show "4000", because that was the last section of blocks to finish before the crash. You will not know that Worker 1 still has blocks remaining, or that Worker 2 had moved on and may have finished some of 4001-6000.
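To see the problem concretely: a hypothetical highestContiguous helper (not in the codebase) computes the safe resume point — the highest height below which every chunk finished — which is not what Stats.last records:

```javascript
// Given the chunks that finished before a crash, return the highest height
// such that everything from startHeight up to it is complete.
function highestContiguous(finishedChunks, startHeight) {
  const sorted = finishedChunks.slice().sort(function (a, b) {
    return a.start - b.start;
  });
  let last = startHeight - 1;
  for (const c of sorted) {
    if (c.start !== last + 1) break; // gap: a worker's range never finished
    last = c.end;
  }
  return last;
}
```

In the Worker 1 / Worker 2 example above, if only the 2001-4000 chunk finished, Stats.last says 4000 but the safe resume point is still 0.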

I can only think of two ways to fix this, and neither looks good to me:

1) Start writing block records to the DB, used only to record the completed block heights. This will make the DB grow bigger and add more DB overhead, but after a crash we can query the database for the block numbers that are missing (somehow).

2) Record the block numbers in a text file, have the workers load the file when they start to get their next X block numbers (lines), and delete each block (line) from the file when it's completed. After a crash, just load the lines that are still present and keep going.

vrfebet commented 5 years ago

How do I stop "sync.js" correctly before rebooting, so no blocks are lost?

uaktags commented 5 years ago

There isn't a way to gracefully exit the sync scripts. It was never designed as an interactive script, but as a cronjob / pseudo-service.

You just have to Ctrl-C, but you will miss blocks.

vrfebet commented 5 years ago

Hmm, I thought so — I need to upgrade the CPU...

Writing to the database seems like the better solution to me: the index could then be used to check for missing blocks/transactions, at the cost of more overhead.

vrfebet commented 5 years ago

Well, I used your hint and added logging to a file: https://github.com/vrfebet/explorer/pull/1/commits/bcadd220d14a211a701f51d6f6666c0b428a71f8. However, I don't know where to capture the transactions that print to the terminal; I'd like to log them to a separate log using "blocksLog". Can you help me? Thanks.

uaktags commented 5 years ago

You would need to extend the log4js setup to lib/database.js, as that's where those console.log calls for transactions are produced. Specifically https://github.com/vrfebet/explorer/blob/8ed38ea51b23a9d466bf41d0c5705fbbf869392d/lib/database.js#L808
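A configuration sketch of what that could look like, assuming log4js v2+; the appender names, categories, and file paths here are assumptions, not the explorer's actual setup:

```javascript
// Hypothetical log4js setup: route block output and database.js transaction
// output to separate files. Names and paths are illustrative.
const log4js = require('log4js');

log4js.configure({
  appenders: {
    blocksLog: { type: 'file', filename: 'logs/blocks.log' },
    txLog: { type: 'file', filename: 'logs/transactions.log' }
  },
  categories: {
    default: { appenders: ['blocksLog'], level: 'info' },
    database: { appenders: ['txLog'], level: 'info' }
  }
});

// In lib/database.js, grab the 'database' category and replace the
// console.log calls around save_tx with it:
const txLogger = log4js.getLogger('database');
// txLogger.info('saved tx ' + txid + ' in block ' + blockindex);
```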


vrfebet commented 5 years ago

So I started by upgrading the CPU to a Xeon (4 cores / 8 threads), assuming it would double the performance, but the desired effect didn't arrive: the CPU is only used a little over 50%, peaking at 62%. I don't know where the bottleneck is at this point. (screenshot: xeon-only-50)

Second, I changed the setting to "MaxPerWorker = 1" to prevent blocks being lost on restart, and I'm testing this setting. I run sync in "index check" mode; thanks to the block-only log, I can verify that synchronization resumes at the correct block after a restart and no block is omitted. https://www.youtube.com/watch?v=h3ITascD8JY&feature=youtu.be

The "check" synchronization works fine until it reaches the already-stored blocks, after which memory is exhausted and mongodb dies. Solving the problem of setting/limiting memory per process/worker could make this a fix for block loss on reboot. https://www.youtube.com/watch?v=8PZx0z88omw

In the right-hand shell you can clearly see that the crash occurs when the swap is exhausted to zero.

uaktags commented 5 years ago

See, I'm not getting the same results as you: my loss of blocks isn't from memory exhaustion but appears to be a counting issue in the forking process. I'm currently working on a custom "check" function that does a Tx.distinct("blockindex") to find out which block heights we actually have in the database, and then figures out the inverse (which blocks we're missing). All of my missing blocks appear to be the first few blocks of every new forked process.

For example, here's some blocks i'm missing:

[1,2,3,4,5,6,7,8,9,10,2001,2002,2003,2004,2005,4001,4002,4003,4004,4005,6001,6002,6003,6004,6005,8001,8002,8003,8004,8005,10001,10002,10003,10004,10005,10006,12001,12002,12003,12004,12005,14001,14002,14003,14004,14005,14006,16001,16002,16003,16004,16005,16006,18001, ...]

The current sync.js in the cluster-test branch doesn't have a "check" method in it (it was stripped out to be reworked). Once I have a working prototype (it probably won't be the final product I intend to have), I'll update the branch with a commit.

vrfebet commented 5 years ago

Temporary solution: using the instructions at https://askubuntu.com/questions/178712/how-to-increase-swap-space I simply increased the swap to 16GB (+16GB RAM), but memory was exhausted again... However, while loading blocks the application is now stable. So I've now increased the swap to 64GB, but I'll see the result in about 7-8 hours, when the sync hits the already-stored blocks. Syncing with eight workers (cores) uses 16GB RAM + 16GB swap while storing blocks, and swap use slowly increases. The high memory usage comes from checking already-stored blocks. The image shows many node processes that I can't identify. (screenshot: 64GB-swap)

vrfebet commented 5 years ago

If 4 workers each handle 2000 blocks (worker 1: 0-2000, etc.), then after a restart, worker 1 started from the last block of worker 4 — so all unfinished blocks of workers 2, 3, and 4 were lost, and that was hundreds of blocks. "check" mode, for me, normally finds unsaved blocks and skips saved ones... you can see it in the videos. I know you wrote that it doesn't work, but it does seem to work?!

vrfebet commented 5 years ago

Of course, the memory problems come from the "MaxPerWorker = 1" setting!!!

uaktags commented 5 years ago

I'm not exactly sure what you mean in your previous three messages, but I would say the answer is definitely not to increase your swap. Using swap means you've run out of RAM (and are now using the slower, drive-backed swap space as virtual memory). If there's a memory-management problem, that needs to be investigated, rather than adding complexity by throwing more resources at it to fail at a later date.

The testing I'm doing with Tx.distinct to get the missing blocks is working, but it's far from an acceptable approach; at the very least it needs to be rewritten/refactored. However, it's a simple solution in the grand scheme of things, described below:

- An array is created for the numbers 1 through Stats.last.
- Tx.distinct("blockindex") is run; during execution, we compare the results with the array above, remove the existing blocks, and are left with the "missing" blocks.
- From there, we spin up the cluster using this array as the dataset of blocks to sync, and go from there.

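The three steps above can be sketched as follows; findMissingBlocks is a hypothetical helper, and the Mongoose wiring shown in comments is an assumption:

```javascript
// Given the heights returned by Tx.distinct('blockindex') and Stats.last,
// return every height in 1..last that has no stored transactions.
function findMissingBlocks(storedHeights, last) {
  const seen = new Set(storedHeights);
  const missing = [];
  for (let h = 1; h <= last; h++) {
    if (!seen.has(h)) missing.push(h);
  }
  return missing;
}

// With Mongoose this might be wired up roughly as:
// Tx.distinct('blockindex', function (err, heights) {
//   const missing = findMissingBlocks(heights, stats.last);
//   // hand `missing` to the cluster workers as the dataset to sync
// });
```
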
uaktags commented 5 years ago

Okay, https://github.com/uaktags/explorer/commit/244d08e310b75ba217a96db6a122389cc88773d5 and the commit right before it should bring more stability to the update and check functions. However, in the future I'm going to rename check to "checkMissing", as my new method is completely different from the idea and purpose of the original check. The original check was basically a reindex that didn't delete anything; it merely started from block 0 and went all the way up, grabbing every transaction. My new check (checkMissing) searches for missing blocks rather than missing transactions.

vrfebet commented 5 years ago

Hi, this happened while running "check". Is it possible to resume the previous search after an interruption? When I ran "check" again after the crash, it started regenerating the block list from scratch... (though maybe that's better) (screenshot: new_check_error)

vrfebet commented 5 years ago

And another one: (screenshot: new_check_error2)

uaktags commented 5 years ago

Unfortunately, the screenshots aren't going to be much help at all. You have your own edits in place from your fork (no pull request has been made) and no mention of which commit you're on:

"added reindex to the refactor" — Tim Garrity committed 5 hours ago
"starting to refractor, i wouldn't use it right now" — Tim Garrity committed 5 hours ago

I hope you understand, but I really can't stop and go through each iteration of issues. This GitHub issue is for testing the cluster, and I either need solutions provided, or I need to keep coding through the issues I run into locally. I'd recommend that if you find issues, you create a new issue, try to post solutions, and provide a pull request to fix it. Note, though, that this is still a constantly updated/changing project, so I'd need to know which base commit you're working from in your issue.

Your screenshots already show two areas for you to look at. On line 437 you have an issue with the variable start in your logger message; that line doesn't match up in either the current commit or the one I referenced previously, so track that down. The second screenshot shows an issue with block heights 11026-11027 and the tx that's queried; track that down with some logging to find out what's going on.

vrfebet commented 5 years ago

To eliminate errors caused by my own edits, I went back to your original code. I don't have a log, but the errors can be seen in the video frames.

https://youtu.be/4OT1lHKodIg

I am interested in your project, and my intention is to keep you informed about testing it. So I'm returning to testing your original branch.

uaktags commented 5 years ago

I don't see an error — I see "done" at the end. This is not the latest commit, and it doesn't state which commit it is from. The issue your video does show is that numWorkersNeeded shows 1, but 4 jobs were generated instead.

I appreciate that, and by all means test it, but without help from testers, all I get is more unknown trouble. I could really use pull requests, ideas for solutions, and identification of the problems in the code (the problem alone doesn't help much if I have to investigate without access to the system producing the error) so that fixes can be developed.

uaktags commented 5 years ago

Currently running a reindex to see if I got everything without missing any blocks. I'm not counting on it, though. Expectations: the reindex will successfully run through, but there will be missing blocks from timeouts. However, we'll only need to run check (future: checkMissing) once for it to successfully grab all of those, then one final update to finish the sync if needed.