kramble / FPGA-Litecoin-Miner

A litecoin scrypt miner implemented with FPGA on-chip memory.
GNU General Public License v3.0

Multicore #1

Closed hagarthehorrible1 closed 11 years ago

hagarthehorrible1 commented 11 years ago

Dear Kramble: I was trying to run your code on my DE2-115 evaluation board. I managed to optimize it and got an output of 6.20 kHash/sec using the 1024kBit scratchpad, which would technically allow me to instantiate 3 hashcores, so I could reach 18.6 kHash/sec without external memory. I also succeeded in getting the 512kBit version to work, but with your forecast penalty of only 80% of the performance, so I got 5.02 kHash/sec for a single core; I could potentially instantiate 7 hashcores, bringing my output to 35.14 kHash/sec at a power consumption of around 7.5W.

But when I tried your approach to instantiate more hashcores (just 2 at this point, as a proof of concept), Quartus shows that the compilation went through, the allocated memory and logic elements doubled, and the hierarchy shows all the elements there; yet when I run your miner script it still reports 6.20 kHash/sec, making me believe that either the multicore instantiation is not kicking in or your mine.tcl is not capturing the extra jobs. I am not proficient with tcl, so I was wondering if you could help me here.

Also, although I am proficient with FPGAs, this is the first time I am using virtual wires in a project. I tried different approaches for the queue you suggested, using a flag to alternate the golden_nonce_out that is produced, but I've run out of ideas.

Is this the best way to contact you? Rgds, Hagar, the Horrible

kramble commented 11 years ago

Hi Hagar. Good to hear some feedback, and I'm very interested in your optimizations: that's three times the throughput I was getting! I'm assuming you're using the multicore example I included as a comment in ltcminer.v. You'll see that it reports the nonce back to the tcl driver under the id "NONC", but only for one of the cores, so no matter how many cores you instantiate it will still report the same hash rate (it's calculated in mine.tcl from the rate of change of the NONC value).

The golden_nonce is reported back under the id "GNOC", which is the important one to get queued, though at the hash rates we get here the simple approach of just latching the most recent one found by any core should work fine (the mine.tcl script just looks for a change in the value, then sends it back to the pool as a share claim). So, assuming I'm correct, you should already be hashing at the expected rate; it's just being mis-reported by the mining script. The way to be sure is to look at the share acceptance rate in the pool stats. At difficulty=32, 1 kHash/sec comes in at about 1.7 shares per hour (1000 * 3600 * 2048 / 2^32, where 2048 approximates 0x7ff, the 32-bit target for diff=32).

I'm pretty new to tcl too (I'm more used to C programming), and I find it pretty weird. The actual code is fpgaminer's, with a few tweaks of my own for litecoin to send the full data header and target rather than the midstate. I also added a test feature which I find pretty useful (eg for checking out overclocking) that sends getwork from a file rather than a live pool (there is one tiny bug in mine.tcl's error reporting, but I'll fix that later today).
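
To be concrete about the latching, something like this is all I have in mind (signal names invented for illustration; this is not the actual ltcminer.v code):

```verilog
// Hypothetical sketch, not the repo's actual code: collect the most
// recently found golden nonce from any core into the single
// golden_nonce_out that mine.tcl reads back under id "GNOC".
module gn_latch (
    input             clk,
    input      [31:0] gn_0,        // golden nonce from core 0
    input             gn_valid_0,  // strobes high when core 0 finds one
    input      [31:0] gn_1,        // golden nonce from core 1
    input             gn_valid_1,
    output reg [31:0] golden_nonce_out
);
    always @(posedge clk) begin
        // Simple overwrite, no queue: at a few kHash/sec, shares are
        // hours apart, so collisions between cores are negligible.
        if (gn_valid_0) golden_nonce_out <= gn_0;
        if (gn_valid_1) golden_nonce_out <= gn_1; // later core wins ties
    end
endmodule
```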

My approach to multicore is pretty crude. If you look at ngzhang's or teknohog's bitcoin miner code you'll see they divide up the nonce range algorithmically between cores, while I just hard-code the top few bits (which is fine in practice, as it does not matter which nonces we test so long as there is no overlap between cores). Their serial comms does not report the nonce back anyway, so there is nothing for their driver to display here.
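
In Verilog the hard-coding amounts to no more than this (port names invented for illustration; the real hashcore interface differs):

```verilog
// Illustrative sketch only: partition the 32-bit nonce space by
// hard-coding the top two bits of each core's nonce, so no two
// cores ever test the same nonce.
wire [31:0] gn0, gn1, gn2, gn3;
wire        gnv0, gnv1, gnv2, gnv3;

hashcore core0 (.clk(clk), .data(data), .nonce_msb(2'd0), .golden_nonce(gn0), .gn_valid(gnv0));
hashcore core1 (.clk(clk), .data(data), .nonce_msb(2'd1), .golden_nonce(gn1), .gn_valid(gnv1));
hashcore core2 (.clk(clk), .data(data), .nonce_msb(2'd2), .golden_nonce(gn2), .gn_valid(gnv2));
hashcore core3 (.clk(clk), .data(data), .nonce_msb(2'd3), .golden_nonce(gn3), .gn_valid(gnv3));

// inside hashcore, the full nonce is the fixed top bits glued onto
// a free-running 30-bit counter:
//   wire [31:0] nonce = { nonce_msb, nonce_cnt[29:0] };
```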

Looking at enhancements for multiple cores, we could try reducing the scratchpad further, to 256kbit, which would double the number of cores yet again. This might start to hit the limit on available LEs, so it would be useful to save some logic by sharing the PBKDF2_SHA256 engine between all the cores rather than having one each. There are also a lot of registers in this design, which we might be able to reduce somewhat, eg by storing X0Save and X1Save in a scratchpad RAM slot rather than a register (though at the cost of extra interpolation for the lost locations).

Anyway, thanks again for the feedback. It's probably best to keep the discussion public via github, but you could PM me on the litecoin forum if you'd prefer a private message. I'm on UK time here, so there may be some delay in replies.

hagarthehorrible1 commented 11 years ago

Dear Kramble:

Thanks for your detailed reply. My reported output is using just one core. I instantiated a second core but did not test it, because I was trusting the output report from your tcl file. Based on your information, I will try 6 cores and check the pool report; I will let you know if it works.

Using 1024kBit of memory per scratchpad we can instantiate only 3 cores. Using 512kBit that number goes to 7, but then we are limited by logic elements, which allow us to instantiate only 6 cores. So your thought about sharing some logic might save enough space to instantiate another couple of cores. What performance penalty do you expect for the 256kBit version compared to the 512kBit one? Another 20% down? This would only be worth the time if the sharing idea improves logic element utilization to the point where we could instantiate more than 9 cores... I am still working on some optimizations inside the FPGA, and I will be more than happy to share the final results with you when they are concluded.
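
For reference, those core counts come straight from the block RAM budget: the EP4CE115 on the DE2-115 has 3,888 kBit of embedded memory (a datasheet figure), so assuming essentially all of it goes to scratchpads:

$$\left\lfloor \tfrac{3888}{1024} \right\rfloor = 3 \text{ cores (full scratchpad)}, \qquad \left\lfloor \tfrac{3888}{512} \right\rfloor = 7 \text{ cores (half scratchpad)}$$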

Rgds,

Hagar, the Horrible

kramble commented 11 years ago

256kBit RAM should give the same result as LOOKUP_GAP=4 on a GPU. I remember seeing a table of these but can't find it right now, so I'll have a go at calculating it: 1024 cycles (scratchpad build) plus 1024 * (1 + 2 + 3 + 4) / 4 for the mix gives 3.5 * 1024 cycles, compared with 2 * 1024 for a full scratchpad, so it's 4/7 of the speed; but we get four times the cores, so overall 16/7 the throughput of a full scratchpad. I'll look forward to seeing your optimizations. I'm pretty much an amateur with FPGAs, so I'm not at all surprised you've significantly improved on my performance, but I'm happy to learn! Good luck.
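
PS. Writing that out as a general formula (same numbers, just tidier): with lookup gap $g$, a stored entry is at most $g-1$ salsa passes away, so the mix phase averages $(1+2+\cdots+g)/g = (g+1)/2$ passes per iteration, giving

$$\text{cycles}(g) = \underbrace{1024}_{\text{build}} + 1024\cdot\frac{g+1}{2}$$

That's 2 * 1024 for $g=1$, 2.5 * 1024 for $g=2$ (the 80% figure I forecast for 512kBit) and 3.5 * 1024 for $g=4$. Total throughput with $g$ times the cores is $g \cdot 2048/\text{cycles}(g)$: 1.6x for 512kBit and $16/7 \approx 2.3$x for 256kBit.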

hagarthehorrible1 commented 11 years ago

Kramble:

What I did was some careful floorplanning of the core instances. If you constrain each instance to a certain "real estate" on the chip, this sometimes optimizes the tool's effort in routing the signals. With a little experience you can work some miracles, but the problem is that physics has limits... :) ... My experience says that once you use more than 90 to 92% of the chip's LEs you start observing strange behaviors. When I tried to instantiate the seventh core on the DE2-115, utilization reached 97% and all hell broke loose: the tool took 19 hours to route the signals and timing was not met. So I believe that for this chip the limit may be 6 cores, unless there is some optimization in the algorithm itself, but that is just my guess.

Regarding the clock, I tried to stretch it up to 109MHz (the chip's limit) so as not to miss any work, but the results were 100% rejected. I am doing some experiments to find the best balance between clock frequency, number of cores instantiated and rejections. I will keep you posted.

And I still have not started playing with the SoC kit. I attended an Arrow workshop where they were offering the kit at US$ 99.00 (it was the launch promotion for the chip, in June of this year)... :) ... Now you can pre-order the kit at US$ 299.00 (still an excellent deal), but you will have to wait another 2 months to get it... :( ...

Rgds,

Hagar, the Horrible

kramble commented 11 years ago

Great, thanks for that info. I found out about the Arrow workshops just yesterday by googling the SoCkit; looks like you got a great deal. I had a look through the lab material, which is mostly about the A9 cores rather than the FPGA, and it goes way above my head. Arrow's EU site is listing them for back-order at GBP 214 (not quite such a good deal, but still good), but with a 12-week lead time.

Regarding the 6 kHash/sec from a single core: that would kind of fit with 109MHz, but I'm really surprised that my salsa-mix block will run at that speed, as its propagation delay is reported at around 40ns. Did you custom-place it to get the speedup, or just fully pipeline the stages (8 add/xor/shifts times four iterations gives 32 clock cycles per blockmix)?
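
By fully pipelined I mean one register per add/xor/rotate step, ie something along these lines (a sketch to show the idea, not the salsa-mix code in this repo):

```verilog
// Sketch of one registered salsa20 step, out <= b ^ ((a + c) <<< ROT).
// 32 such stages back-to-back would correspond to the 32 cycles per
// blockmix mentioned above. Illustration only, not this repo's code.
module salsa_arx #(
    parameter ROT = 7                // rotate distance: 7, 9, 13 or 18
) (
    input             clk,
    input      [31:0] a, c,          // the two words to be summed
    input      [31:0] b,             // the word being updated
    output reg [31:0] out
);
    wire [31:0] sum = a + c;
    always @(posedge clk)            // one pipeline register per stage
        out <= b ^ { sum[31-ROT:0], sum[31:32-ROT] };
endmodule
```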