kramble / FPGA-Litecoin-Miner

A litecoin scrypt miner implemented with FPGA on-chip memory.
GNU General Public License v3.0

LX150 possible issue #2

Closed: razorfish-sl closed this issue 10 years ago

razorfish-sl commented 11 years ago

OK, ran the LX150 code on a single device, but set it to one internal core.

'./ltcminer.py' does not seem to return a valid hash every time when testing, but it IS correct when it does. (Ignore the kh/s figures, they are nonsense... I wish they were not.)

```
./ltcminer.py
Miner started on Sun Aug 11 07:19:17 2013
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Sun Aug 11 07:19:26 2013 nonce 0000318f
Share found on Sun Aug 11 07:19:26 2013 nonce 0000318f
Upstream result: False [0 accepted, 1 failed, 199.73 +/- 199.73 khash/s]
Upstream result: False [0 accepted, 2 failed, 398.73 +/- 281.95 khash/s]
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Sun Aug 11 07:20:38 2013 nonce 0000318f
Share found on Sun Aug 11 07:20:38 2013 nonce 0000318f
Upstream result: False [0 accepted, 3 failed, 76.41 +/- 44.11 khash/s]
Upstream result: False [0 accepted, 4 failed, 101.85 +/- 50.92 khash/s]
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Sun Aug 11 07:26:21 2013 nonce 0000318f
Upstream result: False [0 accepted, 5 failed, 24.68 +/- 11.04 khash/s]
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Sun Aug 11 07:26:30 2013 nonce 0000318f
Share found on Sun Aug 11 07:26:30 2013 nonce 0000318f
Upstream result: False [0 accepted, 6 failed, 28.94 +/- 11.81 khash/s]
Upstream result: False [0 accepted, 7 failed, 33.76 +/- 12.76 khash/s]
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Sun Aug 11 07:26:40 2013 nonce 0000318f
Upstream result: False [0 accepted, 8 failed, 37.75 +/- 13.35 khash/s]
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Sun Aug 11 07:29:36 2013 nonce 0000318f
Share found on Sun Aug 11 07:29:36 2013 nonce 0000318f
Upstream result: False [0 accepted, 9 failed, 30.44 +/- 10.15 khash/s]
Upstream result: False [0 accepted, 10 failed, 33.82 +/- 10.70 khash/s]
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Sun Aug 11 07:29:46 2013 nonce 0000318f
Share found on Sun Aug 11 07:29:46 2013 nonce 0000318f
Upstream result: False [0 accepted, 11 failed, 36.62 +/- 11.04 khash/s]
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Upstream result: False [0 accepted, 12 failed, 39.94 +/- 11.53 khash/s]
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Sun Aug 11 07:30:16 2013 nonce 0000318f
Share found on Sun Aug 11 07:30:16 2013 nonce 0000318f
Upstream result: False [0 accepted, 13 failed, 41.29 +/- 11.45 khash/s]
Upstream result: False [0 accepted, 14 failed, 44.46 +/- 11.88 khash/s]
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Sun Aug 11 07:31:07 2013 nonce 0000318f
Share found on Sun Aug 11 07:31:07 2013 nonce 0000318f
Upstream result: False [0 accepted, 15 failed, 44.21 +/- 11.42 khash/s]
Sending data to FPGA Payload 000007ff000000007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Upstream result: False [0 accepted, 16 failed, 47.16 +/- 11.79 khash/s]
```

kramble commented 11 years ago

Thanks for testing this. The kh/s will indeed be nonsense, as it's reporting the average for actual shares found (unlike the Altera JTAG mine.tcl, which monitors the internal nonce counter), so when sending it test data it's going to be way overestimating the rate.

It looks like the serial interface is misbehaving, as each test payload should instantly [ADDENDUM: after 30 seconds, see below] return a match, but it just seems random here. Possibly the baud rate timing is off slightly; anyway, I'll look into it. I can do some limited testing on my LX9 board (using a dummy hasher, as the full one won't fit), but I only did this at 4800 baud as that is how my setup is currently configured (long story involving a slow opto-isolator circuit). I'll modify it to run at 115200 and see if I can replicate the problem.

Alternatively it could just be the startup nonce value. I noticed on the Altera JTAG interface that I needed to subtract a few from the test nonce before it would match. I haven't got to the bottom of this yet (just fudged it in jtag_comm.tcl), but you may want to modify the test data slightly (change the nonce from 318f to 3100) and see if that fixes it. No, sorry, thinking out loud: it's set to 00000000 in the test payload anyway, so it's NOT going to return the match instantly; in fact it should delay around 30 seconds if it's running at 25MHz. I'll change that and put a sensible starting nonce in there. [ADDENDUM: actually it shouldn't be working at all with the current timings, as ltcminer.py will send new work before it has a chance to reach the target nonce. So we may have two bugs interacting to produce a valid result! Bung 3100 in there and see whether it starts returning the match instantly.]

And on the third hand, it could be a clock speed issue, though 25MHz shouldn't be stressing it. Or maybe my clock-domain crossing logic is bad (it's the first time I've tried this).

Anyway, it just goes to show how difficult it is to do a blind port without having the target board in hand to test on. Perhaps I'll get myself an LX75 dev board after all (no point getting an LX150 at this stage, as I'll only have a 30-day window to play with it).

Thanks again, I'll get to work on this.

razorfish-sl commented 11 years ago

Yep, it does delay before the nonce is returned the first time, and it ALWAYS works the first time (I uncommented the debug lines, as you can see). It may be worth disabling the code for multicore and the secondary FPGA whilst it gets debugged, then once it is solid re-enable the multicore and finally the secondary FPGA.

It is not the 25MHz; it can safely route at 35MHz, and there are NO timing failures...

kramble commented 11 years ago

OK, good. If you're running at 35MHz that explains why we're seeing a match so quickly; it's probably just on the 20-second askrate (which may be a lucky coincidence!). Can you try it with the 3100 starting nonce? Just put it in the test_payload, changing it so it ends "...717e00310000ff070000" (I think that's right, not tested it yet).
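For reference, a sketch of what that byte fiddling amounts to. This is a hypothetical helper, not actual ltcminer.py code; it assumes (from the dumps in this thread) that test_payload is stored byte-reversed relative to the printed payloads, so the little-endian start nonce occupies the 4 bytes just before the trailing target field (ff070000):

```python
# Hypothetical helper, not from ltcminer.py. Assumed layout of the reversed
# payload: [... block data ...][4-byte little-endian start nonce][4-byte
# reversed target = ff070000].
def patch_start_nonce(payload_hex, nonce):
    data = bytearray.fromhex(payload_hex)
    data[-8:-4] = nonce.to_bytes(4, "little")  # 0x3100 -> 00 31 00 00
    return data.hex()
```

With nonce 0x3100 the payload then ends "...00310000ff070000", matching the string above.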

Actually, thinking back, I did see some odd behavior in my LX9 testing. The first work would match, but then I'd need to send work with a different starting nonce in order to get it to match again. This is the default behavior for the Altera port (since the virtual wire does not provide a strobe to indicate new data, I just look for the nonce value changing). Perhaps my clock crossing is indeed at fault here and the loadnonce strobe is not working (it's only a test feature; during live mining the actual nonce value is irrelevant).

OK, found it. I was being an idiot. The uart runs at 100MHz, but the rx_done strobe is only set for one clock cycle, and I then sync this to the hash_clk at 25/35MHz. You see the problem: a one-in-four chance of seeing the strobe! Hence the apparent randomness of the hash results. It won't affect live mining though. I'll go back and re-read the tutorial on clock crossing; it's got just the code I need to do this properly (I only read it after I'd written my code, and being lazy I didn't bother to go back and implement it properly).
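For anyone following along, the textbook fix is a toggle-based pulse synchronizer. A minimal sketch, with module and signal names of my own choosing rather than the repo's:

```verilog
// Convert the one-cycle strobe to a level toggle in the fast domain,
// double-flop the level into the slow domain, and edge-detect it there.
// The toggle holds its value until the next event, so the slow clock
// cannot miss it (events just need to be a few slow cycles apart, which
// is easily true for UART byte strobes).
module strobe_cdc (
    input  wire clk_fast,    // e.g. the 100MHz uart clock
    input  wire clk_slow,    // e.g. hash_clk at 25/35MHz
    input  wire strobe_in,   // one clk_fast cycle wide (like rx_done)
    output wire strobe_out   // one clk_slow cycle wide
);
    reg toggle = 1'b0;
    always @(posedge clk_fast)
        if (strobe_in)
            toggle <= ~toggle;

    reg [2:0] sync = 3'b000;
    always @(posedge clk_slow)
        sync <= {sync[1:0], toggle};

    assign strobe_out = sync[2] ^ sync[1];  // any toggle edge = one event
endmodule
```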

Fixed that (pending simulation and testing on LX9), pretty straightforward. The hub core is more challenging (same problem crossing clock domains, but in reverse), but it explains why you were seeing TWO matches in a row. It's done properly in ngzhang's code, so I could just copy that (at risk of breaking GPL, since he's got some weird license disclaimer on there); better to roll my own, I think. I'm almost tempted to ditch the uart_clk and just use hash_clk throughout (that's what the original bitcoin miner serial code did, so I'm not sure why teknohog changed it). It does seem a bit of a cop-out though (it's not like clock crossing is a complicated thing to get right), but it will simplify the DCM too, so I think I will do that. AHA, teknohog didn't change it, it was ngzhang. That clinches it then: the uart_clk is going.

kramble commented 11 years ago

Just pushed a new version of the LX150 code with everything clocked by hash_clk. Compiled it on LX75 up to starting global placement (no point trying to get it to route), no errors. Tested the serial interface on LX9 at 4800 baud and it looks good (I use a hacked scrypt algorithm which only omits the salsa mixing step, so it's a fair test). Currently trying to get ISE to compile it at 115200 baud, but it's having a strop with the routing. It really is pants: one tiny change and sorry, can't route that.

Well, it compiled at 57600 baud, but it's producing garbage. It could just be my serial I/O connection to the Raspberry Pi (there was a good reason I was running at 4800 baud using an opto-isolator to do the level translation), as I can't see why the FPGA code would be broken at 57600 baud but OK at 4800. I'll look at it further tomorrow as it's getting late here.

razorfish-sl commented 11 years ago

lol, it is not a small change: the design has suddenly gone from 3 different clock domains to the WHOLE design having to meet a single timing constraint.

Usually it is not a good idea to use a single clocking frequency in this situation. Later, if you need to use a variable PLL to reduce errors but increase nonce production, it will break the UART code.

I had a hell of a problem with a single source clock on bitcoin code, because the whole massive design had to meet a single timing constraint.

razorfish-sl commented 11 years ago

Looking better... (kh/s timings are obviously out...)

```
./ltcminer.py
Miner started on Mon Aug 12 08:00:15 2013
Sending data to FPGA Payload 000007ff000031007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Mon Aug 12 08:00:17 2013 nonce 0000318f
Upstream result: False [0 accepted, 1 failed, 948.82 +/- 948.82 khash/s]
Sending data to FPGA Payload 000007ff000031007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Mon Aug 12 08:00:18 2013 nonce 0000318f
Upstream result: False [0 accepted, 2 failed, 1326.97 +/- 938.31 khash/s]
Sending data to FPGA Payload 000007ff000031007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Mon Aug 12 08:00:19 2013 nonce 0000318f
Upstream result: False [0 accepted, 3 failed, 1537.43 +/- 887.64 khash/s]
Sending data to FPGA Payload 000007ff000031007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Mon Aug 12 08:00:19 2013 nonce 0000318f
Upstream result: False [0 accepted, 4 failed, 1648.15 +/- 824.07 khash/s]
Sending data to FPGA Payload 000007ff000031007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Mon Aug 12 08:00:20 2013 nonce 0000318f
Upstream result: False [0 accepted, 5 failed, 1718.79 +/- 768.67 khash/s]
Sending data to FPGA Payload 000007ff000031007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Mon Aug 12 08:00:21 2013 nonce 0000318f
Upstream result: False
```

It was a complete B*****rd to route, because now all the logic has to match the same clock rate.

And since it all has to match, it pushed the timings down on the hash core.

I think a small dual-clocked FIFO for the comms may be in order (core wizard); that would allow valid nonces to be shoved in but clocked out at a different speed. Or return to ngzhang's code.
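A sketch of how such a core might be dropped in. The module name and surrounding signals are hypothetical; the port set (wr_clk/rd_clk, din/dout, wr_en/rd_en, full/empty) is the usual native interface of a Core Generator independent-clocks FIFO:

```verilog
// Hypothetical hookup: a dual-clock FIFO carrying 32-bit golden nonces
// from the hash clock domain to the serial clock domain.
wire        fifo_empty;
wire [31:0] nonce_to_uart;

nonce_fifo golden_nonce_fifo (
    .rst    (reset),
    .wr_clk (hash_clk),               // write side: hasher domain
    .wr_en  (golden_nonce_match),     // strobe when a share is found
    .din    (golden_nonce),
    .rd_clk (uart_clk),               // read side: serial domain
    .rd_en  (tx_idle && !fifo_empty), // pop when the transmitter is free
    .dout   (nonce_to_uart),
    .full   (),                       // overflow just drops a share
    .empty  (fifo_empty)
);
```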

kramble commented 11 years ago

Thanks, lol, at least I didn't break it! I'm still pretty much a novice at this, and your advice is very much appreciated. I'll go back to using separate clock domains. The FIFO idea looks interesting, though it seems a bit overkill. Anyway at least I know the code does work on LX150.

The multicore approach does seem dead in the water, given the routing problems (would using separate clock regions, or even DCMs for each core, help?). Perhaps we need to look at using a deeper RAM, say 4k by 1024, and a single fully pipelined salsa blockmix, pushing 8 streams of the lookahead_gap=2 algorithm through a single core? Though that's pretty complicated; best to start with 4 streams of the full scratchpad and get that working first. I was just starting to look at this in my experimental branch (though just using 512kbit of RAM on the DE0-Nano), but I'll put some more effort into this now.

razorfish-sl commented 11 years ago

Trying to increase the clock rate resulted in the design failing to route after 10 hours (normally it is done in under 1 hour), so a single clock with that Verilog is not an option.

PlanAhead partitioning does not help either (other than keeping the UART code out of the way).

A FIFO is only overkill if you don't have the resources spare (unused FPGA resources are a waste), but if you take a look at the emailed link (the UART one), you will see stable domain crossing implemented without a FIFO. Hard-coding a small distributed RAM FIFO is not a big deal. It may be worth looking at breaking up the massive nets coming out of the UART RX into 8-16 bits; it might allow 'stacking' of a secondary reset job signal for block changes, and it might also fix the inability to route the design (it is easier to clock-align 8-16 signal bits than 670...).

The 'Dethstar'. The yellow is the UART RX compared to the rest of the design:

[image: uart]

The 'effect' of 670 outputs on routing:

[image: uart2]

The extra 41 'unload' clock cycles required are negligible compared to the delays incurred by the UART, especially as each 8/16 bits can be transferred whilst waiting for the next...

Multicore is fine, but the issue is the physical layout of RAM on these devices (note it is highly dependent on chip selection). Since the RAM is physically arranged in strips, the router packs the cores close together, but then has to insert massive routes to link up the RAMs (ram1-4); as much as 57% is wasted on the routing. The other issue is RAM 'mismatch': each FPGA has a particular bit grouping it likes to work with, i.e. x9, x18, x36, x72. If you work outside the natural structure of the physical RAM design, then the tools have to do 'stupid' things with the logic to get what you want.

kramble commented 11 years ago

Thanks, great insight, I was already using the uart rx/tx modules from that link (actually teknohog's fork) rather than the fpga4fun ones in the older fpgaminer code, but not the top-level comm_uart controller. I'll look to get that integrated in my code (I can't get the core wizard to work for me ATM, probably just need to rerun the installer to enable the IP). For the moment I've just done the clock crossing on the strobes (loadnonce and golden_nonce_match), which will be fine timing-wise as the data is stable whole clock cycles in advance, but of course the timing analyzer does not know that and is going to try to route them within whatever default constraint it uses. I guess it must be possible to tell it "don't care" in the UCF somehow.
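For what it's worth, the UCF mechanism for this is TIG (timing ignore). A hedged sketch with invented net and TNM group names; the real ones would come from the synthesized netlist:

```
# Hypothetical UCF fragment (names invented for illustration). TIG tells the
# timing tools to ignore paths between these register groups; safe here only
# because the bus is stable for whole cycles before the synchronized strobe
# samples it.
INST "uart/input_buffer*" TNM = "tnm_uart_data";
INST "hashcore/data_reg*" TNM = "tnm_hash_data";
TIMESPEC "TS_cdc_ignore" = FROM "tnm_uart_data" TO "tnm_hash_data" TIG;
```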

I'm going to try to concentrate on the salsa mix pipelining. Some interesting info from hagar over on the Cyclone V thread, hopefully he'll come through with some detail on his mods.

[EDIT] Further thoughts on the uart (input). The only thing that actually belongs in the uart_clk domain is the uart_receiver (single byte output). Everything else (the assembly of the 672-bit input data) really belongs in the hash_clk domain, as that is its destination. This actually fits nicely with your FIFO suggestion of piping the data 8 bits at a time between the clock domains. The trouble with that analysis is that it looks almost exactly like the single clock domain that we have seen does not route! So the problem is not with the uart itself, but with attaching the (almost static) 672 input bits onto the hasher core. Now that stumps me, as we've seen that putting this input into a separate clock domain helps the router, even though it does not logically belong there! So what is it "thinking" when it can route 672 bits being clocked between the domains, but not the same bits all in one domain? This really belongs in the "expert" domain of coaxing the desired performance out of a flawed tool (voodoo is a term I've used before for optimizing ISE, and I'll repeat it again here).
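A sketch of that split, with all names hypothetical: only the uart_receiver stays on uart_clk, one byte at a time crosses the domains, and the 672-bit buffer is assembled as a byte-wide shift register entirely inside hash_clk:

```verilog
// Hypothetical byte-wide work assembler living in the hash_clk domain.
module work_assembler (
    input  wire         hash_clk,
    input  wire         byte_strobe,  // rx byte-ready, already synchronized
    input  wire [7:0]   rx_byte,      // stable while byte_strobe asserts
    output reg  [671:0] work_data,    // 84-byte work payload
    output reg          work_ready    // pulses when the 84th byte lands
);
    reg [6:0] byte_cnt = 7'd0;        // counts 0..83

    always @(posedge hash_clk) begin
        work_ready <= 1'b0;
        if (byte_strobe) begin
            work_data <= {rx_byte, work_data[671:8]};  // LSB-first shift
            if (byte_cnt == 7'd83) begin
                byte_cnt   <= 7'd0;
                work_ready <= 1'b1;
            end else
                byte_cnt <= byte_cnt + 7'd1;
        end
    end
endmodule
```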

razorfish-sl commented 11 years ago

The 672 wires are connected to the UART and then loaded from 672 registers which had to be routed within the single-clock timings; that's why I think it was failing (as opposed to 672 registers routed at a lower clock rate).

I also played about with the instantiation of the 'rams' for Xilinx (xilinx_ram.v), and even though it is not supposed to be the 'optimal' way (according to Xilinx), I found out that for this design it is... If you actually instantiate the rams the way Xilinx says is the 'fastest' (pipelined registers), then it knocks a good 10MHz off the max frequency...

It appears that if 'q' is registered and 'absorbed' into the ram block, it is 10MHz slower than if 'raddr_reg' is absorbed!!! I suspect it is again because 'q' would generate 256 registers in the Xilinx optimization. So... it appears large groups of registers are bad!!!
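For concreteness, the two styles being compared look roughly like this (module and port names are mine; both are standard XST block RAM inference templates):

```verilog
// Style 1: registered read address, combinational read of the array.
// raddr_reg is absorbed into the BRAM's address register; this was the
// faster variant in this design.
module ram_addr_reg #(parameter AW = 10, DW = 256) (
    input  wire          clk,
    input  wire          we,
    input  wire [AW-1:0] waddr,
    input  wire [AW-1:0] raddr,
    input  wire [DW-1:0] din,
    output wire [DW-1:0] q
);
    reg [DW-1:0] mem [0:(1<<AW)-1];
    reg [AW-1:0] raddr_reg;
    always @(posedge clk) begin
        if (we) mem[waddr] <= din;
        raddr_reg <= raddr;      // absorbed into the BRAM address port
    end
    assign q = mem[raddr_reg];
endmodule

// Style 2: registered output. q is absorbed as the BRAM output register
// (the style Xilinx recommends for speed, but ~10MHz slower here).
module ram_q_reg #(parameter AW = 10, DW = 256) (
    input  wire          clk,
    input  wire          we,
    input  wire [AW-1:0] waddr,
    input  wire [AW-1:0] raddr,
    input  wire [DW-1:0] din,
    output reg  [DW-1:0] q
);
    reg [DW-1:0] mem [0:(1<<AW)-1];
    always @(posedge clk) begin
        if (we) mem[waddr] <= din;
        q <= mem[raddr];         // absorbed into the BRAM output register
    end
endmodule
```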

As regards hagar getting 6kh/s by constraining the design: really, that should be the last resort, otherwise every little change you make will need to be re-constrained, which is something that cannot be done or verified in simulation, meaning that the development cycle goes out the window.

kramble commented 11 years ago

Yeah, I see where you're coming from with the uart. It's complete overkill to clock it at 100MHz. I think ngzhang used a div2 on the osc clock to 50MHz, which I stupidly took out on my port (I was just doing a quick'n'dirty to get something up on github, and didn't want to use his exact DCM code due to worries about GPL, so based it on teknohog's instead). Even so, the single clock domain is much slower, which ought to help, but of course does not. I'll have a go with the 8-bit datapath idea and see what timings it comes up with (I can test compile on an LX75 part, which is probably close enough). But I'm really coming to the conclusion that the router results are just totally random, and making a trivial change can make a huge difference (it probably just changes the seed placement). I need to get my head around the floorplanning; at least that allows the creation of different seed placements without having to tweak the design logic arbitrarily.

The ram design was just blind luck. I couldn't get the core generator wizard to work, so I just googled for alternatives and that design came up, which exactly matched what I was doing in the Altera port.

razorfish-sl commented 11 years ago

Floor planning is unlikely to work to any great extent. The main problem (in the case of this Xilinx chip) is the distance between the rams:

[image: ram]

You can see the main salsa code in the middle, which has then spread out in a star shape to encompass the rams it needs (distributed RAM may give slightly better results), or some sort of salsa partitioning to break the direct interrelation with the rams.

And here a single route for the feedback data:

[image: feedback]

Hence the 'slow' speed of the code...

kramble commented 11 years ago

Yeah, but I'm just thinking about different seed values for the placement, not manually placing the ram and salsa logic. Altera Quartus supports setting a seed value, but I haven't found anything similar in Xilinx yet. The reason I'm thinking this way is my LX9 compilations at different baud rates. The change to the logic is utterly trivial (just some register taps), but it makes a huge (and random) difference to the time taken to route the design. But I'm just a newbie here, so I guess I could be completely wrong about this.

EDIT: That's not to say that changing the architecture won't help. I've done some experiments with subsuming some of the registers into RAM (it just costs a few extra cycles), which may help with those huge register widths. Lots to play with. We may make some progress yet.

razorfish-sl commented 11 years ago

Xilinx warn about this, because it can change between different versions of the tools... and yes, it 'appears' completely random. I once gained a 25% speed increase by adding a ChipScope core to a bitcoin design... so I left it in...

What I think you are talking about as regards 'seeds' is called 'costings' on Xilinx: you tell the tools to cost up each route, then modify the costings for the ones you want to be faster.

kramble commented 11 years ago

Then Xilinx are just full of s**t. It seems the tools haven't improved that much in the 20 years since I was involved in the ASIC biz. If its results are just randomly based on a seed, then iterate it until you get the result you want (that's probably what SmartXplorer is really doing behind the scenes). That cheers me up no end. There is promise there (and it explains ngzhang's comment on his github about running SmartXplorer 100 times to get a good placement). Shame it will take months on a single PC though!

AHA. Thanks. I've seen mention of costing tables, so that explains where it fits in. So much to learn here (honestly, if it wasn't a hobby I'd have called it hard work :-)

razorfish-sl commented 11 years ago

Yep, Xilinx don't even give you the decency of a 'reach-round'; SmartXplorer is a pain as well... But it is something you can stick on a CV...

Anyway, just got your new code compiled and tested on a single core...

```
./ltcminer.py
Miner started on Wed Aug 14 11:50:44 2013
Sending data to FPGA Payload 000007ff000031007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Wed Aug 14 11:50:45 2013 nonce 0000318f
Upstream result: False [0 accepted, 1 failed, 1265.04 +/- 1265.04 khash/s]
Sending data to FPGA Payload 000007ff000031007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Wed Aug 14 11:50:46 2013 nonce 0000318f
Sending data to FPGA Payload 000007ff000031007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Wed Aug 14 11:50:47 2013 nonce 0000318f
Upstream result: False [0 accepted, 2 failed, 1203.63 +/- 851.10 khash/s]
Sending data to FPGA Payload 000007ff000031007e71441b141fe951b2b0c7dfc791d4646240fc2a2d1b80900020a24dc501ef1599fc48ed6cbac920af75575618e7b1e8eaf0b62a90d1942ea64d250357e9a09c063a47827c57b44e01000000
Share found on Wed Aug 14 11:50:48 2013 nonce 0000318f
Upstream result: False [0 accepted, 3 failed, 1728.98 +/- 998.23 khash/s]
```

The serial code: /input_copy is currently sitting at 9.559ns, with 92% of that as routing (actual logic delay is 0.763ns); /input_buffer is at 9.388ns, with 91% of that as routing.

So .... that is an 'only just' under 10ns....

kramble commented 11 years ago

I'll get the DCM changed to drop it to 50MHz. No point in overclocking the uart 8)

EDIT: I've actually used CLKDV_DIVIDE(8.0) to give 12.5MHz, which should be plenty for 115200 baud. Unless you can see a problem with doing this?
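Something along these lines, presumably. A hypothetical sketch, not the repo's actual DCM code (instance and signal names invented):

```verilog
// Spartan-6 DCM_SP deriving a 12.5MHz uart_clk from the 100MHz input,
// with the usual CLK0 -> BUFG -> CLKFB feedback.
wire clk0, clk0_buf, clkdv;

DCM_SP #(
    .CLKIN_PERIOD (10.0),  // 100MHz oscillator
    .CLKDV_DIVIDE (8.0)    // CLKDV = 100MHz / 8 = 12.5MHz
) dcm_uart (
    .CLKIN    (osc_clk),
    .CLKFB    (clk0_buf),
    .CLK0     (clk0),
    .CLKDV    (clkdv),
    .PSCLK    (1'b0),
    .PSEN     (1'b0),
    .PSINCDEC (1'b0),
    .DSSEN    (1'b0),
    .RST      (1'b0)
);

BUFG bufg_clk0 (.I(clk0),  .O(clk0_buf));
BUFG bufg_uart (.I(clkdv), .O(uart_clk));
```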

razorfish-sl commented 11 years ago

Normally for a UART you sample at 16x the bit rate, so you don't miss any of the edges, as long as the constraints are updated as well. The only outstanding issue would be further nonces coming in whilst one is being transmitted back.
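Since 12.5MHz is not an integer multiple of 16 x 115200 = 1.8432MHz, one way to generate the 16x sampling tick is a fractional (phase-accumulator) divider. A hypothetical sketch:

```verilog
// 16x oversampling tick for 115200 baud from a 12.5MHz clock. The phase
// increment is 65536 * 1.8432MHz / 12.5MHz ~= 9665; the carry-out fires at
// the right average rate (about +0.01% error, fine for a UART).
module baud16_tick (
    input  wire uart_clk,    // 12.5MHz
    output wire sample_tick  // ~1.8432MHz average, one cycle wide
);
    reg [16:0] acc = 17'd0;
    always @(posedge uart_clk)
        acc <= {1'b0, acc[15:0]} + 17'd9665;
    assign sample_tick = acc[16];
endmodule
```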

In most bitcoin code, the new nonce was just thrown away!!!