tromp / cuckoo

a memory-bound graph-theoretic proof-of-work system

Is the solver context thread safe? #65

Closed tianchaijz closed 5 years ago

tianchaijz commented 5 years ago

Hi John:

I built mean.cu as a shared library and call it through cgo. I create one goroutine per graphics card and bind one solver context to each goroutine.

A single solver context works as expected, but multiple contexts do not: after running for a while, the solver contexts stop working, and the log shows:

findcycles edges 46 time 4 ms total 4 ms
findcycles edges 46 time 3 ms total 3 ms
findcycles edges 46 time 3 ms total 3 ms
findcycles edges 46 time 4 ms total 4 ms
findcycles edges 46 time 3 ms total 3 ms
findcycles edges 46 time 3 ms total 3 ms
findcycles edges 46 time 3 ms total 3 ms
findcycles edges 46 time 16 ms total 16 ms
findcycles edges 46 time 4 ms total 4 ms

It seems that the trimming phase is not running. Any ideas? Thanks!

tromp commented 5 years ago

I'm not sure how to interpret the output above. Are the 9 lines from 9 different contexts? Are you giving each one a different graph to work on (by varying the arguments to run_solver()) ? It's also puzzling why the trim took 0 ms in each case, and why there are always 46 edges left. Where is your code calling run_solver() ?

tianchaijz commented 5 years ago

Hi John,

Thanks for your reply,

I just created a simple function named find_solution:

 CALL_CONVENTION int find_solution(SolverCtx* ctx,
                                   char* header,
                                   int header_length,
                                   SolverSolutions *solutions)
 {
   if (ctx == NULL || !ctx->trimmer.initsuccess){
     print_log("Error initialising trimmer. Aborting.\n");
     print_log("Reason: %s\n", LAST_ERROR_REASON);
     return -__LINE__;
   }

   ctx->setheadernonce(header, header_length, 0);
   u32 nsols = ctx->solve();

   for (unsigned s = 0; s < nsols; s++) {
     print_log("Solution");
     u32* prf = &ctx->sols[s * PROOFSIZE];
     if (solutions != NULL){
       solutions->edge_bits = EDGEBITS;
       solutions->num_sols++;
       for (u32 i = 0; i < PROOFSIZE; i++)
         solutions->sols[s].proof[i] = (u64) prf[i];
     }

     int pow_rc = verify(prf, &ctx->trimmer.sipkeys);
     if (pow_rc != POW_OK) {
         return -__LINE__;
     }
   }

   return nsols;
 }

I give each context a different header and call the solver this way:

var solverSols SolverSolutions
n := C.find_solution(ctx, header, headerLen, (*C.SolverSolutions)(unsafe.Pointer(&solverSols)))
if n > 0 {
    sols := make([]uint32, 42)
    for i := 0; i < int(n); i++ {
        for j := 0; j < 42; j++ {
            sols[j] = uint32(solverSols.sols[i].proof[j])
        }
    }
} else if n < 0 {
    return fmt.Errorf("cgo return code: %d", int(n))
}
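
The Go loop above reuses a single `sols` slice, so with more than one solution only the last proof survives. A minimal sketch of copying every proof into its own slice follows. The `solution` type and the `collectProofs` helper are hypothetical names introduced for illustration, and the layout of `SolverSolutions` is assumed from the snippet above rather than taken from the repository.

```go
package main

import "fmt"

// proofSize matches the literal 42 used in the caller above.
const proofSize = 42

// solution is a hypothetical stand-in for one entry of SolverSolutions.sols;
// the real cgo struct layout is an assumption here.
type solution struct {
	proof [proofSize]uint64
}

// collectProofs copies the first n proofs into independent slices so that
// later solutions do not overwrite earlier ones.
func collectProofs(sols []solution, n int) [][]uint32 {
	if n > len(sols) {
		n = len(sols)
	}
	if n <= 0 {
		return nil
	}
	out := make([][]uint32, 0, n)
	for i := 0; i < n; i++ {
		p := make([]uint32, proofSize)
		for j, e := range sols[i].proof {
			p[j] = uint32(e) // the solver stores u32 edge indices widened to u64
		}
		out = append(out, p)
	}
	return out
}

func main() {
	sols := make([]solution, 2)
	sols[0].proof[0] = 7
	sols[1].proof[0] = 9
	proofs := collectProofs(sols, 2)
	fmt.Println(len(proofs), proofs[0][0], proofs[1][0])
}
```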

tianchaijz commented 5 years ago

The full log context:

 GPU1 finding solution
 GPU4 finding solution
 GPU2 finding solution
 GPU0 finding solution
 GPU1 finding solution
 GPU4 finding solution
 GPU2 finding solution
 GPU0 finding solution
 eted in 63 + 46 ms
 Seeding completed in 63 + 46 ms
 Seeding completed in 63 + 46 ms
 Seeding completed in 63 + 46 ms
 Seeding completed in 63 + 46 ms
    2-cycle found
    8-cycle found
   54-cycle found
 findcycles edges 63077 time 56 ms total 390 ms
   12-cycle found
    4-cycle found
   20-cycle found
    4-cycle found
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=RgAAACQPgXA=
    6-cycle found
  100-cycle found
  406-cycle found
  594-cycle found
  340-cycle found
 findcycles edges 71378 time 49 ms total 386 ms
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=RwAAACQPgXA=
   38-cycle found
   10-cycle found
   66-cycle found
   12-cycle found
 2466-cycle found
   70-cycle found
 findcycles edges 74681 time 58 ms total 383 ms
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=QwAAACQPgXA=
  232-cycle found
  306-cycle found
 findcycles edges 61895 time 53 ms total 378 ms
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=RAAAACQPgXA=
   20-cycle found
   74-cycle found
   28-cycle found
  332-cycle found
 findcycles edges 63438 time 51 ms total 379 ms
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=SAAAACQPgXA=
   38-cycle found
 findcycles edges 64257 time 74 ms total 400 ms
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=RQAAACQPgXA=
 findcycles edges 77 time 2 ms total 106 ms
 findcycles edges 77 time 2 ms total 137 ms
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=TAAAACQPgXA=
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=TgAAACQPgXA=
 findcycles edges 77 time 1 ms total 189 ms
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=SwAAACQPgXA=
 findcycles edges 77 time 1 ms total 280 ms
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=SQAAACQPgXA=
 findcycles edges 77 time 0 ms total 365 ms
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=TQAAACQPgXA=
 findcycles edges 77 time 3 ms total 432 ms
 findcycles edges 77 time 4 ms total 319 ms
 findcycles edges 46 time 4 ms total 234 ms
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=SgAAACQPgXA=
 findcycles edges 77 time 4 ms total 319 ms
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=UQAAACQPgXA=
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=VAAAACQPgXA=
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=UgAAACQPgXA=
 findcycles edges 46 time 2 ms total 2 ms
 findcycles edges 46 time 2 ms total 2 ms
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=VwAAACQPgXA=
 setheader: sKrQQF/FfgmOD8PO0OeVJSNq0ssZM7xJ/ANzAPWt9sQ=UAAAACQPgXA=

tromp commented 5 years ago

The earlier runs with edges in the 10s of thousands look fine, but the ones with fewer than 100 edges are wrong and spent 0 ms on trimming. Perhaps you can add more diagnostic output to the trimming routine to figure out why they skipped nearly all the work. For instance, SeedA should compute siphashes for 2^29 nodes, which necessarily takes a lot of time. I think that with all data encapsulated in the SolverCtx, the solver should be thread safe.

tianchaijz commented 5 years ago

Thanks John!

I'll continue debugging and will let you know when this problem is solved!

tianchaijz commented 5 years ago

It seems that the edgetrimmer *dt points to an invalid memory area.

651       int solve() {
(gdb) n
653         auto time0 = std::chrono::high_resolution_clock::now();
(gdb) n
655         trimmer.abort = false;
(gdb) n
656         u32 nedges = trimmer.trim();
(gdb) s
edgetrimmer::trim (this=0x7f3510000c00) at mean.cu:416
416       u32 trim() {
(gdb) n
417         cudaMemcpy(dt, this, sizeof(edgetrimmer), cudaMemcpyHostToDevice);
(gdb) p *dt
Cannot access memory at address 0x7f34cf400000
(gdb) p *this
$1 = {tp = {expand = 0, ntrims = 176, genA = {blocks = 4096, tpb = 256}, genB = {blocks = 4096,
      tpb = 128}, trim = {blocks = 4096, tpb = 512}, tail = {blocks = 4096, tpb = 1024}, recover = {
      blocks = 1024, tpb = 1024}}, dt = 0x7f34cf400000, sizeA = 4462739456, sizeB = 2852126720,
  indexesSize = 16384, bufferA = 0x7f331a600000, bufferB = 0x7f3424600000,
  bufferAB = 0x7f33c4600000, indexesE = 0x7f34cec00000, indexesE2 = 0x7f34cec04000, hostA = {72644,
    0 <repeats 4095 times>}, uvnodes = 0x7f34cf000000, sipkeys = {k0 = 3288269178733044074,
    k1 = 2872931202106657597, k2 = 11375450941595734051, k3 = 5158531013351406122},
  dipkeys = 0x7f34cf000200, abort = false, initsuccess = true}
(gdb) n
420         checkCudaErrors(cudaEventCreate(&startall)); checkCudaErrors(cudaEventCreate(&stopall));
(gdb) p *dt
Cannot access memory at address 0x7f34cf400000

I'll continue debugging ...

tromp commented 5 years ago

dear tianchaijz,

It seems that the edgetrimmer *dt point to invalid memory area.

651       int solve() {
(gdb) n
653         auto time0 = std::chrono::high_resolution_clock::now();
(gdb)
655         trimmer.abort = false;
(gdb)
656         u32 nedges = trimmer.trim();
(gdb) s
edgetrimmer::trim (this=0x7f94c4000c00) at mean.cu:416
416       u32 trim() {
(gdb) n
417         cudaMemcpy(dt, this, sizeof(edgetrimmer), cudaMemcpyHostToDevice);
(gdb) p sizeof(edgetrimmer)
$4 = 16536
(gdb) p *dt
Cannot access memory at address 0x7f9479800000

I'm continue debugging ...

I noticed that my cuckoo/mean.cu had a redundant cudaMemcpy, which I removed in my latest commit. I don't see how that could be responsible for the behaviour you saw, though...

regards, -John

tianchaijz commented 5 years ago

Hi John,

Still not solved, really weird. I'll try a newer NVIDIA driver ...

tianchaijz commented 5 years ago

Hi John,

This problem is solved.

A goroutine may be scheduled onto different OS threads over its lifetime, and that is what caused this problem. After I added runtime.LockOSThread(), my program has been running for hours and everything looks fine now.
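
A minimal sketch of that fix: each goroutine pins itself to its OS thread before doing any GPU work. One plausible reason this matters, not confirmed anywhere in this thread, is that the CUDA runtime tracks the current device per host thread (it is what cudaSetDevice sets), so a goroutine that migrates to another thread may end up issuing calls against a different device than the one its buffers were allocated on. The `solveFunc` type and the `runPinned` helper are hypothetical names standing in for the cgo call to find_solution, so the sketch compiles without CUDA.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// solveFunc is a hypothetical stand-in for the cgo call (C.find_solution in
// this thread). It returns the number of solutions found for one header.
type solveFunc func(device int, header []byte) int

// runPinned starts one goroutine per device queue, pins each goroutine to its
// OS thread, and returns the total number of solutions found per device.
func runPinned(queues [][][]byte, solve solveFunc) []int {
	totals := make([]int, len(queues))
	var wg sync.WaitGroup
	for dev := range queues {
		wg.Add(1)
		go func(dev int) {
			defer wg.Done()
			// Pin first, so that any per-thread GPU state set up afterwards
			// (device selection, solver context) stays with this goroutine.
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()
			for _, header := range queues[dev] {
				totals[dev] += solve(dev, header) // each goroutine writes only its own slot
			}
		}(dev)
	}
	wg.Wait()
	return totals
}

func main() {
	// A fake solver that "finds" one solution for odd-length headers.
	fake := func(device int, header []byte) int { return len(header) % 2 }
	queues := [][][]byte{
		{[]byte("a"), []byte("bc")}, // headers for GPU0
		{[]byte("d")},               // headers for GPU1
	}
	fmt.Println(runPinned(queues, fake))
}
```

In a real program the solver context would presumably also be created inside the goroutine, after the LockOSThread call, so that creation and every later solve run on the same thread.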

tromp commented 5 years ago

I still don't understand how running on different OS threads explains the faulty behaviour. But I'm happy to hear you found a solution!

tianchaijz commented 5 years ago

Thank you again, John!