[X86] atomicrmw seq_cst lowering incorrect?

hidva commented 2 years ago

$ clang --version
clang version 12.0.1 (Fedora 12.0.1-1.fc34)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

#include <atomic>

void f(std::atomic_uint64_t* p) {
    p->fetch_add(0, std::memory_order_seq_cst);
}

$ clang -O1 -mllvm -opt-bisect-limit=40 -c -o 1.o 1.cc
BISECT: running pass (1) Simplify the CFG on function (_Z1fPSt6atomicImE)
BISECT: running pass (2) SROA on function (_Z1fPSt6atomicImE)
BISECT: running pass (3) Early CSE on function (_Z1fPSt6atomicImE)
BISECT: running pass (4) Simplify the CFG on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (5) SROA on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (6) Early CSE on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (7) Infer set function attributes on module (1.cc)
BISECT: running pass (8) Interprocedural Sparse Conditional Constant Propagation on module (1.cc)
BISECT: running pass (9) Called Value Propagation on module (1.cc)
BISECT: running pass (10) Global Variable Optimizer on module (1.cc)
BISECT: running pass (11) Promote Memory to Register on function (_Z1fPSt6atomicImE)
BISECT: running pass (12) Promote Memory to Register on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (13) Dead Argument Elimination on module (1.cc)
BISECT: running pass (14) Combine redundant instructions on function (_Z1fPSt6atomicImE)
BISECT: running pass (15) Simplify the CFG on function (_Z1fPSt6atomicImE)
BISECT: running pass (16) Combine redundant instructions on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (17) Simplify the CFG on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (18) Remove unused exception handling info on SCC (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (19) Deduce function attributes on SCC (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (20) SROA on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (21) Early CSE w/ MemorySSA on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (22) Simplify the CFG on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (23) Combine redundant instructions on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (24) Simplify the CFG on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (25) Reassociate expressions on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (26) Simplify the CFG on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (27) Combine redundant instructions on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (28) SROA on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (29) MemCpy Optimization on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (30) Sparse Conditional Constant Propagation on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (31) Bit-Tracking Dead Code Elimination on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (32) Combine redundant instructions on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (33) Aggressive Dead Code Elimination on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (34) Simplify the CFG on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (35) Combine redundant instructions on function (_ZNSt13__atomic_baseImE9fetch_addEmSt12memory_order)
BISECT: running pass (36) Remove unused exception handling info on SCC (_Z1fPSt6atomicImE)
BISECT: running pass (37) Deduce function attributes on SCC (_Z1fPSt6atomicImE)
BISECT: running pass (38) SROA on function (_Z1fPSt6atomicImE)
BISECT: running pass (39) Early CSE w/ MemorySSA on function (_Z1fPSt6atomicImE)
BISECT: running pass (40) Simplify the CFG on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (41) Combine redundant instructions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (42) Simplify the CFG on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (43) Reassociate expressions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (44) Simplify the CFG on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (45) Combine redundant instructions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (46) SROA on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (47) MemCpy Optimization on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (48) Sparse Conditional Constant Propagation on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (49) Bit-Tracking Dead Code Elimination on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (50) Combine redundant instructions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (51) Aggressive Dead Code Elimination on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (52) Simplify the CFG on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (53) Combine redundant instructions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (54) Remove unused exception handling info on SCC (<<null function>>)
BISECT: NOT running pass (55) Deduce function attributes on SCC (<<null function>>)
BISECT: NOT running pass (56) Deduce function attributes in RPO on module (1.cc)
BISECT: NOT running pass (57) Global Variable Optimizer on module (1.cc)
BISECT: NOT running pass (58) Dead Global Elimination on module (1.cc)
BISECT: NOT running pass (59) Float to int on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (60) Loop Distribution on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (61) Loop Vectorization on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (62) Loop Load Elimination on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (63) Combine redundant instructions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (64) Simplify the CFG on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (65) Optimize scalar/vector ops on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (66) Combine redundant instructions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (67) Warn about non-applied transformations on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (68) Alignment from assumptions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (69) Strip Unused Function Prototypes on module (1.cc)
BISECT: NOT running pass (70) Remove redundant instructions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (71) Hoist/decompose integer division and remainder on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (72) Simplify the CFG on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (73) Merge contiguous icmps into a memcmp on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (74) Expand memcmp() to load/stores on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (75) Constant Hoisting on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (76) Partially inline calls to library functions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (77) X86 Partial Reduction on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (78) CodeGen Prepare on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (79) X86 DAG->DAG Instruction Selection on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (80) Local Dynamic TLS Access Clean-up on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (81) X86 Domain Reassignment Pass on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (82) Early Tail Duplication on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (83) Optimize machine instruction PHIs on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (84) Remove dead machine instructions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (85) Early If-Conversion on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (86) X86 cmov Conversion on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (87) Early Machine Loop Invariant Code Motion on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (88) Machine Common Subexpression Elimination on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (89) Machine code sinking on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (90) Peephole Optimizations on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (91) Remove dead machine instructions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (92) Live Range Shrink on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (93) X86 LEA Optimize on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (94) X86 Optimize Call Frame on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (95) X86 Avoid Store Forwarding Blocks on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (96) Two-Address instruction pass on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (97) Machine Instruction Scheduler on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (98) Stack Slot Coloring on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (99) Machine Copy Propagation Pass on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (100) Machine Loop Invariant Code Motion on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (101) Fixup Statepoint Caller Saved on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (102) PostRA Machine Sink on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (103) Shrink Wrapping analysis on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (104) Control Flow Optimizer on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (105) Tail Duplication on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (106) Machine Copy Propagation Pass on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (107) Post RA top-down list latency scheduler on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (108) Branch Probability Basic Block Placement on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (109) X86 Execution Dependency Fix on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (110) BreakFalseDeps on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (111) X86 Byte/Word Instruction Fixup on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (112) X86 Atom pad short functions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (113) X86 LEA Fixup on function (_Z1fPSt6atomicImE)

$ objdump -M intel -d 1.o

1.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <_Z1fPSt6atomicImE>:
   0:   0f ae f0                mfence
   3:   48 8b 07                mov    rax,QWORD PTR [rdi]
   6:   c3                      ret

Incorrect code generation:

$ clang -O1 -mllvm -opt-bisect-limit=41 -c -o 1.o 1.cc
$ objdump -M intel -d 1.o

1.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <_Z1fPSt6atomicImE>:
   0:   f0 83 4c 24 c0 00       lock or DWORD PTR [rsp-0x40],0x0   # incorrect code generation !!!
   6:   c3                      ret

hidva commented 2 years ago

another example:

#include <atomic>
void f(std::atomic_uint64_t* p) {
    p->fetch_add(0, std::memory_order_relaxed);
}

$ clang -O1 -mllvm -opt-bisect-limit=40 -c -o 2.o 2.cc
$ objdump -M intel -d 2.o
0000000000000000 <_Z1fPSt6atomicImE>:
   0:   0f ae f0                mfence
   3:   48 8b 07                mov    rax,QWORD PTR [rdi]
   6:   c3                      ret

$ clang -O1 -mllvm -opt-bisect-limit=41 -c -o 2.o 2.cc
...
BISECT: running pass (40) Simplify the CFG on function (_Z1fPSt6atomicImE)
BISECT: running pass (41) Combine redundant instructions on function (_Z1fPSt6atomicImE)
BISECT: NOT running pass (42) Simplify the CFG on function (_Z1fPSt6atomicImE)
...
$ objdump -M intel -d 2.o
0000000000000000 <_Z1fPSt6atomicImE>:
   0:   48 8b 07                mov    rax,QWORD PTR [rdi]
   3:   c3                      ret

LebedevRI commented 2 years ago

I suspect it's NOTABUG.

Could you please be more specific as to what is incorrect? x + 0 and x | 0 is the same thing, and gcc produces the former: https://godbolt.org/z/jz46GKds9

hidva commented 2 years ago

https://godbolt.org/z/jz46GKds9

I think it should output lock or dword ptr [rdi], 0, not lock or dword ptr [rsp - 64], 0

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-x86

topperc commented 2 years ago

The OR is writing to a location on the stack, not the original variable. It's just trying to create a memory ordering barrier. Pretty sure it's hitting this code in lowerAtomicArith from X86ISelLowering.cpp

  // Specialized lowering for the canonical form of an idemptotent atomicrmw.    
  // The core idea here is that since the memory location isn't actually         
  // changing, all we need is a lowering for the *ordering* impacts of the       
  // atomicrmw.  As such, we can chose a different operation and memory          
  // location to minimize impact on other code.                                  
  if (Opc == ISD::ATOMIC_LOAD_OR && isNullConstant(RHS)) {                       
    // On X86, the only ordering which actually requires an instruction is       
    // seq_cst which isn't SingleThread, everything just needs to be preserved   
    // during codegen and then dropped. Note that we expect (but don't assume),  
    // that orderings other than seq_cst and acq_rel have been canonicalized to  
    // a store or load.                                                          
    if (AN->getSuccessOrdering() == AtomicOrdering::SequentiallyConsistent &&    
        AN->getSyncScopeID() == SyncScope::System) {                             
      // Prefer a locked operation against a stack location to minimize cache    
      // traffic.  This assumes that stack locations are very likely to be       
      // accessed only by the owning thread.                                     
      SDValue NewChain = emitLockedStackOp(DAG, Subtarget, Chain, DL);           
      assert(!N->hasAnyUseOfValue(0));                                           
      // NOTE: The getUNDEF is needed to give something for the unused result 0. 
      return DAG.getNode(ISD::MERGE_VALUES, DL, N->getVTList(),                  
                         DAG.getUNDEF(VT), NewChain);                            
    }

hidva commented 2 years ago

Thanks

llvm / llvm-project

[X86] atomicrmw seq_cst lowering incorrect? #54104