a low-frequency get may got wrong item

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. run memcached, set tail_repair_time to help quick reproduce:
[21:13:01]~/prog/memcached-1.4.22 $ ./memcached -m 5000 -o tail_repair_time=10

2. run the program below(mc.go), line 39 show never print but it does.
  1 package main
  2 
  3 import (
  4         "bytes"  
  5         "flag"
  6         "fmt"
  7         "runtime"
  8         "time"
  9 
 10         "github.com/bradfitz/gomemcache/memcache"
 11 )
 12 
 13 var addr = flag.String("addr", "localhost:11211", "addr of memcached")
 14 var sleep = flag.Int("sleep", 4, "time to slee")
 15 
 16 func main() {
 17 
 18         flag.Parse()
 19 
 20         runtime.GOMAXPROCS(runtime.NumCPU())
 21 
 22         abc := memcache.New(*addr)
 23         def := memcache.New(*addr)
 24 
 25         abc.Set(&memcache.Item{
 26                 Key:   "ctw9tjk",
 27                 Value: []byte("abc"),
 28         })
 29 
 30         for _ = range time.Tick(time.Duration(*sleep) * time.Second) {
 31                 go func() {
 32                         fmt.Println("start get")
 33                         item, err := abc.Get("ctw9tjk")
 34                         if err != nil {
 35                                 fmt.Println("abc.Get", err)
 36                                 return
 37                         }
 38                         if !bytes.Equal(item.Value, []byte("abc")) {
 39                                 fmt.Printf("get ctw9tjk => {%v, %v}!!!!!!!!!!!!!!\n", item.Key, string(item.Value))
 40                                 return
 41                         }
 42                         fmt.Printf("get ctw9tjk => {%v, %v}\n", item.Key, string(item.Value))
 43                 }()
 44 
 45                 go func() {
 46                         fmt.Println("start set")
 47                         err := def.Set(&memcache.Item{
 48                                 Key:   "cqv87p7",
 49                                 Value: []byte("def"),
 50                         })
 51                         fmt.Println("def.Set", err)
 52                 }()
 53         }
 54 }

What is the expected output? What do you see instead?
line 39 show never print, but it does:
[21:38:07]~/code/go $ ./mc -sleep=13
start get
start set
get ctw9tjk => {ctw9tjk, abc}
def.Set <nil>
start get
start set
def.Set <nil>
get ctw9tjk => {ctw9tjk, abc}
start set
start get
def.Set <nil>
get ctw9tjk => {ctw9tjk, abc}
start get
start set
def.Set <nil>
get ctw9tjk => {cqv87p7, def}!!!!!!!!!!!!!!

What version of the product are you using? On what operating system?
1.4.22, Darwin Kernel Version 14.1.0 on mac.

Please provide any additional information below.
the logic here exactly hitted: 
https://github.com/memcached/memcached/blob/1.4.22/items.c#L150

Original issue reported on code.google.com by zarcardfly@gmail.com on 28 Mar 2015 at 1:59

GoogleCodeExporter commented 8 years ago

I try to understand the process, tell me if i misunderstood:
1. a "get" operation hit: 
https://github.com/memcached/memcached/blob/1.4.22/memcached.c#L2961 , the 
state is it->refcount > 1 and it->time is not updated.
2. at the same time, a "set" operation is going and hit 
https://github.com/memcached/memcached/blob/1.4.22/items.c#L150, the "search" 
item happens to be the "get" item.
3. the "search" item's refcount force to be 1 and free by slabs_free!
4. the "set" operation then call slabs_alloc and get the same item.
5. the "set" operation update(overlayed) the item.
6. the "get" operation response the overlayed item to requester.

Original comment by zarcardfly@gmail.com on 28 Mar 2015 at 2:28

GoogleCodeExporter commented 8 years ago

That is the bug with tail_repair_time, and is why we turned it off by default 
in newer versions. It's only supposed to exist to work around a legitimate 
refcount leak, which we haven't had in a while.

Do you have this bug outside of tail_repair_time?

Original comment by dorma...@rydia.net on 28 Mar 2015 at 5:36

GoogleCodeExporter commented 8 years ago

No, thanks for your reply.

Original comment by zarcardfly@gmail.com on 29 Mar 2015 at 3:11

GoogleCodeExporter commented 8 years ago

ok, closing this.

Original comment by dorma...@rydia.net on 29 Mar 2015 at 4:44

Changed state: Invalid

GoogleCodeExporter commented 8 years ago

The help tips say that tail_repair_time is default to be 3 hours, it may need 
to be fixed, and maybe add a warning.

[10:43:34]~/prog/memcached-1.4.22 $ ./memcached -h | grep -A 2 
'tail_repair_time'
              - tail_repair_time: Time in seconds that indicates how long to wait before
                forcefully taking over the LRU tail item whose refcount has leaked.
                The default is 3 hours.

Original comment by zarcardfly@gmail.com on 30 Mar 2015 at 2:49

sianla / memcached

a low-frequency get may got wrong item #401