pmem / libpmemobj-cpp

C++ bindings & containers for libpmemobj
https://pmem.io

concurrent_hash_map example hangs system #962

nisargshah95 closed 4 years ago

nisargshah95 commented 4 years ago

QUESTION: concurrent_hash_map example hangs

Details

I created a pmem pool with the following command, where pmem is mounted on /mnt/ext4-pmem1 in 100% app-direct mode:

pmempool create obj --layout="concurrent_hash_map" --size 1G --mode 0666 /mnt/ext4-pmem1/myfile

When I then try to run the concurrent_hash_map example at https://github.com/pmem/libpmemobj-cpp/blob/master/examples/concurrent_hash_map/concurrent_hash_map.cpp, it simply hangs and I have to reboot the system to get it working again. Even when I run it under gdb, I cannot see the line where it hangs. Any ideas what could cause this?

I'm using PMDK 1.9 with latest master of libpmemobj-cpp on Fedora 30.

igchor commented 4 years ago

Are you sure that /mnt/ext4-pmem1 is mounted with DAX? If it's not, pmemobj will use msync to persist the data, which takes a long time. You can set the PMEM_IS_PMEM_FORCE=1 environment variable to skip the msync calls, like this:

PMEM_IS_PMEM_FORCE=1 ./examples/example-concurrent_hash_map /mnt/ext4-pmem1/myfile

nisargshah95 commented 4 years ago

Yes, it's mounted with DAX. I mount it with mount -o dax /dev/pmem1 /mnt/ext4-pmem1
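Passing -o dax to mount does not by itself show that the option is in effect, so it can help to read the options back from the kernel's mount table. The following is a minimal sketch, not part of the original thread: the helper names has_dax_option and mount_options are hypothetical, and it assumes an active DAX mount shows up in /proc/mounts as either dax or dax=always.

```shell
#!/bin/sh
# Succeeds if a comma-separated mount option string contains "dax".
# Assumption: an active DAX mount is listed as "dax" or "dax=always".
has_dax_option() {
    case ",$1," in
        *,dax,*|*,dax=always,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints the option field recorded in /proc/mounts for a mount point.
mount_options() {
    awk -v mp="$1" '$2 == mp { print $4 }' /proc/mounts
}

# Hypothetical usage with the mount point from this thread:
# has_dax_option "$(mount_options /mnt/ext4-pmem1)" && echo "DAX is active"
```

If the check fails, the file system is most likely running without DAX, which is the case where the msync path described above would apply.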

I have been using hashmap_tx from an older version of PMDK (1.4 or 1.5, I think), and it has worked so far without any issues. I will check whether any other example code also shows this behavior with PMDK 1.9.

pbalcer commented 4 years ago

There was a bug a few kernel versions back that would cause a deadlock in the page fault handler logic. Try upgrading your OS to see if that helps. Typically user-space applications, like libpmemobj-cpp, should not be able to hang a system.

nisargshah95 commented 4 years ago

It doesn't hang the whole system, just the pmem partition. Any operations on it (ls, etc.) stop working until I reboot the system. I'll try your advice and see if it works.

pbalcer commented 4 years ago

Still, that means that the file system/kernel is in a softlock - this indicates either a kernel or, in rare scenarios, a hardware problem.

igchor commented 4 years ago

Can you kill the process using the kill command? E.g.

kill `pidof ./examples/example-concurrent_hash_map`

If so, you could try running the example under perf, then kill the process and see if there are any anomalies.

nisargshah95 commented 4 years ago

I tried killing it, but it doesn't help. I once tried running it under GDB but couldn't get a stack trace because the program just hung, and killing it didn't do anything. I currently don't have access to the Optane machine, but I can try running it under perf when I get access again.

nisargshah95 commented 4 years ago

So I tried running the example with perf but did not see anything different. I waited for about 5 minutes and then tried to kill the program. I think the original process was killed, but I could still see a process named "[concurrent_hash_map]" in the ps output, and I couldn't kill it with the kill command. Even with the program hung, ls /mnt/ext4-pmem1/myfile was working. As soon as I tried removing the file using rm -rf /mnt/ext4-pmem1/myfile, the rm command also hung, and after that even the ls command started hanging.
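A process that ignores kill and then drags rm and ls down with it may be stuck in uninterruptible sleep (state D in ps), which would fit the kernel-side explanation given earlier. The sketch below, not taken from the thread, filters ps output for such tasks; the helper name filter_uninterruptible is hypothetical.

```shell
#!/bin/sh
# Reads "PID STAT COMMAND" lines on stdin and prints only the tasks whose
# state starts with D (uninterruptible sleep).
filter_uninterruptible() {
    awk '$2 ~ /^D/ { print }'
}

# Hypothetical usage:
# ps -eo pid=,stat=,comm= | filter_uninterruptible
# If the kernel's hung task detector is enabled, dmesg may also have a report:
# dmesg | grep -i "blocked for more than"
```

For any PID this prints, reading /proc/PID/stack as root may show where the task is blocked in the kernel, assuming the kernel was built with stack trace support.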

EDIT: I think it works now after upgrading kernel from 5.0.9 to 5.6.13!
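For anyone comparing their own setup against the versions mentioned here, the running kernel can be checked with a small version comparison. This is only a sketch: the helper name version_at_least is hypothetical, it relies on sort -V (available in GNU coreutils), and the thread does not establish which release between 5.0.9 and 5.6.13 contains the fix, so 5.6.13 is used only as the version reported to work.

```shell
#!/bin/sh
# Succeeds if version $1 is greater than or equal to version $2.
# Relies on the version sort (-V) provided by GNU sort.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical usage: strip the distribution suffix from uname -r, then
# compare with the kernel version reported as working in this thread.
# version_at_least "$(uname -r | cut -d- -f1)" 5.6.13 || echo "kernel is older than 5.6.13"
```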

lukaszstolarczuk commented 4 years ago

Thanks for the update and glad it finally worked out.