jexp / store-utils

Utilities to compact, copy, fix, analyse Neo4j stores
Apache License 2.0
94 stars 34 forks source link

copy-store hangs when jar is executed #44

Open eduards-v opened 5 years ago

eduards-v commented 5 years ago

Hi,

store-util script hangs when jar is executed. store-utils release 3.4.5 Neo4j version 3.4.10 graph.db size 246 GB

[neo4j@ieatrcxb6435 store-util-3.4.5]$ ./copy-store.sh enterprise ../neo4j_data/databases/graph.db ../backup/neo4j_restore/fixed.db
To use your existing Neo4j 3.3.x installation set NEO4J_HOME to your Neo4j directory. Currently set to: /../../neo4j/
Usage: copy-store.sh [community|enterprise] source.db target.db [RELS,TO,SKIP] [props,to,skip] [Labels,To,Skip] [Labels,To,Delete,Nodes]

Using: Heap 18G Pagecache 25G Edition 'enterprise' from '/.../neo4j_data/databases/graph.db' to '/.../.../.../neo4j_restore/fixed.db' skipping labels: '', removing nodes with labels: '' rels: '' props '' Keeping Node Ids:

Please note that you will need this memory (25G + 25G + 18G) as it opens 2 databases one for reading and one for writing.

Java process starting....
Copying from /.../neo4j_data/databases/graph.db to /.../.../.../neo4j_restore/fixed.db ingoring rel-types [] ignoring properties [] ignoring labels [] removing nodes with labels [] keep node ids true

I've experimented with different memory and cache params, no help.

Jar creates fixed.db directory and store_lock file, but nothing happens...

Is there any logs stored to monitor process? Could process be blocked somewhere by OS? Running on Redhat Linux.

Ran on different deployments, worked as expected.

Kind Regards, Eduards

jexp commented 5 years ago

Did you check IO and CPU load?

Could you create a threaddump with jstack <pid> or kill -3 <pid>

jexp commented 5 years ago

How much heap + page-cache did you have configured?

eduards-v commented 5 years ago
2019-01-25 23:02:48
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.172-b31 mixed mode):

"Service Thread" #21 daemon prio=9 os_prio=0 tid=0x00007f2059960800 nid=0x281c runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C1 CompilerThread14" #20 daemon prio=9 os_prio=0 tid=0x00007f205995d800 nid=0x281b waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C1 CompilerThread13" #19 daemon prio=9 os_prio=0 tid=0x00007f205995b000 nid=0x281a waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C1 CompilerThread12" #18 daemon prio=9 os_prio=0 tid=0x00007f2059959000 nid=0x2819 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C1 CompilerThread11" #17 daemon prio=9 os_prio=0 tid=0x00007f2059957000 nid=0x2818 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C1 CompilerThread10" #16 daemon prio=9 os_prio=0 tid=0x00007f2059955000 nid=0x2817 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread9" #15 daemon prio=9 os_prio=0 tid=0x00007f2059952000 nid=0x2816 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread8" #14 daemon prio=9 os_prio=0 tid=0x00007f205994f800 nid=0x2815 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread7" #13 daemon prio=9 os_prio=0 tid=0x00007f205994d800 nid=0x2814 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread6" #12 daemon prio=9 os_prio=0 tid=0x00007f205994b800 nid=0x2813 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread5" #11 daemon prio=9 os_prio=0 tid=0x00007f2059949000 nid=0x2812 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread4" #10 daemon prio=9 os_prio=0 tid=0x00007f2059947000 nid=0x2811 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread3" #9 daemon prio=9 os_prio=0 tid=0x00007f205993c800 nid=0x2810 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread2" #8 daemon prio=9 os_prio=0 tid=0x00007f205993a800 nid=0x280f waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" #7 daemon prio=9 os_prio=0 tid=0x00007f2059938000 nid=0x280e waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" #6 daemon prio=9 os_prio=0 tid=0x00007f2059936000 nid=0x280d waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" #5 daemon prio=9 os_prio=0 tid=0x00007f2059933800 nid=0x280c waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Surrogate Locker Thread (Concurrent GC)" #4 daemon prio=9 os_prio=0 tid=0x00007f2059932000 nid=0x280b waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f20598ff000 nid=0x280a in Object.wait() [0x00007f1f10272000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000007a70ebad0> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
        - locked <0x00000007a70ebad0> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:216)

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f20598fc800 nid=0x2809 in Object.wait() [0x00007f1f10373000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000007a70f1bc8> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
        - locked <0x00000007a70f1bc8> (a java.lang.ref.Reference$Lock)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

"main" #1 prio=5 os_prio=0 tid=0x00007f205800d800 nid=0x27c1 runnable [0x00007f20600df000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:94)
        at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1120)
        at java.nio.channels.FileChannel.tryLock(FileChannel.java:1155)
        at org.neo4j.io.fs.StoreFileChannel.tryLock(StoreFileChannel.java:142)
        at org.neo4j.kernel.internal.locker.StoreLocker.checkLock(StoreLocker.java:89)
        at org.neo4j.kernel.internal.locker.GlobalStoreLocker.checkLock(GlobalStoreLocker.java:60)
        at org.neo4j.unsafe.batchinsert.internal.BatchInserterImpl.tryLockStore(BatchInserterImpl.java:327)
        at org.neo4j.unsafe.batchinsert.internal.BatchInserterImpl.<init>(BatchInserterImpl.java:240)
        at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:83)
        at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:63)
        at org.neo4j.tool.StoreCopy.copyStore(StoreCopy.java:75)
        at org.neo4j.tool.StoreCopy.main(StoreCopy.java:48)

"VM Thread" os_prio=0 tid=0x00007f20598f3000 nid=0x2808 runnable

"Gang worker#0 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058025800 nid=0x27c6 runnable

"Gang worker#1 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058027000 nid=0x27c7 runnable

"Gang worker#2 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058029000 nid=0x27c8 runnable

"Gang worker#3 (Parallel GC Threads)" os_prio=0 tid=0x00007f205802b000 nid=0x27c9 runnable

"Gang worker#4 (Parallel GC Threads)" os_prio=0 tid=0x00007f205802c800 nid=0x27ca runnable

"Gang worker#5 (Parallel GC Threads)" os_prio=0 tid=0x00007f205802e800 nid=0x27cb runnable

"Gang worker#6 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058030800 nid=0x27cc runnable

"Gang worker#7 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058032000 nid=0x27cd runnable

"Gang worker#8 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058034000 nid=0x27ce runnable

"Gang worker#9 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058036000 nid=0x27cf runnable

"Gang worker#10 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058037800 nid=0x27d0 runnable

"Gang worker#11 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058039800 nid=0x27d1 runnable

"Gang worker#12 (Parallel GC Threads)" os_prio=0 tid=0x00007f205803b000 nid=0x27d2 runnable

"Gang worker#13 (Parallel GC Threads)" os_prio=0 tid=0x00007f205803d000 nid=0x27d3 runnable

"Gang worker#14 (Parallel GC Threads)" os_prio=0 tid=0x00007f205803f000 nid=0x27d4 runnable

"Gang worker#15 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058040800 nid=0x27d5 runnable

"Gang worker#16 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058042800 nid=0x27d6 runnable

"Gang worker#17 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058044800 nid=0x27d7 runnable

"Gang worker#18 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058046000 nid=0x27d8 runnable

"Gang worker#19 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058048000 nid=0x27d9 runnable

"Gang worker#20 (Parallel GC Threads)" os_prio=0 tid=0x00007f205804a000 nid=0x27da runnable

"Gang worker#21 (Parallel GC Threads)" os_prio=0 tid=0x00007f205804b800 nid=0x27db runnable

"Gang worker#22 (Parallel GC Threads)" os_prio=0 tid=0x00007f205804d800 nid=0x27dd runnable

"Gang worker#23 (Parallel GC Threads)" os_prio=0 tid=0x00007f205804f000 nid=0x27de runnable

"Gang worker#24 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058051000 nid=0x27df runnable

"Gang worker#25 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058053000 nid=0x27e0 runnable

"Gang worker#26 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058054800 nid=0x27e1 runnable

"Gang worker#27 (Parallel GC Threads)" os_prio=0 tid=0x00007f2058056800 nid=0x27e2 runnable

"G1 Main Concurrent Mark GC Thread" os_prio=0 tid=0x00007f205809e800 nid=0x2800 runnable

"Gang worker#0 (G1 Parallel Marking Threads)" os_prio=0 tid=0x00007f20580a0000 nid=0x2801 runnable

"Gang worker#1 (G1 Parallel Marking Threads)" os_prio=0 tid=0x00007f20580a2000 nid=0x2802 runnable

"Gang worker#2 (G1 Parallel Marking Threads)" os_prio=0 tid=0x00007f20580a4000 nid=0x2803 runnable

"Gang worker#3 (G1 Parallel Marking Threads)" os_prio=0 tid=0x00007f20580a5800 nid=0x2804 runnable

"Gang worker#4 (G1 Parallel Marking Threads)" os_prio=0 tid=0x00007f20580a7800 nid=0x2805 runnable

"Gang worker#5 (G1 Parallel Marking Threads)" os_prio=0 tid=0x00007f20580a9800 nid=0x2806 runnable

"Gang worker#6 (G1 Parallel Marking Threads)" os_prio=0 tid=0x00007f20580ab000 nid=0x2807 runnable

"G1 Concurrent Refinement Thread#0" os_prio=0 tid=0x00007f205808e000 nid=0x27ff runnable

"G1 Concurrent Refinement Thread#1" os_prio=0 tid=0x00007f205808c000 nid=0x27fe runnable

"G1 Concurrent Refinement Thread#2" os_prio=0 tid=0x00007f205808a800 nid=0x27fd runnable

"G1 Concurrent Refinement Thread#3" os_prio=0 tid=0x00007f2058088800 nid=0x27fc runnable

"G1 Concurrent Refinement Thread#4" os_prio=0 tid=0x00007f2058086800 nid=0x27fb runnable

"G1 Concurrent Refinement Thread#5" os_prio=0 tid=0x00007f2058085000 nid=0x27fa runnable

"G1 Concurrent Refinement Thread#6" os_prio=0 tid=0x00007f2058083000 nid=0x27f9 runnable

"G1 Concurrent Refinement Thread#7" os_prio=0 tid=0x00007f2058081000 nid=0x27f8 runnable

"G1 Concurrent Refinement Thread#8" os_prio=0 tid=0x00007f205807f000 nid=0x27f7 runnable

"G1 Concurrent Refinement Thread#9" os_prio=0 tid=0x00007f205807d800 nid=0x27f6 runnable

"G1 Concurrent Refinement Thread#10" os_prio=0 tid=0x00007f205807b800 nid=0x27f5 runnable

"G1 Concurrent Refinement Thread#11" os_prio=0 tid=0x00007f2058079800 nid=0x27f4 runnable

"G1 Concurrent Refinement Thread#12" os_prio=0 tid=0x00007f2058078000 nid=0x27f3 runnable

"G1 Concurrent Refinement Thread#13" os_prio=0 tid=0x00007f2058076000 nid=0x27f2 runnable

"G1 Concurrent Refinement Thread#14" os_prio=0 tid=0x00007f2058074000 nid=0x27f1 runnable

"G1 Concurrent Refinement Thread#15" os_prio=0 tid=0x00007f2058072800 nid=0x27f0 runnable

"G1 Concurrent Refinement Thread#16" os_prio=0 tid=0x00007f2058070800 nid=0x27ef runnable

"G1 Concurrent Refinement Thread#17" os_prio=0 tid=0x00007f205806e800 nid=0x27ee runnable

"G1 Concurrent Refinement Thread#18" os_prio=0 tid=0x00007f205806d000 nid=0x27ed runnable

"G1 Concurrent Refinement Thread#19" os_prio=0 tid=0x00007f205806b000 nid=0x27ec runnable

"G1 Concurrent Refinement Thread#20" os_prio=0 tid=0x00007f2058069000 nid=0x27eb runnable

"G1 Concurrent Refinement Thread#21" os_prio=0 tid=0x00007f2058067800 nid=0x27ea runnable

"G1 Concurrent Refinement Thread#22" os_prio=0 tid=0x00007f2058065800 nid=0x27e9 runnable

"G1 Concurrent Refinement Thread#23" os_prio=0 tid=0x00007f2058063800 nid=0x27e8 runnable

"G1 Concurrent Refinement Thread#24" os_prio=0 tid=0x00007f2058061800 nid=0x27e7 runnable

"G1 Concurrent Refinement Thread#25" os_prio=0 tid=0x00007f2058060000 nid=0x27e6 runnable

"G1 Concurrent Refinement Thread#26" os_prio=0 tid=0x00007f205805e000 nid=0x27e5 runnable

"G1 Concurrent Refinement Thread#27" os_prio=0 tid=0x00007f205805c000 nid=0x27e4 runnable

"G1 Concurrent Refinement Thread#28" os_prio=0 tid=0x00007f205805a800 nid=0x27e3 runnable

"VM Periodic Task Thread" os_prio=0 tid=0x00007f2059964000 nid=0x281d waiting on condition

JNI global references: 1124

Heap
 garbage-first heap   total 18874368K, used 464022K [0x0000000340000000, 0x0000000340804800, 0x00000007c0000000)
  region size 8192K, 58 young (475136K), 14 survivors (114688K)
 Metaspace       used 31084K, capacity 32711K, committed 32896K, reserved 1077248K
  class space    used 4351K, capacity 4685K, committed 4736K, reserved 1048576K

Had to restart process now to get this dump due to screen session buffer absorbed most of the dump from previous run. Would this be enough or I leave it to hang for longer?

I used 18G heap 25G cache in last ran. Normally we give Neo4j 48G heap and 140G cache. Initially I started script by giving 24G heap, 70G cache and 70G cache_src since util opens 2 databases, then subsequently reduced those values in consecutive runs.

System has 250G of ram and only about 60G in use while store util runs...

jexp commented 5 years ago

Hmm, odd, it seems to lock on store.lock.

Are you sure the store/server was shut down before? If you're sure that it's a backup or a safe copy, you can remove the store_lock file.

And then try again.

eduards-v commented 5 years ago

Removing store_lock from graph.db doesn't help.

eduards-v commented 5 years ago

Hi,

We found an issue, for some reason on our deployments, store-util is not able to write to a database if nfs is used to store compacted database. It works on a local mount.

I guess issue can be closed. Thank you.

zj1429630644 commented 7 months ago

I also encountered the same problem. May I ask what happened at that time?

jexp commented 7 months ago

Store copy is a part of neo4j-admin since version 4.0