Closed. jti-lanl closed this issue 8 years ago.
Go ahead :)
From: Jeff Inman
Sent: Tuesday, March 29, 2016 1:54 PM
To: mar-file-system/marfs
Reply To: mar-file-system/marfs
Cc: David Bonnie
Subject: [mar-file-system/marfs] Pftool racing / curl / scality (#119)
@thewacokid (https://github.com/thewacokid) found an easily reproducible problem in pftool, where it was hanging. All tasks were involved in some kind of memory operation, including libcurl.
[waiting for Dave's permission, to quote his email, here.]
Here's Dave's email:
So I managed to generate a “bad” workload on my first try without meaning to.
No debugging symbols here, but it’s a snapshot of pftool before I kill it off (each from a different rank on the first node, they all look pretty similar):
#1  0x00002ac20d278a38 in opal_memory_ptmalloc2_int_free (av=0x2ac21d600020, mem=) at malloc.c:4453
#2  0x00002ac20d278d03 in opal_memory_ptmalloc2_free (mem=0x2ac21d6167c0) at malloc.c:3511
#3  0x00002ac20c9dcd2a in Curl_close () from /usr/lib64/libcurl.so.4
#4  0x000000000042fa64 in stream_sync ()
#5  0x0000000000430b17 in marfs_release ()
#6  0x000000000041eed0 in MARFS_Path::close (this=0xe25180) at Path.h:2201
#7  0x000000000041d0e4 in copy_file (src_file=0x7ffc60a418b0, dest_file=0x7ffc60a407e0, blocksize=, rank= , synbuf= ) at pfutils.cpp:1020
#8  0x0000000000422881 in worker_copylist (rank=11, sending_rank=, base_path=0x7ffc60a44b00 "2015", dest_node=0x7ffc60a43a30, o=...) at pftool.cpp:2524
#9  0x0000000000426519 in worker (rank=11, o=...) at pftool.cpp:1125
#10 0x0000000000426b59 in main (argc=21, argv=0x7ffc60a48e28) at pftool.cpp:432

#0  0x00002ab42c2b1a30 in malloc_consolidate (av=0x2ab42c547120) at malloc.c:4567
#1  0x00002ab42c2b4239 in opal_memory_ptmalloc2_int_malloc (av=0x2ab42c547120, bytes=) at malloc.c:4016
#2  0x00002ab42c2b518f in opal_memory_ptmalloc2_int_memalign (av=0x2ab42c547120, alignment=32, bytes=) at malloc.c:4876
#3  0x00002ab42c2b5da3 in opal_memory_ptmalloc2_memalign (alignment=32, bytes=2229) at malloc.c:3642
#4  0x000000000041d178 in copy_file (src_file=0x7ffc6a951500, dest_file=0x7ffc6a950430, blocksize=2229, rank=, synbuf= ) at pfutils.cpp:665
#5  0x0000000000422881 in worker_copylist (rank=5, sending_rank=, base_path=0x7ffc6a954750 "2015", dest_node=0x7ffc6a953680, o=...) at pftool.cpp:2524
#6  0x0000000000426519 in worker (rank=5, o=...) at pftool.cpp:1125
#7  0x0000000000426b59 in main (argc=21, argv=0x7ffc6a958a78) at pftool.cpp:432

#1  0x00002b01da686a38 in opal_memory_ptmalloc2_int_free (av=0x2b01da91b120, mem=) at malloc.c:4453
#2  0x00002b01da686d03 in opal_memory_ptmalloc2_free (mem=0x1879880) at malloc.c:3511
#3  0x00002b01d9deacee in Curl_close () from /usr/lib64/libcurl.so.4
#4  0x000000000042fa64 in stream_sync ()
#5  0x0000000000430b17 in marfs_release ()
#6  0x000000000041eed0 in MARFS_Path::close (this=0x1875180) at Path.h:2201
#7  0x000000000041d0e4 in copy_file (src_file=0x7ffe829a89e0, dest_file=0x7ffe829a7910, blocksize=, rank= , synbuf= ) at pfutils.cpp:1020
#8  0x0000000000422881 in worker_copylist (rank=8, sending_rank=, base_path=0x7ffe829abc30 "2015", dest_node=0x7ffe829aab60, o=...) at pftool.cpp:2524
#9  0x0000000000426519 in worker (rank=8, o=...) at pftool.cpp:1125
#10 0x0000000000426b59 in main (argc=21, argv=0x7ffe829aff58) at pftool.cpp:432

#1  0x00002ac20d278a38 in opal_memory_ptmalloc2_int_free (av=0x2ac21d600020, mem=) at malloc.c:4453
#2  0x00002ac20d278d03 in opal_memory_ptmalloc2_free (mem=0x2ac21d6167c0) at malloc.c:3511
#3  0x00002ac20c9dcd2a in Curl_close () from /usr/lib64/libcurl.so.4
#4  0x000000000042fa64 in stream_sync ()
#5  0x0000000000430b17 in marfs_release ()
#6  0x000000000041eed0 in MARFS_Path::close (this=0xe25180) at Path.h:2201
#7  0x000000000041d0e4 in copy_file (src_file=0x7ffc60a418b0, dest_file=0x7ffc60a407e0, blocksize=, rank= , synbuf= ) at pfutils.cpp:1020
#8  0x0000000000422881 in worker_copylist (rank=11, sending_rank=, base_path=0x7ffc60a44b00 "2015", dest_node=0x7ffc60a43a30, o=...) at pftool.cpp:2524
#9  0x0000000000426519 in worker (rank=11, o=...) at pftool.cpp:1125
#10 0x0000000000426b59 in main (argc=21, argv=0x7ffc60a48e28) at pftool.cpp:432

I’ll keep digging, but my primary focus is figuring out why “-n” isn’t working.
Clues:
After cleaning up some memory leaks in libaws4c, and tweaking the libaws4c calls in stream_open(), I can't reproduce the problem anymore.
TBD: There is still one valgrind nit to pick.
Moving the valgrind nit to a new issue, so I can close this.
So, I know this is closed, but just as confirmation (since this bailed out improperly before, and restart didn't handle it well even when forced) - a 100K file copy:
[dbonnie@fta01 dbonnie]$ pfcp -Rvn /lustre/scratch/dbonnie/ /campaign/admins/
… TRUNCATED OUTPUT …
INFO FOOTER ======================== NONFATAL ERRORS = 0 ================================
INFO FOOTER =================================================================================
INFO FOOTER Total Files/Links Examined: 100986
INFO FOOTER Total Dirs Examined: 3794
INFO FOOTER Total Buffers Written: 100989
INFO FOOTER Total Bytes Copied: 7462867207
INFO FOOTER Total Megabytes Copied: 7117
INFO FOOTER Data Rate: 10 MB/second
INFO FOOTER Elapsed Time: 663 seconds
Launched /opt/campaign/pftool/installed/bin/pfcp from host fta04.localdomain at: Thu Mar 31 10:13:33 MDT 2016
Job finished at: Thu Mar 31 10:24:38 MDT 2016
[dbonnie@fta01 dbonnie]$ pfcp -Rn /lustre/scratch/dbonnie/ /campaign/admins/
"/lustre/scratch/dbonnie" pfcp -Rn /lustre/scratch/dbonnie/ /campaign/admins/
Debugging: dest_path '/campaign/admins' -> dest_node '/campaign/admins/dbonnie'
Debugging: Path subclass is 'MARFS_Path'
Debugging: created directory '/campaign/admins/dbonnie'
manager: creating temp_path /campaign
INFO HEADER ======================== dbonnie5031163132016fta04.localdomain ============================
INFO HEADER Starting Path: /lustre/scratch/dbonnie
INFO FOOTER ======================== NONFATAL ERRORS = 0 ================================
INFO FOOTER =================================================================================
INFO FOOTER Total Files/Links Examined: 100986
INFO FOOTER Total Dirs Examined: 3794
INFO FOOTER Total Buffers Written: 0
INFO FOOTER Total Bytes Copied: 0
INFO FOOTER Elapsed Time: 6 seconds
Launched /opt/campaign/pftool/installed/bin/pfcp from host fta04.localdomain at: Thu Mar 31 10:31:50 MDT 2016
Job finished at: Thu Mar 31 10:31:59 MDT 2016
Yay.
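For anyone cross-checking the footer of the first (copying) run above, here is a rough sketch of how the derived figures seem to relate to the raw ones. It assumes "Megabytes" means 2**20 bytes and that the values are truncated to integers; that is inferred from the reported numbers only, not taken from the pftool source, and the variable names are made up for illustration.

```python
# Figures copied from the INFO FOOTER of the first (copying) run.
total_bytes = 7462867207
elapsed_seconds = 663

# Assumption (not confirmed against pftool): 1 MB = 2**20 bytes,
# and derived values are truncated rather than rounded.
BYTES_PER_MB = 2 ** 20

megabytes = total_bytes // BYTES_PER_MB
rate_mb_per_s = megabytes // elapsed_seconds

print(f"Total Megabytes Copied: {megabytes}")
print(f"Data Rate: {rate_mb_per_s} MB/second")
```

Under those assumptions the sketch gives 7117 MB and 10 MB/second, which agrees with the footer. The second run with -n reports 0 buffers written and 0 bytes copied, so there is nothing to derive there.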