mar-file-system / marfs

MarFS provides a scalable near-POSIX file system by using one or more POSIX file systems as a scalable metadata component and one or more data stores (object, file, etc) as a scalable data component.

Pftool racing / curl / scality #119

Closed jti-lanl closed 8 years ago

jti-lanl commented 8 years ago

@thewacokid found an easily-reproducible problem in pftool, where it was hanging. All tasks were involved in some kind of memory operation, including libcurl.

[waiting for Dave's permission, to quote his email, here.]

thewacokid commented 8 years ago

Go ahead :)

(Sent by email, in reply to Jeff Inman's notification of Tuesday, March 29, 2016 1:54 PM.)


jti-lanl commented 8 years ago

Here's Dave's email:

So I managed to generate a “bad” workload on my first try without meaning to.

No debugging symbols here, but it’s a snapshot of pftool before I kill it off (each from a different rank on the first node, they all look pretty similar):

#1 0x00002ac20d278a38 in opal_memory_ptmalloc2_int_free (av=0x2ac21d600020, mem=) at malloc.c:4453
#2 0x00002ac20d278d03 in opal_memory_ptmalloc2_free (mem=0x2ac21d6167c0) at malloc.c:3511
#3 0x00002ac20c9dcd2a in Curl_close () from /usr/lib64/libcurl.so.4
#4 0x000000000042fa64 in stream_sync ()
#5 0x0000000000430b17 in marfs_release ()
#6 0x000000000041eed0 in MARFS_Path::close (this=0xe25180) at Path.h:2201
#7 0x000000000041d0e4 in copy_file (src_file=0x7ffc60a418b0, dest_file=0x7ffc60a407e0, blocksize=, rank=, synbuf=) at pfutils.cpp:1020
#8 0x0000000000422881 in worker_copylist (rank=11, sending_rank=, base_path=0x7ffc60a44b00 "2015", dest_node=0x7ffc60a43a30, o=...) at pftool.cpp:2524
#9 0x0000000000426519 in worker (rank=11, o=...) at pftool.cpp:1125
#10 0x0000000000426b59 in main (argc=21, argv=0x7ffc60a48e28) at pftool.cpp:432

#0 0x00002ab42c2b1a30 in malloc_consolidate (av=0x2ab42c547120) at malloc.c:4567
#1 0x00002ab42c2b4239 in opal_memory_ptmalloc2_int_malloc (av=0x2ab42c547120, bytes=) at malloc.c:4016
#2 0x00002ab42c2b518f in opal_memory_ptmalloc2_int_memalign (av=0x2ab42c547120, alignment=32, bytes=) at malloc.c:4876
#3 0x00002ab42c2b5da3 in opal_memory_ptmalloc2_memalign (alignment=32, bytes=2229) at malloc.c:3642
#4 0x000000000041d178 in copy_file (src_file=0x7ffc6a951500, dest_file=0x7ffc6a950430, blocksize=2229, rank=, synbuf=) at pfutils.cpp:665
#5 0x0000000000422881 in worker_copylist (rank=5, sending_rank=, base_path=0x7ffc6a954750 "2015", dest_node=0x7ffc6a953680, o=...) at pftool.cpp:2524
#6 0x0000000000426519 in worker (rank=5, o=...) at pftool.cpp:1125
#7 0x0000000000426b59 in main (argc=21, argv=0x7ffc6a958a78) at pftool.cpp:432

#1 0x00002b01da686a38 in opal_memory_ptmalloc2_int_free (av=0x2b01da91b120, mem=) at malloc.c:4453
#2 0x00002b01da686d03 in opal_memory_ptmalloc2_free (mem=0x1879880) at malloc.c:3511
#3 0x00002b01d9deacee in Curl_close () from /usr/lib64/libcurl.so.4
#4 0x000000000042fa64 in stream_sync ()
#5 0x0000000000430b17 in marfs_release ()
#6 0x000000000041eed0 in MARFS_Path::close (this=0x1875180) at Path.h:2201
#7 0x000000000041d0e4 in copy_file (src_file=0x7ffe829a89e0, dest_file=0x7ffe829a7910, blocksize=, rank=, synbuf=) at pfutils.cpp:1020
#8 0x0000000000422881 in worker_copylist (rank=8, sending_rank=, base_path=0x7ffe829abc30 "2015", dest_node=0x7ffe829aab60, o=...) at pftool.cpp:2524
#9 0x0000000000426519 in worker (rank=8, o=...) at pftool.cpp:1125
#10 0x0000000000426b59 in main (argc=21, argv=0x7ffe829aff58) at pftool.cpp:432

#1 0x00002ac20d278a38 in opal_memory_ptmalloc2_int_free (av=0x2ac21d600020, mem=) at malloc.c:4453
#2 0x00002ac20d278d03 in opal_memory_ptmalloc2_free (mem=0x2ac21d6167c0) at malloc.c:3511
#3 0x00002ac20c9dcd2a in Curl_close () from /usr/lib64/libcurl.so.4
#4 0x000000000042fa64 in stream_sync ()
#5 0x0000000000430b17 in marfs_release ()
#6 0x000000000041eed0 in MARFS_Path::close (this=0xe25180) at Path.h:2201
#7 0x000000000041d0e4 in copy_file (src_file=0x7ffc60a418b0, dest_file=0x7ffc60a407e0, blocksize=, rank=, synbuf=) at pfutils.cpp:1020
#8 0x0000000000422881 in worker_copylist (rank=11, sending_rank=, base_path=0x7ffc60a44b00 "2015", dest_node=0x7ffc60a43a30, o=...) at pftool.cpp:2524
#9 0x0000000000426519 in worker (rank=11, o=...) at pftool.cpp:1125
#10 0x0000000000426b59 in main (argc=21, argv=0x7ffc60a48e28) at pftool.cpp:432

I’ll keep digging, but my primary focus is figuring out why “-n” isn’t working.
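For anyone triaging a similar hang across many ranks, here is a minimal Python sketch that tallies where each gdb backtrace is stuck. It is not part of pftool or MarFS; the helper names (`split_traces`, `innermost_counts`) are hypothetical, and the sample text is abbreviated from two of the backtraces in Dave's email. It assumes a new backtrace starts whenever the frame number stops increasing.

```python
import re
from collections import Counter

# Matches one gdb frame header, e.g. "#3 0x00002ac20c9dcd2a in Curl_close ("
FRAME = re.compile(r"#(\d+)\s+0x[0-9a-f]+ in ([\w:]+) \(")

# Abbreviated from two of the backtraces quoted in the email.
SAMPLE = (
    "#1 0x00002ac20d278a38 in opal_memory_ptmalloc2_int_free (av=0x2ac21d600020, mem=) at malloc.c:4453 "
    "#2 0x00002ac20d278d03 in opal_memory_ptmalloc2_free (mem=0x2ac21d6167c0) at malloc.c:3511 "
    "#3 0x00002ac20c9dcd2a in Curl_close () from /usr/lib64/libcurl.so.4\n"
    "#0 0x00002ab42c2b1a30 in malloc_consolidate (av=0x2ab42c547120) at malloc.c:4567 "
    "#1 0x00002ab42c2b4239 in opal_memory_ptmalloc2_int_malloc (av=0x2ab42c547120, bytes=) at malloc.c:4016"
)

def split_traces(text):
    """Return a list of backtraces, each a list of function names, innermost first."""
    traces, current, last = [], [], -1
    for match in FRAME.finditer(text):
        number, func = int(match.group(1)), match.group(2)
        # Assumption: a frame number that does not increase starts a new backtrace.
        if current and number <= last:
            traces.append(current)
            current = []
        current.append(func)
        last = number
    if current:
        traces.append(current)
    return traces

def innermost_counts(text):
    """Count how many backtraces have each function as their innermost captured frame."""
    return Counter(trace[0] for trace in split_traces(text))

if __name__ == "__main__":
    for func, count in innermost_counts(SAMPLE).most_common():
        print(count, func)
```

The regular expression works on the frames whether they arrive one per line or run together on a single line, which is how pasted gdb output often ends up.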

jti-lanl commented 8 years ago

Clues:

jti-lanl commented 8 years ago

After cleaning up some memory leaks in libaws4c, and tweaking the libaws4c calls in stream_open(), I can't reproduce the problem anymore.

TBD: There is still one valgrind nit to pick.

jti-lanl commented 8 years ago

Moving the valgrind nit to a new issue, so I can close this.

thewacokid commented 8 years ago

So, I know this is closed, but just as confirmation (since this bailed out improperly before, and restart didn't handle it well even when forced) - a 100K file copy:

[dbonnie@fta01 dbonnie]$ pfcp -Rvn /lustre/scratch/dbonnie/ /campaign/admins/
… TRUNCATED OUTPUT …
INFO FOOTER ======================== NONFATAL ERRORS = 0 ================================
INFO FOOTER =================================================================================
INFO FOOTER Total Files/Links Examined: 100986
INFO FOOTER Total Dirs Examined: 3794
INFO FOOTER Total Buffers Written: 100989
INFO FOOTER Total Bytes Copied: 7462867207
INFO FOOTER Total Megabytes Copied: 7117
INFO FOOTER Data Rate: 10 MB/second
INFO FOOTER Elapsed Time: 663 seconds
Launched /opt/campaign/pftool/installed/bin/pfcp from host fta04.localdomain at: Thu Mar 31 10:13:33 MDT 2016
Job finished at: Thu Mar 31 10:24:38 MDT 2016

[dbonnie@fta01 dbonnie]$ pfcp -Rn /lustre/scratch/dbonnie/ /campaign/admins/
"/lustre/scratch/dbonnie"
pfcp -Rn /lustre/scratch/dbonnie/ /campaign/admins/
Debugging: dest_path '/campaign/admins' -> dest_node '/campaign/admins/dbonnie'
Debugging: Path subclass is 'MARFS_Path'
Debugging: created directory '/campaign/admins/dbonnie'
manager: creating temp_path /campaign
INFO HEADER ======================== dbonnie5031163132016fta04.localdomain ============================
INFO HEADER Starting Path: /lustre/scratch/dbonnie
INFO FOOTER ======================== NONFATAL ERRORS = 0 ================================
INFO FOOTER =================================================================================
INFO FOOTER Total Files/Links Examined: 100986
INFO FOOTER Total Dirs Examined: 3794
INFO FOOTER Total Buffers Written: 0
INFO FOOTER Total Bytes Copied: 0
INFO FOOTER Elapsed Time: 6 seconds
Launched /opt/campaign/pftool/installed/bin/pfcp from host fta04.localdomain at: Thu Mar 31 10:31:50 MDT 2016
Job finished at: Thu Mar 31 10:31:59 MDT 2016
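As a quick sanity check on the first run's footer, here is a small Python sketch that recomputes the reported megabytes and data rate from the byte count and elapsed time. The `footer_value` helper is hypothetical, not part of pftool. The use of 2^20 bytes per megabyte and integer truncation is an assumption; it is simply the combination that reproduces the reported figures.

```python
import re

# Footer lines copied from the first pfcp run in this comment.
FOOTER = """\
INFO FOOTER Total Bytes Copied: 7462867207
INFO FOOTER Total Megabytes Copied: 7117
INFO FOOTER Data Rate: 10 MB/second
INFO FOOTER Elapsed Time: 663 seconds
"""

MIB = 1024 * 1024  # assumption: "Megabytes" in the footer means 2**20 bytes

def footer_value(text, label):
    """Return the integer that follows 'INFO FOOTER <label>:' in the output."""
    match = re.search(rf"INFO FOOTER {re.escape(label)}:\s+(\d+)", text)
    if match is None:
        raise KeyError(label)
    return int(match.group(1))

total_bytes = footer_value(FOOTER, "Total Bytes Copied")
seconds = footer_value(FOOTER, "Elapsed Time")

megabytes = total_bytes // MIB   # truncating division (assumed)
rate = megabytes // seconds      # MB/second, truncated (assumed)

print("megabytes:", megabytes, "reported:", footer_value(FOOTER, "Total Megabytes Copied"))
print("rate:", rate, "reported:", footer_value(FOOTER, "Data Rate"))
```

With decimal megabytes (10^6 bytes) the total would come out as 7462 rather than 7117, so the footer is consistent with binary units.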

jti-lanl commented 8 years ago

Yay.
