secondlife / jira-archive


[BUG-6883] Crash in LLImageBase::getData() due to arbitrary limits in heap management on Windows #14683

Open sl-service-account opened 10 years ago

sl-service-account commented 10 years ago

Steps to Reproduce

Normal operations. Viewer tends to be up and running.

Actual Behavior

Reviewing VCB reports of crashes from the library refresh viewer (https://osiris.lindenlab.com/viewer_crash_browser/index.php?filter_id=37217).

There's a common crash that's seen in all viewers associated with memory exhaustion in image allocation:


[Thread-Crashed] (No Notes) Filter on Call Stack
[0] LLError::crashAndLoop(std::basic_string<char,std::char_traits<char>,std::allocator<char> > const &)
/llerror.cpp:1273
[1] boost::detail::function::void_function_invoker1<void (*)(LLUUID const &),void,LLUUID const &>::invoke(boost::detail::function::function_buffer &,LLUUID const &)
/function_template.hpp:153
[2] boost::function1<int,std::basic_string<unsigned int,std::char_traits<unsigned int>,std::allocator<unsigned int> > const &>::operator()(std::basic_string<unsigned int,std::char_traits<unsigned int>,std::allocator<unsigned int> > const &)
/function_template.hpp:767
[3] LLError::Log::flush(std::basic_ostringstream<char,std::char_traits<char>,std::allocator<char> > *,LLError::CallSite const &)
/llerror.cpp:1200
[4] LLImageBase::getData()
/llimage.cpp:247
[5] LLImageJ2CKDU::decodeImpl(LLImageJ2C &,LLImageRaw &,float,int,int)
/llimagej2ckdu.cpp:462
[6] LLImageJ2C::decodeChannels(LLImageRaw *,float,int,int)
/llimagej2c.cpp:182
[7] LLImageJ2C::decode(LLImageRaw *,float)
/llimagej2c.cpp:158
[8] LLImageDecodeThread::ImageRequest::processRequest()
/llimageworker.cpp:145
[9] LLQueuedThread::processNextRequest()
/llqueuedthread.cpp:446
[10] LLQueuedThread::run()
/llqueuedthread.cpp:514
[11] LLThread::staticRun(apr_thread_t *,void *)
/llthread.cpp:142
[12] dummy_worker
/thread.c:79
[13] _callthreadstartex
/threadex.c:314
[14] _threadstartex
/threadex.c:292
[15] BaseThreadStart
/Unknown:0

VCB: https://osiris.lindenlab.com/viewer_crash_browser/index.php?filter_id=37228

Log file signature shows allocation failures (which set a flag on the LLImage object) followed by a delayed LLERRS crash when that flag is later tested during ::getData():


2014-08-05T14:25:34Z WARNING: LLImageBase::allocateData: Failed to allocate image data size [3145728]
2014-08-05T14:25:34Z WARNING: LLImageBase::allocateData: Failed to allocate image data size [1572864]
2014-08-05T14:25:34Z WARNING: LLImageBase::allocateData: Failed to allocate image data size [3145728]
2014-08-05T14:25:34Z WARNING: LLImageBase::allocateData: Failed to allocate image data size [3145728]
2014-08-05T14:25:34Z WARNING: LLImageBase::allocateData: Failed to allocate image data size [3145728]
2014-08-05T14:25:34Z WARNING: LLImageBase::allocateData: Failed to allocate image data size [786432]
llimage/llimage.cpp(247) : 2014-08-05T14:25:34Z error
llimage/llimage.cpp(247) : 2014-08-05T14:25:34Z ERROR: LLImageBase::getData: Bad memory allocation for the image buffer!
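
The allocate-then-fail-later sequence in this log can be sketched roughly as below. This is a minimal illustration, not the viewer's code: `SketchImage`, `mBadAlloc` and the `fail_alloc` test hook are hypothetical names, and the sketch throws an exception where the real `LLImageBase::getData()` reports an LLERRS crash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <new>
#include <stdexcept>

// Hypothetical, simplified stand-in for an image buffer class.
class SketchImage {
public:
    SketchImage() = default;
    SketchImage(const SketchImage&) = delete;
    SketchImage& operator=(const SketchImage&) = delete;
    ~SketchImage() { delete[] mData; }

    // Try to allocate; on failure only warn and remember the failure.
    // fail_alloc is a test hook that simulates heap exhaustion.
    void allocateData(std::size_t size, bool fail_alloc = false) {
        assert(size > 0);
        delete[] mData;
        mData = fail_alloc ? nullptr : new (std::nothrow) std::uint8_t[size];
        if (!mData) {
            std::fprintf(stderr, "WARNING: failed to allocate image data size [%zu]\n", size);
            mBadAlloc = true;  // recorded, but the caller is not told
            mSize = 0;
            return;
        }
        mBadAlloc = false;
        mSize = size;
    }

    // Deferred check: whichever accessor runs first pays for the earlier failure.
    std::uint8_t* getData() {
        if (mBadAlloc) {
            throw std::runtime_error("Bad memory allocation for the image buffer!");
        }
        return mData;
    }

    bool isBadAlloc() const { return mBadAlloc; }

private:
    std::uint8_t* mData = nullptr;
    std::size_t mSize = 0;
    bool mBadAlloc = false;
};
```

In this sketch the warning and the fatal error are separated in time and in call stack, which matches the shape of the log above: several warnings first, then one error at the accessor.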

Looking at llmemory.cpp quickly, this seems to be the result of a complicated, fussy and 64-bit-hostile design. Process working sets (this is a Windows-only crash) are about 1GB, so there's plenty of room; we're just hitting the private pool limit, I think, and dying in unrelated code. Really appears unnecessary.

Here's another manifestation:


[Thread-Crashed] (No Notes) Filter on Call Stack
[0] LLError::crashAndLoop(std::basic_string<char,std::char_traits<char>,std::allocator<char> > const &)
/llerror.cpp:1273
[1] boost::detail::function::void_function_invoker1<void (*)(LLUUID const &),void,LLUUID const &>::invoke(boost::detail::function::function_buffer &,LLUUID const &)
/function_template.hpp:153
[2] boost::function1<int,std::basic_string<unsigned int,std::char_traits<unsigned int>,std::allocator<unsigned int> > const &>::operator()(std::basic_string<unsigned int,std::char_traits<unsigned int>,std::allocator<unsigned int> > const &)
/function_template.hpp:767
[3] LLError::Log::flush(std::basic_ostringstream<char,std::char_traits<char>,std::allocator<char> > *,LLError::CallSite const &)
/llerror.cpp:1200
[4] LLImageRaw::scale(int,int,int)
/llimage.cpp:884
[5] LLViewerFetchedTexture::addToCreateTexture()
/llviewertexture.cpp:1349
[6] LLViewerFetchedTexture::updateFetch()
/llviewertexture.cpp:1910
[7] LLViewerTextureList::updateImagesFetchTextures(float)
/llviewertexturelist.cpp:1079
[8] LLViewerTextureList::updateImages(float)
/llviewertexturelist.cpp:718
[9] display(int,float,int,int)
/llviewerdisplay.cpp:793
[10] LLAppViewer::mainLoop()
/llappviewer.cpp:1454
[11] WinMain
/llappviewerwin32.cpp:322
[12] __tmainCRTStartup
/crtexe.c:547
[13] BaseThreadInitThunk
/Unknown:0
[14] __RtlUserThreadStart
/Unknown:0
[15] _RtlUserThreadStart
/Unknown:0

In the above case, the allocation has failed, likely leaving the image dimensions damaged (namely zero). A resize operation then fails on an assert_always() test because the resized image would be zero-length. The log file in this case looks like:


2014-08-04T19:37:09Z WARNING: LLImageBase::allocateData: Failed to allocate image data size [1048576]
llimage/llimage.cpp(884) : 2014-08-04T19:37:09Z error
llimage/llimage.cpp(884) : 2014-08-04T19:37:09Z ERROR: LLImageRaw::scale: ASSERT (temp_data_size > 0)
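
A rough sketch of how zeroed dimensions could trip a check like this one. The helper names and the size formula (source width times destination height times components) are assumptions for illustration and are not taken from llimage.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes needed for the intermediate buffer of a two-pass scale.
// Assumption: the first pass resizes vertically, so the buffer is
// src_width x dst_height x components.
std::size_t tempScaleBufferSize(int src_width, int dst_height, int components) {
    return static_cast<std::size_t>(src_width) * dst_height * components;
}

// Mirrors the shape of the failing check: aborts when the computed size is zero.
void requireScalable(int src_width, int dst_height, int components) {
    assert(tempScaleBufferSize(src_width, dst_height, components) > 0);
}
```

With a source width of 0 (the suspected result of the failed allocation), `tempScaleBufferSize(0, 512, 3)` is 0 and `requireScalable` would abort, whatever the destination size.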

VCB: https://osiris.lindenlab.com/viewer_crash_browser/index.php?filter_id=37229

And another:


2014-08-04T00:02:55Z WARNING: LLImageBase::allocateData: Failed to allocate image data size [3145728]
2014-08-04T00:02:55Z WARNING: LLImageBase::allocateData: Failed to allocate image data size [3145728]
2014-08-04T00:02:55Z WARNING: LLImageBase::allocateData: Failed to allocate image data size [3145728]
2014-08-04T00:02:55Z WARNING: LLViewerTextureList::processImageNotInDatabase: not in db
2014-08-04T00:02:55Z WARNING: LLViewerFetchedTexture::setIsMissingAsset: 95f7a142-b0f5-4e1b-bb78-7f095f5d29bc: Marking image as missing
2014-08-04T00:02:55Z WARNING: LLImageBase::allocateData: Failed to allocate image data size [3145728]
2014-08-04T00:02:55Z WARNING: LLImageBase::allocateData: Failed to allocate image data size [3145728]
2014-08-04T00:02:55Z WARNING: LLImageBase::allocateData: Failed to allocate image data size [3145728]
llimage/llimage.cpp(174) : 2014-08-04T00:02:55Z error
llimage/llimage.cpp(174) : 2014-08-04T00:02:55Z ERROR: LLImageBase::allocateData: LLImageBase::allocateData called with bad dimensions: 0x0x1

VCB: https://osiris.lindenlab.com/viewer_crash_browser/index.php?filter_id=37231

Expected Behavior

Not crash.

Other information

Links

Related

Original Jira Fields

| Field | Value |
| ------------- | ------------- |
| Issue | BUG-6883 |
| Summary | Crash in LLImageBase::getData() due to arbitrary limits in heap management on Windows |
| Type | Bug |
| Priority | Unset |
| Status | Accepted |
| Resolution | Accepted |
| Reporter | Monty Linden (monty.linden) |
| Created at | 2014-08-05T16:11:51Z |
| Updated at | 2016-06-27T17:46:23Z |

```
{
  'Business Unit': ['Platform'],
  'Date of First Response': '2014-08-05T17:00:03.862-0500',
  'System': 'SL Viewer',
  'Target Viewer Version': 'viewer-development',
  'What were you doing when it happened?': 'Normal operations. Viewer tends to be up and running.',
  'What were you expecting to happen instead?': 'Not crash.',
  'Where': "Many regions, 'Skolldir II' was most common in the sample.",
}
```

sl-service-account commented 10 years ago

Nicky Dasmijn commented at 2014-08-05T22:00:04Z

In my experience this is not really a Windows-specific error. For FS I have heard the same from quite a few Mac people. (I suspect crash reporting on Linux/Mac has not been reliable since the breakpad Windows OOP changes, but I might be wrong on that.)

What I think simply happens here is that memory gets fragmented and there is not enough contiguous memory to fulfill the request. The viewer code is not very memory-friendly in that regard, as there are a lot of small allocations scattered all over the heap, for which the Win32 heap might be a bit worse than others.
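
To illustrate the fragmentation argument, here is a toy model (not viewer code, and not a measurement of any real heap). It treats free memory as a made-up list of gaps between live allocations and shows that a request can fail even when total free memory is far larger than the request.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Toy model: each entry is the size in bytes of one free gap between live allocations.
struct FreeSpace {
    std::vector<std::size_t> gaps;

    // Sum of all free bytes, regardless of where they are.
    std::size_t total() const {
        return std::accumulate(gaps.begin(), gaps.end(), std::size_t{0});
    }

    // Size of the biggest single gap.
    std::size_t largest() const {
        return gaps.empty() ? 0 : *std::max_element(gaps.begin(), gaps.end());
    }

    // One allocation needs one gap at least as large as the request.
    bool canAllocate(std::size_t request) const {
        assert(request > 0);
        return largest() >= request;
    }
};
```

For example, 400 gaps of 1 MiB each give 400 MiB free in total, yet a 3145728-byte image buffer (the size seen in the logs) cannot be placed because no single gap is big enough.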

sl-service-account commented 10 years ago

Monty Linden commented at 2014-08-07T19:52:09Z, updated at 2014-08-07T19:55:41Z

@Nicky: All that is certainly possible (Mac, heap) but I'm not seeing any Mac with the signature. That may be because I haven't dug far enough down yet.

It looks like all the private pool stuff in llmemory.cpp is now dead code. Default settings disable it and it appears the crashers haven't accidentally enabled it. (Certain log messages should show up in case of allocation failure in this case.) Would like to report on the condition of process VM under failure here and see if we're not just doing something stupid in allocation like getting in the way of heap binning by the allocator.

Another thing specific to LLImage is that this allocation failure is lazy: while logging occurs immediately, failure is deferred for later (despite being inevitable). Then the failure happens all over: in KDU decode, in image scaling, in other operations. What would have been an unambiguous allocation error is now a dozen different failures. sigh
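
For contrast with the lazy behaviour described above, a sketch of failing at the point of allocation, so that every out-of-memory condition produces one signature. The function name and the `simulate_failure` hook are hypothetical; this is one possible approach, not a proposed patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Hypothetical helper: return an image buffer or fail immediately.
// Throws std::bad_alloc at the allocation site instead of setting a flag
// that some later accessor has to trip over.
std::unique_ptr<std::uint8_t[]> allocateImageBufferOrThrow(std::size_t size,
                                                          bool simulate_failure = false) {
    assert(size > 0);
    std::uint8_t* raw = simulate_failure ? nullptr : new (std::nothrow) std::uint8_t[size];
    if (!raw) {
        throw std::bad_alloc();  // single, unambiguous failure point
    }
    return std::unique_ptr<std::uint8_t[]>(raw);
}
```

A caller that can recover (for example by dropping the texture request) catches `std::bad_alloc` right there; a caller that cannot recover still crashes, but with the allocation call on the stack rather than a decode or scale routine.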

sl-service-account commented 10 years ago

Nicky Dasmijn commented at 2014-08-07T21:18:32Z

I did speak with a few Mac peeps that had this issue (not enough for statistical relevance), and the Jira Brain told me that Mac people suffer quite a bit from it. The private memory pool is indeed dead (which is a good thing) and everything is handled by the OS.

Regarding making the viewer more resilient against out of memory ... good luck with that :| It's like a journey around the world. BTDT. You come from the small to the big and all around over and over.

sl-service-account commented 9 years ago

Whirly Fizzle commented at 2015-04-27T22:19:18Z

Possible case on Mac? BUG-9150

sl-service-account commented 9 years ago

Whirly Fizzle commented at 2015-08-24T19:39:35Z

BUG-9959 looks like a case of this?