korli / qtwebengine-haiku

BSD 3-Clause "New" or "Revised" License
0 stars 3 forks source link

Crashes on YouTube #5

Open waddlesplash opened 3 years ago

waddlesplash commented 3 years ago

This is with 5.15.2.

Seems to happen "most of" the time. Notably, the stack trace is completely different each time, at least to my eyes it hasn't looked the same twice, neither with the top function nor with functions lower down. Sometimes it has what appears to be an invalid memory access, other times it clearly has a NULL dereference.

HTML5 video & audio is working on other sites, though, so it may not be media related.

waddlesplash commented 3 years ago

vimeo.com doesn't crash, but it also doesn't really work (I got "Player Error" one time, and then a 'working' player the second time, but it had only audio and no video.)

threedeyes commented 3 years ago

vimeo.com doesn't crash, but it also doesn't really work (I got "Player Error" one time, and then a 'working' player the second time, but it had only audio and no video.)

Many sites will not play multimedia due to lack of necessary codecs - h264 and h265 and mp4-p14. They are only enabled when you build qtwebengine with the -proprietary-codecs switches.

threedeyes commented 3 years ago

https://doc.qt.io/qt-5/qtwebengine-features.html

threedeyes commented 3 years ago

I rebuilt with this option. The OpenH264 build is enabled. Now the radio is playing successfully on https://www.di.fm and the video on https://vimeo.com is working.

threedeyes commented 3 years ago

This raises the question of the legality of the inclusion of this option. What shall we do?

waddlesplash commented 3 years ago

We already include patented codecs in FFmpeg (MP4 decoder, etc.) and WebKit supports them. I think we can enable this for now, anyway, and if it turns out to be a problem we can revisit that later.

diversys commented 3 years ago

Notably, the stack trace is completely different each time, at least to my eyes it hasn't looked the same twice, neither with the top function nor with functions lower down

Here are top crashers from about 18 attempts to load youtube (first number is how many times it crashed in that function):

grep stack.trace /var/log/syslog | awk {'print $7'} | sort | c++filt  | uniq -c | sort -r
      5 blink::HTMLConstructionSite::ExecuteTask(blink::HTMLConstructionSiteTask&)
      2 blink::ExecutionContextLifecycleObserver::GetExecutionContext() const
      2 blink::ContainerNode::EnsurePreInsertionValidity(blink::Node const&, blink::Node const*, blink::Node const*, blink::ExceptionState&) const
      1 logging::LogMessage::~LogMessage()
      1 blink::V8CodeCache::SetCacheTimeStamp(blink::SingleCachedMetadataHandler*)
      1 blink::V8CodeCache::ProduceCache(v8::Isolate*, v8::Local<v8::Script>, blink::ScriptSourceCode const&, blink::V8CodeCache::ProduceCacheOptions)
      1 blink::SVGDocumentExtensions::ServiceAnimations()
      1 blink::ScheduledAction::Dispose()
      1 blink::Node::ownerDocument() const
      1 blink::MarkingVisitorBase<blink::ConcurrentMarkingVisitor>::Visit(void const*, blink::TraceDescriptor)
      1 blink::FetchEvent::FetchEvent(blink::ScriptState*, WTF::AtomicString const&, blink::FetchEventInit const*, blink::FetchRespondWithObserver*, blink::WaitUntilObserver*, mojo::PendingRemote<blink::mojom::blink::WorkerTimingContainer>, bool)
      1 blink::Element::AttributeChanged(blink::Element::AttributeModificationParams const&)
      1 blink::ContainerNode::RecalcDescendantStyles(blink::StyleRecalcChange)
diversys commented 3 years ago

The list is different for @threedeyes:

image

diversys commented 3 years ago

We already include patented codecs in FFmpeg (MP4 decoder, etc.) and WebKit supports them. I think we can enable this for now, anyway, and if it turns out to be a problem we can revisit that later.

Should we then enable x264 too? https://github.com/haikuports/haikuports/issues/4919

waddlesplash commented 2 years ago

So, I took a closer look at one of the crash reports.

    thread 2208: QtWebEngineProcess (main)
        state: Exception (Segment violation)

        Frame       IP          Function Name
        -----------------------------------------------
        0x7fbc86dd7180  0x3ab175be9c    blink::ResourceLoader::ResourceLoader(blink::ResourceFetcher*, blink::ResourceLoadScheduler*, blink::Resource*, blink::ResourceRequestBody, unsigned int) + 0x26c 

            Disassembly:
....
                0x0000003ab175be9c:     8b9ff0020000  mov 0x2f0(%rdi), %ebx <--
              rdi:  0x000008906d6e1a70
    26698   0x8906d6e0000   0x8906d820000         1280     rw-      none            libQt5WebEngineCore.so.5 mmap a

So, the area exists and has RW permissions, so we shouldn't have a fault. What's gone wrong here, then, I wonder?

waddlesplash commented 2 years ago

From a syslog (different crash, same site):

KERN: user access on kernel area 0x9596 at 0x00001702511e3000
KERN: vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0x1702511e30a0, ip 0x182cf7258b6, write 1, user 1, exec 0, thread 0x1178

While I don't have an areas listing it seems quite probable that area also exists. So perhaps our memory management is mishandling permissions.

waddlesplash commented 2 years ago

Here's an even closer look at one of the crashes:

read access attempted on read-protected area 0xde8d at 0x000036fcd0b42000
vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0x36fcd0b42a60, ip 0x1811be0710b, write 0, user 1, exec 0, thread 0x1f01
    thread 7937: QtWebEngineProcess (main)
        state: Exception (Segment violation)

        Frame       IP          Function Name
        -----------------------------------------------
        0x7fe5ede7acf0  0x1811be0710b   blink::RuleSet::AddToRuleSet(WTF::AtomicString const&, blink::HeapHashMap<WTF::AtomicString, blink::Member<blink::HeapLinkedStack<blink::Member<blink::RuleData const> > >, WTF::AtomicStringHash, WTF::HashTraits<WTF::AtomicString>, WTF::HashTraits<blink::Member<blink::HeapLinkedStack<blink::Member<blink::RuleData const> > > > >&, blink::RuleData const*) + 0x4b 
            Disassembly:
                blink::RuleSet::AddToRuleSet(WTF::AtomicString const&, blink::HeapHashMap<WTF::AtomicString, blink::Member<blink::HeapLinkedStack<blink::Member<blink::RuleData const> > >, WTF::AtomicStringHash, WTF::HashTraits<WTF::AtomicString>, WTF::HashTraits<blink::Member<blink::HeapLinkedStack<blink::Member<blink::RuleData const> > > > >&, blink::RuleData const*):
...
                0x000001811be0710b:         498b0c24  mov (%r12), %rcx <--
              r12:  0x000036fcd0b42a60
kdebug> area 0xde8d
AREA: 0xffffffff976f9c00
name:       'libQt5WebEngineCore.so.5 mmap a'
owner:      0x1f01
id:     0xde8d
base:       0x36fcd0ac0000
size:       0x140000
protection: 0x30
page_protection:0xffffffff97ba18e8
wiring:     0x0
memory_type:    0x0
cache:      0xffffffff975486c8
cache_type: RAM
cache_offset:   0x0
cache_next: 0x0000000000000000
cache_prev: 0x0000000000000000
page mappings:  183
kdebug> db 0xffffffff97ba18e8 160
[0xffffffff97ba18e8]  033333333333333.   30 33 33 33 33 33 33 33 33 33 33 33 33 33 33 03
[0xffffffff97ba18f8]  ................   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[0xffffffff97ba1908]  ................   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[0xffffffff97ba1918]  ................   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[0xffffffff97ba1928]  ................   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[0xffffffff97ba1938]  033333333333333.   30 33 33 33 33 33 33 33 33 33 33 33 33 33 33 03
[0xffffffff97ba1948]  033333333333333.   30 33 33 33 33 33 33 33 33 33 33 33 33 33 33 03
[0xffffffff97ba1958]  033333333333333.   30 33 33 33 33 33 33 33 33 33 33 33 33 33 33 03
[0xffffffff97ba1968]  033333333333333.   30 33 33 33 33 33 33 33 33 33 33 33 33 33 33 03
[0xffffffff97ba1978]  033333333333333.   30 33 33 33 33 33 33 33 33 33 33 33 33 33 33 03
waddlesplash commented 2 years ago

0x000036fcd0b42000-0x36fcd0ac0000=0x82000; then /0x1000 (B_PAGE_SIZE) = 0x82, then /2 = 0x41 or 65, so the protections should be at the high 4 bits of 0xFFFFFFFF97BA1929, which are indeed 0 in the above array.

So, why did Blink try to access an area it set PROT_NONE on?

waddlesplash commented 2 years ago

I added a debugger invocation on mprotect when page_protections is specified as 0 (and more than 2 pages are specified in size) and all the invocations looked normal (i.e. invoked by the blink page_allocator) up until the crash. So maybe the problem is really just that the page isn't getting set as actually used? So, I guess the question is what is going wrong for that not to happen?

waddlesplash commented 2 years ago

I added some hacks to the syscall to always have B_READ_AREA | B_WRITE_AREA protections. Now instead, I get stuff like:

KERN: vm_soft_fault: va 0x2000000000 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x200000004b, ip 0xe1695cf4c2, write 0, user 1, exec 0, thread 0x8bd

So, it seems there is just some other kind of memory management issue going on.

korli commented 2 years ago

0x2000000000 is 128GB, weird.

waddlesplash commented 2 years ago

https://genodians.org/nfeske/2022-01-27-browser-odyssey

Even with the proprietary codes enabled, we faced a long standing issue with accessing YouTube on ARM. The Javascript engine would crash not long after YouTube presented its cookie-confirmation dialog. Our investigation led us into the inner guts of the Javascript's JIT compiler, which produced binary code with inter-thread synchronization issues. We ultimately found that the problematic code is actually never used in the Linux version of Falkon. It turns out that the difference was caused by the optimization webengine-v8-snapshot. Early on during our porting effort, we deliberately disabled this optimization to ease our life. With the optimization enabled, frequently used Javascript code is pre-compiled at the compilation time of the browser. The pre-compiled code works correctly. With the optimization disabled, the problem occurs on Linux as well. Hence we could side-step the problem by enabling the optimization.

This sounds relevant. Do we have that optimization disabled, maybe?

waddlesplash commented 2 years ago

According to this JSON file, V8 Snapshot support might depend on having a 32-bit host compiler -- so maybe it is disabled indeed?

korli commented 2 years ago

V8 Snapshot support is enabled, see https://github.com/korli/qtwebengine-haiku/blob/haiku-5.15.7/src/core/configure.json#L111

the warning is about cross-compiling.

waddlesplash commented 2 years ago

But are any snapshots actually built and used -- is there some way to confirm that?

waddlesplash commented 2 years ago

There appear to be some usages of static thread_local variables in V8, for example (but there are a lot more than just this one): https://github.com/korli/qtwebengine-haiku/blob/d95937135297d7b0439b6e919c89892e80de48b8/src/3rdparty/chromium/v8/src/torque/contextual.h#L88

Perhaps this is the same problem GNUTLS ran into? But @X547 did not manage to make a minimal testcase for that, even...

waddlesplash commented 2 years ago

Another in tcmalloc itself: https://github.com/korli/qtwebengine-haiku/blob/d95937135297d7b0439b6e919c89892e80de48b8/src/3rdparty/chromium/third_party/tcmalloc/chromium/src/thread_cache.h#L272

waddlesplash commented 1 year ago

Using the user_mutex refactor patches, I get different behavior now here. Most of the time, I just get a deadlock before page loading finishes (which could well be a lingering bug in the new user_mutex code, but, I'm not entirely sure about that.) But some of the time, I get an assertion-failure crash like this:

    thread 3597: pthread func 
        state: Call (mutex was not actually locked!)

        Frame       IP          Function Name
        -----------------------------------------------
        00000000    0xce64ab8da7    _kern_debugger + 0x7 
            Disassembly:
                _kern_debugger:
                0x000000ce64ab8da0:   48c7c0e5000000  mov $0xe5, %rax
                0x000000ce64ab8da7:             0f05  syscall  <--

        0x7f6d30f69b60  0xce64b42d98    BPrivate::processHeap::free(void*) + 0x138 
        0x7f6d30f69b90  0xce64b43f84    free + 0x44 
        0x7f6d30f69bf0  0x14afeb6e206   _ZNSt6vectorISt4pairIt13scoped_refptrIN2cc4TaskEEESaIS5_EE17_M_realloc_insertIJS5_EEEvN9__gnu_cxx17(, ) + 0x116 
        0x7f6d30f69c60  0x14afeb703c9   cc::TaskGraphWorkQueue::GetNextTaskToRun(unsigned short) + 0x269 
        0x7f6d30f69d30  0x14b0059dd0b   content::CategorizedWorkerPool::RunTaskInCategoryWithLockAcquired(cc::TaskCategory) + 0x2b 
        0x7f6d30f69d80  0x14b0059dfea   content::CategorizedWorkerPool::Run(std::vector<cc::TaskCategory, std::allocator<cc::TaskCategory> > const&, base::ConditionVariable*) + 0x6a 
        0x7f6d30f69db0  0x14afe2e5b48   base::_GLOBAL__N_1::ThreadFunc(void*) + 0x48 
        0x7f6d30f69dd0  0xce64ac7105    pthread_thread_entry(void*, void*) + 0x15 
        00000000    0x7faeead5b258  commpage_thread_exit + 0 

All the ones I have gotten so far with YouTube have that exact stack trace. There are some other websites that trigger this crash, with slightly different stack traces, but still similar. For example, CanvasMark seems to trigger it very reliably around "Test 2".

I haven't actually completed a full page load or benchmark run because it either deadlocks or crashes. The deadlocks are in code which looks related. For example, here's one of the hung threads:

 2 ffffffff827add60 (+ 112) ffffffff80057d45   <kernel_x86_64> ConditionVariableEntry::Wait(unsigned int, long) + 0x155
 3 ffffffff827addd0 (+ 336) ffffffff8009a930   <kernel_x86_64> _user_mutex_switch_lock + 0x470
 4 ffffffff827adf20 (+  16) ffffffff8014987f   <kernel_x86_64> x86_64_syscall_entry + 0xfb
user iframe at 0xffffffff827adf30 (end = 0xffffffff827adff8)
 rax 0x9                   rbx 0x122781572c10        rcx 0x143fc1d6fec
 rdx 0x122781572c24        rsi 0x0                   rdi 0x122781572b24
 rbp 0x7f4fcd04c550         r8 0x0                    r9 0x7fffffffffffffff
 r10 0x143fc2737e8         r11 0x246                 r12 0x122781572b20
 r13 0x0                   r14 0x0                   r15 0x7fffffffffffffff
 rip 0x143fc1d6fec         rsp 0x7f4fcd04c518     rflags 0x246
 vector: 0x63, error code: 0x0
 5 ffffffff827adf30 (+139982824662560) 00000143fc1d6fec   <libroot.so> _kern_mutex_switch_lock + 0x0c
 6 00007f4fcd04c550 (+  32) 00000143fc1e6d0b   <libroot.so> pthread_cond_wait + 0x1b
 7 00007f4fcd04c570 (+ 144) 000001a668418768   <libQt5WebEngineCore.so.5> _ZNSt6vectorIPcSaIS0_EE7reserveEm (nearest) + 0x1c28
 8 00007f4fcd04c600 (+  80) 000001a66a6d0039   <libQt5WebEngineCore.so.5> _ZNSt10_HashtableIjjSaIjENSt8__detail9_IdentityESt8equal_toIjESt4hashIjENS1_18_Mod_range_hashingENS1_20_Default_ranged_hashENS1_20_Prime_rehash_policyENS1_17_Hashtable_traitsILb0ELb1ELb1EEEE8_M_eraseEmPNS1_15_Hash_node_baseEPNS1_10_Hash_nodeIjLb0EEE (nearest) + 0x8299

The mutex it is waiting for is indeed locked (0x3). There are a lot of threads hung like this on a lot of different locks, I couldn't find any that seemed to be the root of the problem.

waddlesplash commented 1 year ago

Ha, it's not just under the new code, I can reproduce the deadlocks on CanvasMark at least under hrev57080. But that's after the first round of fixes, anyway.

waddlesplash commented 1 year ago

I've actually managed to trigger that free mutex assertion with GTKWebKit (not with HaikuWebKit yet, at least.) There it's in JSC::DFG::BasicBlock::~BasicBlock.

waddlesplash commented 1 year ago

Following the fixes in hrev57197, deadlocks are much less frequent (if they happen at all.)

One thing I have consistently noticed is that, if the YouTube video thumbnails start loading before the crash dialog appears, you can actually use YouTube like normal (play videos, etc.) despite the crash dialog being present...

Of course, once you choose "Terminate", the tab closes entirely. But it seems pretty weird that some of the "crashes" don't actually prevent you from browsing normally, until the process is actually terminated.

diversys commented 1 year ago

I tried opening youtube in hrev57197, however it only managed to load video thumbnails once out of ~100 times.

waddlesplash commented 1 year ago

More like 1 in 5-10 here.