Open waddlesplash opened 3 years ago
vimeo.com doesn't crash, but it also doesn't really work (I got "Player Error" one time, and then a 'working' player the second time, but it had only audio and no video.)
vimeo.com doesn't crash, but it also doesn't really work (I got "Player Error" one time, and then a 'working' player the second time, but it had only audio and no video.)
Many sites will not play multimedia due to lack of necessary codecs - h264 and h265 and mp4-p14. They are only enabled when you build qtwebengine with the -proprietary-codecs switches.
I rebuilt with this option. The OpenH264 build is enabled. Now the radio is playing successfully on https://www.di.fm and the video on https://vimeo.com is working.
This raises the question of the legality of the inclusion of this option. What shall we do?
We already include patented codecs in FFmpeg (MP4 decoder, etc.) and WebKit supports them. I think we can enable this for now, anyway, and if it turns out to be a problem we can revisit that later.
Notably, the stack trace is completely different each time, at least to my eyes it hasn't looked the same twice, neither with the top function nor with functions lower down
Here are top crashers from about 18 attempts to load youtube (first number is how many times it crashed in that function):
grep stack.trace /var/log/syslog | awk {'print $7'} | sort | c++filt | uniq -c | sort -r
5 blink::HTMLConstructionSite::ExecuteTask(blink::HTMLConstructionSiteTask&)
2 blink::ExecutionContextLifecycleObserver::GetExecutionContext() const
2 blink::ContainerNode::EnsurePreInsertionValidity(blink::Node const&, blink::Node const*, blink::Node const*, blink::ExceptionState&) const
1 logging::LogMessage::~LogMessage()
1 blink::V8CodeCache::SetCacheTimeStamp(blink::SingleCachedMetadataHandler*)
1 blink::V8CodeCache::ProduceCache(v8::Isolate*, v8::Local<v8::Script>, blink::ScriptSourceCode const&, blink::V8CodeCache::ProduceCacheOptions)
1 blink::SVGDocumentExtensions::ServiceAnimations()
1 blink::ScheduledAction::Dispose()
1 blink::Node::ownerDocument() const
1 blink::MarkingVisitorBase<blink::ConcurrentMarkingVisitor>::Visit(void const*, blink::TraceDescriptor)
1 blink::FetchEvent::FetchEvent(blink::ScriptState*, WTF::AtomicString const&, blink::FetchEventInit const*, blink::FetchRespondWithObserver*, blink::WaitUntilObserver*, mojo::PendingRemote<blink::mojom::blink::WorkerTimingContainer>, bool)
1 blink::Element::AttributeChanged(blink::Element::AttributeModificationParams const&)
1 blink::ContainerNode::RecalcDescendantStyles(blink::StyleRecalcChange)
The list is different for @threedeyes:
We already include patented codecs in FFmpeg (MP4 decoder, etc.) and WebKit supports them. I think we can enable this for now, anyway, and if it turns out to be a problem we can revisit that later.
Should we then enable x264 too? https://github.com/haikuports/haikuports/issues/4919
So, I took a closer look at one of the crash reports.
thread 2208: QtWebEngineProcess (main)
state: Exception (Segment violation)
Frame IP Function Name
-----------------------------------------------
0x7fbc86dd7180 0x3ab175be9c blink::ResourceLoader::ResourceLoader(blink::ResourceFetcher*, blink::ResourceLoadScheduler*, blink::Resource*, blink::ResourceRequestBody, unsigned int) + 0x26c
Disassembly:
....
0x0000003ab175be9c: 8b9ff0020000 mov 0x2f0(%rdi), %ebx <--
rdi: 0x000008906d6e1a70
26698 0x8906d6e0000 0x8906d820000 1280 rw- none libQt5WebEngineCore.so.5 mmap a
So, the area exists and has RW permissions, so we shouldn't have a fault. What's gone wrong here, then, I wonder?
From a syslog (different crash, same site):
KERN: user access on kernel area 0x9596 at 0x00001702511e3000
KERN: vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0x1702511e30a0, ip 0x182cf7258b6, write 1, user 1, exec 0, thread 0x1178
While I don't have an areas listing it seems quite probable that area also exists. So perhaps our memory management is mishandling permissions.
Here's an even closer look at one of the crashes:
read access attempted on read-protected area 0xde8d at 0x000036fcd0b42000
vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0x36fcd0b42a60, ip 0x1811be0710b, write 0, user 1, exec 0, thread 0x1f01
thread 7937: QtWebEngineProcess (main)
state: Exception (Segment violation)
Frame IP Function Name
-----------------------------------------------
0x7fe5ede7acf0 0x1811be0710b blink::RuleSet::AddToRuleSet(WTF::AtomicString const&, blink::HeapHashMap<WTF::AtomicString, blink::Member<blink::HeapLinkedStack<blink::Member<blink::RuleData const> > >, WTF::AtomicStringHash, WTF::HashTraits<WTF::AtomicString>, WTF::HashTraits<blink::Member<blink::HeapLinkedStack<blink::Member<blink::RuleData const> > > > >&, blink::RuleData const*) + 0x4b
Disassembly:
blink::RuleSet::AddToRuleSet(WTF::AtomicString const&, blink::HeapHashMap<WTF::AtomicString, blink::Member<blink::HeapLinkedStack<blink::Member<blink::RuleData const> > >, WTF::AtomicStringHash, WTF::HashTraits<WTF::AtomicString>, WTF::HashTraits<blink::Member<blink::HeapLinkedStack<blink::Member<blink::RuleData const> > > > >&, blink::RuleData const*):
...
0x000001811be0710b: 498b0c24 mov (%r12), %rcx <--
r12: 0x000036fcd0b42a60
kdebug> area 0xde8d
AREA: 0xffffffff976f9c00
name: 'libQt5WebEngineCore.so.5 mmap a'
owner: 0x1f01
id: 0xde8d
base: 0x36fcd0ac0000
size: 0x140000
protection: 0x30
page_protection:0xffffffff97ba18e8
wiring: 0x0
memory_type: 0x0
cache: 0xffffffff975486c8
cache_type: RAM
cache_offset: 0x0
cache_next: 0x0000000000000000
cache_prev: 0x0000000000000000
page mappings: 183
kdebug> db 0xffffffff97ba18e8 160
[0xffffffff97ba18e8] 033333333333333. 30 33 33 33 33 33 33 33 33 33 33 33 33 33 33 03
[0xffffffff97ba18f8] ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[0xffffffff97ba1908] ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[0xffffffff97ba1918] ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[0xffffffff97ba1928] ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[0xffffffff97ba1938] 033333333333333. 30 33 33 33 33 33 33 33 33 33 33 33 33 33 33 03
[0xffffffff97ba1948] 033333333333333. 30 33 33 33 33 33 33 33 33 33 33 33 33 33 33 03
[0xffffffff97ba1958] 033333333333333. 30 33 33 33 33 33 33 33 33 33 33 33 33 33 33 03
[0xffffffff97ba1968] 033333333333333. 30 33 33 33 33 33 33 33 33 33 33 33 33 33 33 03
[0xffffffff97ba1978] 033333333333333. 30 33 33 33 33 33 33 33 33 33 33 33 33 33 33 03
0x000036fcd0b42000-0x36fcd0ac0000=0x82000
; then /0x1000
(B_PAGE_SIZE) = 0x82
, then /2 = 0x41
or 65
, so the protections should be at the high 4 bits of 0xFFFFFFFF97BA1929
, which are indeed 0
in the above array.
So, why did Blink try to access an area it set PROT_NONE on?
I added a debugger
invocation on mprotect
when page_protections is specified as 0
(and more than 2 pages are specified in size
) and all the invocations looked normal (i.e. invoked by the blink page_allocator) up until the crash. So maybe the problem is really just that the page isn't getting set as actually used? So, I guess the question is what is going wrong for that not to happen?
I added some hacks to the syscall to always have B_READ_AREA | B_WRITE_AREA
protections. Now instead, I get stuff like:
KERN: vm_soft_fault: va 0x2000000000 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x200000004b, ip 0xe1695cf4c2, write 0, user 1, exec 0, thread 0x8bd
So, it seems there is just some other kind of memory management issue going on.
0x2000000000 is 128GB, weird.
https://genodians.org/nfeske/2022-01-27-browser-odyssey
Even with the proprietary codes enabled, we faced a long standing issue with accessing YouTube on ARM. The Javascript engine would crash not long after YouTube presented its cookie-confirmation dialog. Our investigation led us into the inner guts of the Javascript's JIT compiler, which produced binary code with inter-thread synchronization issues. We ultimately found that the problematic code is actually never used in the Linux version of Falkon. It turns out that the difference was caused by the optimization webengine-v8-snapshot. Early on during our porting effort, we deliberately disabled this optimization to ease our life. With the optimization enabled, frequently used Javascript code is pre-compiled at the compilation time of the browser. The pre-compiled code works correctly. With the optimization disabled, the problem occurs on Linux as well. Hence we could side-step the problem by enabling the optimization.
This sounds relevant. Do we have that optimization disabled, maybe?
According to this JSON file, V8 Snapshot support might depend on having a 32-bit host compiler -- so maybe it is disabled indeed?
V8 Snapshot support is enabled, see https://github.com/korli/qtwebengine-haiku/blob/haiku-5.15.7/src/core/configure.json#L111
the warning is about cross-compiling.
But are any snapshots actually built and used -- is there some way to confirm that?
There appear to be some usages of static thread_local variables in V8, for example (but there are a lot more than just this one): https://github.com/korli/qtwebengine-haiku/blob/d95937135297d7b0439b6e919c89892e80de48b8/src/3rdparty/chromium/v8/src/torque/contextual.h#L88
Perhaps this is the same problem GNUTLS ran into? But @X547 did not manage to make a minimal testcase for that, even...
Using the user_mutex
refactor patches, I get different behavior now here. Most of the time, I just get a deadlock before page loading finishes (which could well be a lingering bug in the new user_mutex
code, but, I'm not entirely sure about that.) But some of the time, I get an assertion-failure crash like this:
thread 3597: pthread func
state: Call (mutex was not actually locked!)
Frame IP Function Name
-----------------------------------------------
00000000 0xce64ab8da7 _kern_debugger + 0x7
Disassembly:
_kern_debugger:
0x000000ce64ab8da0: 48c7c0e5000000 mov $0xe5, %rax
0x000000ce64ab8da7: 0f05 syscall <--
0x7f6d30f69b60 0xce64b42d98 BPrivate::processHeap::free(void*) + 0x138
0x7f6d30f69b90 0xce64b43f84 free + 0x44
0x7f6d30f69bf0 0x14afeb6e206 _ZNSt6vectorISt4pairIt13scoped_refptrIN2cc4TaskEEESaIS5_EE17_M_realloc_insertIJS5_EEEvN9__gnu_cxx17(, ) + 0x116
0x7f6d30f69c60 0x14afeb703c9 cc::TaskGraphWorkQueue::GetNextTaskToRun(unsigned short) + 0x269
0x7f6d30f69d30 0x14b0059dd0b content::CategorizedWorkerPool::RunTaskInCategoryWithLockAcquired(cc::TaskCategory) + 0x2b
0x7f6d30f69d80 0x14b0059dfea content::CategorizedWorkerPool::Run(std::vector<cc::TaskCategory, std::allocator<cc::TaskCategory> > const&, base::ConditionVariable*) + 0x6a
0x7f6d30f69db0 0x14afe2e5b48 base::_GLOBAL__N_1::ThreadFunc(void*) + 0x48
0x7f6d30f69dd0 0xce64ac7105 pthread_thread_entry(void*, void*) + 0x15
00000000 0x7faeead5b258 commpage_thread_exit + 0
All the ones I have gotten so far with YouTube have that exact stack trace. There are some other websites that trigger this crash, with slightly different stack traces, but still similar. For example, CanvasMark seems to trigger it very reliably around "Test 2".
I haven't actually completed a full page load or benchmark run because it either deadlocks or crashes. The deadlocks are in code which looks related. For example, here's one of the hung threads:
2 ffffffff827add60 (+ 112) ffffffff80057d45 <kernel_x86_64> ConditionVariableEntry::Wait(unsigned int, long) + 0x155
3 ffffffff827addd0 (+ 336) ffffffff8009a930 <kernel_x86_64> _user_mutex_switch_lock + 0x470
4 ffffffff827adf20 (+ 16) ffffffff8014987f <kernel_x86_64> x86_64_syscall_entry + 0xfb
user iframe at 0xffffffff827adf30 (end = 0xffffffff827adff8)
rax 0x9 rbx 0x122781572c10 rcx 0x143fc1d6fec
rdx 0x122781572c24 rsi 0x0 rdi 0x122781572b24
rbp 0x7f4fcd04c550 r8 0x0 r9 0x7fffffffffffffff
r10 0x143fc2737e8 r11 0x246 r12 0x122781572b20
r13 0x0 r14 0x0 r15 0x7fffffffffffffff
rip 0x143fc1d6fec rsp 0x7f4fcd04c518 rflags 0x246
vector: 0x63, error code: 0x0
5 ffffffff827adf30 (+139982824662560) 00000143fc1d6fec <libroot.so> _kern_mutex_switch_lock + 0x0c
6 00007f4fcd04c550 (+ 32) 00000143fc1e6d0b <libroot.so> pthread_cond_wait + 0x1b
7 00007f4fcd04c570 (+ 144) 000001a668418768 <libQt5WebEngineCore.so.5> _ZNSt6vectorIPcSaIS0_EE7reserveEm (nearest) + 0x1c28
8 00007f4fcd04c600 (+ 80) 000001a66a6d0039 <libQt5WebEngineCore.so.5> _ZNSt10_HashtableIjjSaIjENSt8__detail9_IdentityESt8equal_toIjESt4hashIjENS1_18_Mod_range_hashingENS1_20_Default_ranged_hashENS1_20_Prime_rehash_policyENS1_17_Hashtable_traitsILb0ELb1ELb1EEEE8_M_eraseEmPNS1_15_Hash_node_baseEPNS1_10_Hash_nodeIjLb0EEE (nearest) + 0x8299
The mutex it is waiting for is indeed locked (0x3
). There are a lot of threads hung like this on a lot of different locks, I couldn't find any that seemed to be the root of the problem.
Ha, it's not just under the new code, I can reproduce the deadlocks on CanvasMark at least under hrev57080. But that's after the first round of fixes, anyway.
I've actually managed to trigger that free
mutex assertion with GTKWebKit (not with HaikuWebKit yet, at least.) There it's in JSC::DFG::BasicBlock::~BasicBlock
.
Following the fixes in hrev57197
, deadlocks are much less frequent (if they happen at all.)
One thing I have consistently noticed is that, if the YouTube video thumbnails start loading before the crash dialog appears, you can actually use YouTube like normal (play videos, etc.) despite the crash dialog being present...
Of course, once you choose "Terminate", the tab closes entirely. But it seems pretty weird that some of the "crashes" don't actually prevent you from browsing normally, until the process is actually terminated.
I tried opening youtube in hrev57197, however it only managed to load video thumbnails once out of ~100 times.
More like 1 in 5-10 here.
This is with 5.15.2.
Seems to happen "most of" the time. Notably, the stack trace is completely different each time, at least to my eyes it hasn't looked the same twice, neither with the top function nor with functions lower down. Sometimes it has what appears to be an invalid memory access, other times it clearly has a NULL dereference.
HTML5 video & audio is working on other sites, though, so it may not be media related.