spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License

Out of memory crash for some Wikivoyage pages #555

Closed by zhibek 11 months ago

zhibek commented 11 months ago

Processing some Wikivoyage pages results in an out of memory crash. Example URL: https://en.wikivoyage.org/wiki/Interstate_5

Minimal example:

import wtf from 'wtf_wikipedia';

const doc = await wtf.fetch('https://en.wikivoyage.org/wiki/Interstate_5');

console.log(doc.title());

Results in error:

<--- Last few GCs --->

[270573:0x59a7070]    83740 ms: Mark-sweep 2013.2 (2089.1) -> 2013.2 (2089.1) MB, 17.7 / 0.0 ms  (average mu = 0.991, current mu = 0.980) allocation failure scavenge might not succeed
[270573:0x59a7070]    83895 ms: Mark-sweep 2016.6 (2092.6) -> 2016.6 (2108.6) MB, 17.9 / 0.0 ms  (average mu = 0.984, current mu = 0.884) allocation failure scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb090e0 node::Abort() [/usr/bin/node]
 2: 0xa1b70e  [/usr/bin/node]
 3: 0xce1a20 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/bin/node]
 4: 0xce1dc7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/bin/node]
 5: 0xe99435  [/usr/bin/node]
 6: 0xea90fd v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/bin/node]
 7: 0xeabdfe v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/bin/node]
 8: 0xe6d072 v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [/usr/bin/node]
 9: 0xe65684 v8::internal::FactoryBase<v8::internal::Factory>::AllocateRawWithImmortalMap(int, v8::internal::AllocationType, v8::internal::Map, v8::internal::AllocationAlignment) [/usr/bin/node]
10: 0xe676e1 v8::internal::FactoryBase<v8::internal::Factory>::NewRawTwoByteString(int, v8::internal::AllocationType) [/usr/bin/node]
11: 0x10f8915 v8::internal::String::SlowFlatten(v8::internal::Isolate*, v8::internal::Handle<v8::internal::ConsString>, v8::internal::AllocationType) [/usr/bin/node]
12: 0x11d4003 v8::internal::RegExpImpl::IrregexpExec(v8::internal::Isolate*, v8::internal::Handle<v8::internal::JSRegExp>, v8::internal::Handle<v8::internal::String>, int, v8::internal::Handle<v8::internal::RegExpMatchInfo>, v8::internal::RegExp::ExecQuirks) [/usr/bin/node]
13: 0x11f9088 v8::internal::Runtime_RegExpExec(int, unsigned long*, v8::internal::Isolate*) [/usr/bin/node]
14: 0x15d9e59  [/usr/bin/node]
Aborted (core dumped)
error Command failed with exit code 134.

Tested on a laptop with 16GB RAM running Ubuntu 22.04 / Node 16.20.2.

spencermountain commented 11 months ago

Whoaaaaa! Busy tomorrow, but will look at this first thing Wednesday. Thank you for letting me know.

spencermountain commented 11 months ago

hey John, this is fixed now in 10.1.7 - thanks for the heads-up. someone put a 15-thousand-line geojson blob in the wikivoyage markup, using a deprecated template. Maybe there's a clever fix, but I've just ignored the maplink xml template, for now. cheers
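For anyone stuck on an older version, the idea of ignoring the map data can be sketched as a pre-processing step. This is only a rough illustration, not the change that shipped in 10.1.7: `stripMaplink` is a hypothetical helper, and it assumes the GeoJSON blob sits between literal `<maplink ...>` and `</maplink>` tags in the raw wikitext. It uses plain `indexOf` scanning rather than a regular expression, so the large blob is skipped without any regex matching over it.

```javascript
// Hypothetical helper, not part of wtf_wikipedia.
// Removes every <maplink ...>...</maplink> block from raw wikitext.
// Assumptions: tags are lowercase and properly closed; self-closing
// <maplink ... /> tags are not handled by this sketch.
function stripMaplink(wikitext) {
  const open = '<maplink'
  const close = '</maplink>'
  let out = ''
  let pos = 0
  while (pos < wikitext.length) {
    const start = wikitext.indexOf(open, pos)
    if (start === -1) break // no more opening tags
    const end = wikitext.indexOf(close, start)
    if (end === -1) break // unclosed tag: leave the rest untouched
    out += wikitext.slice(pos, start) // keep the text before the tag
    pos = end + close.length // resume after the closing tag
  }
  // append whatever remains after the last removed block
  return out + wikitext.slice(pos)
}
```

If you already have the page's wikitext as a string (say, in a variable `rawWikitext`), something like `wtf(stripMaplink(rawWikitext))` would then parse it without the map data.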

zhibek commented 11 months ago

Thanks @spencermountain. Seeing your PR, I understand the library better now. If I see any similar issues in the future I'll aim to contribute a PR to fix.

FYI, confirming your fix looks good to me. The following URLs are now processed without crashing or parsing very slowly: