spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License

Out of memory crash for some Wikivoyage pages #555

Closed: zhibek closed this issue 1 year ago

zhibek commented 1 year ago

Processing some Wikivoyage pages results in an out of memory crash. Example URL: https://en.wikivoyage.org/wiki/Interstate_5

Minimal example:

import wtf from 'wtf_wikipedia';

const doc = await wtf.fetch('https://en.wikivoyage.org/wiki/Interstate_5');

console.log(doc.title());

Results in error:

<--- Last few GCs --->

[270573:0x59a7070]    83740 ms: Mark-sweep 2013.2 (2089.1) -> 2013.2 (2089.1) MB, 17.7 / 0.0 ms  (average mu = 0.991, current mu = 0.980) allocation failure scavenge might not succeed
[270573:0x59a7070]    83895 ms: Mark-sweep 2016.6 (2092.6) -> 2016.6 (2108.6) MB, 17.9 / 0.0 ms  (average mu = 0.984, current mu = 0.884) allocation failure scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb090e0 node::Abort() [/usr/bin/node]
 2: 0xa1b70e  [/usr/bin/node]
 3: 0xce1a20 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/bin/node]
 4: 0xce1dc7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/bin/node]
 5: 0xe99435  [/usr/bin/node]
 6: 0xea90fd v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/bin/node]
 7: 0xeabdfe v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/bin/node]
 8: 0xe6d072 v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [/usr/bin/node]
 9: 0xe65684 v8::internal::FactoryBase<v8::internal::Factory>::AllocateRawWithImmortalMap(int, v8::internal::AllocationType, v8::internal::Map, v8::internal::AllocationAlignment) [/usr/bin/node]
10: 0xe676e1 v8::internal::FactoryBase<v8::internal::Factory>::NewRawTwoByteString(int, v8::internal::AllocationType) [/usr/bin/node]
11: 0x10f8915 v8::internal::String::SlowFlatten(v8::internal::Isolate*, v8::internal::Handle<v8::internal::ConsString>, v8::internal::AllocationType) [/usr/bin/node]
12: 0x11d4003 v8::internal::RegExpImpl::IrregexpExec(v8::internal::Isolate*, v8::internal::Handle<v8::internal::JSRegExp>, v8::internal::Handle<v8::internal::String>, int, v8::internal::Handle<v8::internal::RegExpMatchInfo>, v8::internal::RegExp::ExecQuirks) [/usr/bin/node]
13: 0x11f9088 v8::internal::Runtime_RegExpExec(int, unsigned long*, v8::internal::Isolate*) [/usr/bin/node]
14: 0x15d9e59  [/usr/bin/node]
Aborted (core dumped)
error Command failed with exit code 134.

Tested on a laptop with 16GB RAM running Ubuntu 22.04 / Node 16.20.2.

spencermountain commented 1 year ago

Whoaaaaa! Busy tomorrow, but will look at this first thing Wednesday. Thank you for letting me know.

spencermountain commented 1 year ago

Hey John, this is fixed now in 10.1.7. Thanks for the heads-up. Someone put a 15-thousand-line GeoJSON blob in the Wikivoyage markup, using a deprecated template. Maybe there's a clever fix, but for now I've just ignored the maplink XML template. Cheers
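For anyone stuck on an older version, here is a minimal sketch of the same idea applied as a pre-processing step: remove the maplink blocks from the raw wikitext before it reaches the parser. This is not the change made in the library. `stripMaplink` is a hypothetical helper, and it assumes the blob is wrapped in lowercase `<maplink ...>` and `</maplink>` tags. It scans with `indexOf` rather than a regex, since the stack trace above dies inside a regex execution on a very large string.

```javascript
// Hypothetical helper, not part of wtf_wikipedia.
// Drops every <maplink ...>...</maplink> block from raw wikitext
// using plain index scanning, so no regex runs over the large blob.
function stripMaplink(wikitext) {
  const open = '<maplink';
  const close = '</maplink>';
  let out = '';
  let pos = 0;
  while (pos < wikitext.length) {
    const start = wikitext.indexOf(open, pos);
    if (start === -1) break; // no more maplink blocks
    const end = wikitext.indexOf(close, start);
    if (end === -1) break; // unclosed tag: leave the rest untouched
    out += wikitext.slice(pos, start); // keep the text before the block
    pos = end + close.length; // resume after the closing tag
  }
  return out + wikitext.slice(pos);
}
```

The cleaned string could then be passed to `wtf()` in place of the original wikitext. Self-closing `<maplink ... />` tags are not handled by this sketch, so it would need adjusting for pages that use them.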

zhibek commented 1 year ago

Thanks @spencermountain. Seeing your PR, I understand the library better now. If I see any similar issues in the future, I'll aim to contribute a PR to fix them.

FYI, I can confirm your fix looks good to me. The following URLs are now processed without crashing or parsing very slowly: