spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License
778 stars, 129 forks

Image with double-newline #497

Open ivan-kuzma-scx opened 2 years ago

ivan-kuzma-scx commented 2 years ago

Hello @spencermountain ,

I just found an issue with the first sentence when parsing the Jesus article: https://en.wikipedia.org/wiki/Jesus (two screenshots attached)

Cheers

spencermountain commented 2 years ago

thanks, got a fix for this on dev

ivan-kuzma-scx commented 2 years ago

Thank you!

ivan-kuzma-scx commented 2 years ago

Hello @spencermountain, I have found another one. Not sure if they are related.

https://en.wikipedia.org/wiki/Byzantine_Empire

(screenshot attached)

spencermountain commented 2 years ago

thanks @Patrik-scx - i've reproduced this below:

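const wtf = require('wtf_wikipedia')

// the IPAc-en template contains a lone `{` among its arguments
let str = `The '''Byzantine Empire''' {{IPAc-en|z|{|n}} also referred to as the Eastern Roman Empire`
let doc = wtf(str)
console.log(doc.sentences()[0].text())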

looks like the IPAc template is getting caught on the lone { character inside its arguments. Will add a fix for this to the next release. cheers
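To illustrate why a lone `{` can trip template matching, here is a minimal sketch of a scanner that only treats the two-character pairs `{{` and `}}` as delimiters, so a single brace inside the arguments is skipped. `findTemplates` is a hypothetical helper written for this example; it is not the parser wtf_wikipedia uses, and it does not handle triple-brace `{{{param}}}` syntax.

```javascript
// Hypothetical helper, NOT part of wtf_wikipedia:
// collect top-level {{...}} spans, ignoring single braces.
function findTemplates(str) {
  const found = []
  let depth = 0 // how many {{ are currently open
  let start = -1 // index where the outermost {{ began
  let i = 0
  while (i < str.length) {
    const pair = str.slice(i, i + 2)
    if (pair === '{{') {
      if (depth === 0) start = i
      depth += 1
      i += 2
    } else if (pair === '}}' && depth > 0) {
      depth -= 1
      if (depth === 0) found.push(str.slice(start, i + 2))
      i += 2
    } else {
      // a lone { or } (or any other character) is just text
      i += 1
    }
  }
  return found
}

const sample = `The '''Byzantine Empire''' {{IPAc-en|z|{|n}} also referred to as the Eastern Roman Empire`
console.log(findTemplates(sample))
```

A scanner that counted every single `{` and `}` instead would see three opening braces and only two closing ones in this input, and the template would never appear to close.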

ivan-kuzma-scx commented 2 years ago

Cheers

ivan-kuzma-scx commented 1 year ago

Hello @spencermountain ,

Just found some stray metadata in the first sentence of "Jewish diaspora".

Screenshot 2022-11-29 at 13 23 06

spencermountain commented 1 year ago

here's the issue - the <br/> tag becomes a double-newline, which is considered two paragraphs, which trips the image parser:

const wtf = require('wtf_wikipedia')

// the caption contains a <br/> followed by a newline
let str = `[[File:Jewish people around the world.svg|thumb|Map of the Jewish diaspora.<br/>
foobar]]`
let doc = wtf(str)
console.log(doc.images())
// []

this may be a stumper. I'm pretty wary of supporting links that span paragraphs. cheers
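One possible caller-side workaround, sketched under assumptions: collapse `<br>` tags and the whitespace around them into a single space before handing the wikitext to the parser, so the image caption stays on one line. `flattenBreakTags` is a hypothetical helper, not part of wtf_wikipedia, and this thread does not confirm that it makes the image parse.

```javascript
// Hypothetical pre-processing helper, NOT part of wtf_wikipedia.
// Replaces <br>, <br/>, <br /> (any case), plus any whitespace or
// newlines around the tag, with a single space.
function flattenBreakTags(wikitext) {
  return wikitext.replace(/\s*<br\s*\/?>\s*/gi, ' ')
}

const raw = `[[File:Jewish people around the world.svg|thumb|Map of the Jewish diaspora.<br/>
foobar]]`
console.log(flattenBreakTags(raw))
```

The flattened string could then be passed to `wtf()`. Note that this rewrites every break tag in the input, including ones where the line break is meaningful, so it trades fidelity for robustness.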