spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License

Failure to parse long link tags #508

Closed: marcelogp closed this 1 year ago

marcelogp commented 1 year ago

Hey. I've found some links with very long titles and/or targets (sometimes including page anchors) that cause the link parser to break. Here are some examples:

https://en.wikipedia.org/wiki/Andrea_Bocelli [[Best-selling Christmas/holiday albums in the United States#Best-selling Christmas/holiday albums since Nielsen SoundScan tracking began|best-selling holiday albums]]

https://de.wikipedia.org/wiki/U2_(Band) [[Beziehungen zwischen Lateinamerika und den Vereinigten Staaten#El Salvador und Guatemala: Todesschwadronen als Mittel der Politik|damalige US-Politik in El Salvador]]

https://ru.wikipedia.org/wiki/The_Dark_Side_of_the_Moon [[Prog Archives#Топ-25 лучших альбомов прогрессивного рока (по версии Progarchives.com на январь 2015 года)|Топ-25 лучших альбомов прогрессивного рока по версии Progarchives.com]]

I understand the link parser regexes use repetition limits for safety, but I tried increasing these limits. I tested on many pages, and every diff showed parsing that was fixed rather than broken, so I think the increase is safe. Here's a PR: https://github.com/spencermountain/wtf_wikipedia/pull/507
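
To illustrate how a repetition limit can drop a long link, here is a minimal sketch. The two patterns and the limits of 60 and 500 are assumptions made up for this example; they are not the regexes or values used in wtf_wikipedia or in the PR. `parseLink` is likewise a hypothetical helper.

```javascript
// Hypothetical patterns for illustration only (NOT the library's real regexes).
// Bounded: target and label are each capped at 60 characters (assumed limit).
const boundedLink = /\[\[([^\[\]|]{1,60})(?:\|([^\[\]]{1,60}))?\]\]/
// Relaxed: same shape with a larger cap of 500 characters (also assumed).
const relaxedLink = /\[\[([^\[\]|]{1,500})(?:\|([^\[\]]{1,500}))?\]\]/

// The Andrea Bocelli example from this issue: the target plus anchor is well over 60 characters.
const longLink =
  '[[Best-selling Christmas/holiday albums in the United States' +
  '#Best-selling Christmas/holiday albums since Nielsen SoundScan tracking began' +
  '|best-selling holiday albums]]'

// Hypothetical helper: returns { page, anchor, text }, or null when the regex finds no link.
function parseLink(regex, wikitext) {
  const m = wikitext.match(regex)
  if (!m) return null
  const target = m[1]
  const hash = target.indexOf('#')
  const page = hash === -1 ? target : target.slice(0, hash)
  const anchor = hash === -1 ? null : target.slice(hash + 1)
  // With no "|label" part, the visible text falls back to the target itself.
  return { page, anchor, text: m[2] !== undefined ? m[2] : target }
}

const bounded = parseLink(boundedLink, longLink) // no match: the target exceeds the cap
const relaxed = parseLink(relaxedLink, longLink) // match: page, anchor and text are recovered
console.log(bounded, relaxed)
```

In this sketch the bounded pattern never matches, because no `|` or `]]` appears within 60 characters of the opening `[[`, so the link would be left as raw text. Raising the cap keeps the quantifier bounded while still accepting long targets like the ones above.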

spencermountain commented 1 year ago

great job! Thank you Marcelo. Please don't hesitate to make any additional changes you'd like. Will add this to the next release. cheers