Closed mojomonger closed 1 year ago
Nice find! It seems we need more tests of the URL parsing 😀
Consider this bug part of those tests! I can give you the wikitext that each of these parsing errors derives from, if that helps.
Yes, that helps; my tests are based on wikicode.
@dpriskorn Here are some wikitexts for the problem URLs. I omitted the first one, as it does not seem to exhibit the same behavior.
* {{cite journal|last= Fischer|first= Steven Roger|year= 1995|title= Preliminary Evidence for Cosmogonic Texts in Rapanui's Rongorongo Inscriptions|journal= Journal of the Polynesian Society |issue=104|pages=303–21|url=http://www.jps.auckland.ac.nz/document/Volume_104_1995/Volume_104%2C_No._3/Preliminary_evidence_for_cosmogonic_texts_in_Rapanui%26apos%3Bs_Rongorongo_inscriptions%2C_by_Steven_Roger_Fischer%2C_p_303-322/p1}}
https://archive.org/details/akuakusecretofea00heye\|url-access=registration}}'>
* {{cite book|last=Heyerdahl|first=Thor |year=1958 |title=Aku-Aku; The 1958 Expedition to Easter Island.|publisher=Chicago, Rand McNally |url=https://archive.org/details/akuakusecretofea00heye|url-access=registration}}
This one is being "doubly parsed", as the correct URL is also extracted. https://archive.org/details/islandatcenterof00seba|
* {{cite book| last=Englert|first=Sebastian F. |year=1970|title=Island at the Center of the World| url=https://archive.org/details/islandatcenterof00seba| url-access=registration|location=New York|publisher=Charles Scribner's Sons}}
https://archive.org/details/rockartofeasteri0000leeg|last=
* {{cite book|url= https://archive.org/details/rockartofeasteri0000leeg|last= Lee|first= Georgia|year= 1992|title= The Rock Art of Easter Island. Symbols of Power, Prayers to the Gods|location= Los Angeles|publisher= The Institute of Archaeology Publications|isbn= 978-0917956744}}
I was thinking: even though the pipe character is ALLOWED in the HTTP URL specification (https://developers.google.com/maps/url-encoding#:~:text=If%20you%20use%20a%20pipe,properly%20escaped%20for%20your%20platform.), we really should exclude it.
We can tag it as a malformed URL: even if it were "acceptable", we could call it abnormal and warn editors who use a pipe in a URL that they should change it.
This way we can include "|" (pipe character) as part of our regex parsing boundary.
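To illustrate the boundary idea, here is a minimal sketch. It assumes a simple character-class approach; `URL_PATTERN` and `extract_urls` are hypothetical names, and this is not the regex the project actually uses. The pattern stops a URL at the first whitespace, pipe, brace, bracket, angle bracket, or quote.

```python
import re

# Hypothetical pattern for illustration only (not the project's regex):
# a URL runs from the scheme up to the first whitespace, pipe, brace,
# square bracket, angle bracket, or quote character.
URL_PATTERN = re.compile(r"https?://[^\s|{}\[\]<>\"']+")


def extract_urls(wikitext):
    """Return every http(s) URL found in the wikitext, in order of appearance."""
    return URL_PATTERN.findall(wikitext)


sample = (
    "{{cite book|last=Heyerdahl|first=Thor |year=1958 "
    "|url=https://archive.org/details/akuakusecretofea00heye"
    "|url-access=registration}}"
)
print(extract_urls(sample))
# ['https://archive.org/details/akuakusecretofea00heye']
```

The trade-off is that a URL that legitimately contains an unencoded pipe would be cut short at the pipe, which is why flagging such URLs as malformed goes hand in hand with this boundary.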
Have you checked if enforce encoding of this char? It interferes with the template parsing I'm guessing.
What do you mean by "if enforce encoding of this char"?
That Wikipedia gives the user an error when they try to insert URLs with that character in a template, or helps them encode it.
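For reference, "encoding" the character here would mean percent-encoding the pipe as `%7C`. The sketch below only shows what that transformation looks like; `encode_pipes` is a hypothetical helper, and whether Wikipedia enforces or assists with this is exactly the open question.

```python
from urllib.parse import quote


def encode_pipes(url):
    """Percent-encode only the pipe character, leaving the rest of the URL untouched."""
    # quote("|") yields "%7C"; other characters are deliberately not re-encoded
    # so that existing escapes such as %2C are not double-encoded.
    return url.replace("|", quote("|"))


print(encode_pipes("http://example.com/page?ids=1|2|3"))
# http://example.com/page?ids=1%7C2%7C3
```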
Found the bug causing this 😀
The PR is now done; I'm waiting for the CI to complete. I did not modify the regex, as it was not the cause of the bug. We defer to mwparserfromhell to extract the comments, and that does not seem to work right now. More investigation is needed. For now I simply added a limitation to the readme for the article endpoint: we do not currently support extracting URLs from comments.
I'm missing an article URL to test this on. Do you have one?
The following URLs produce status codes of 0 with check-url.
[ ] http://gallica.bnf.fr/ark:/12148/bpt6k34409r/f228.image.r=Le%20Tour%20du%20monde%20%28Paris%201860%29\|title=Voyage
[ ] http://www.jps.auckland.ac.nz/document/Volume_104_1995/Volume_104%2C_No._3/Preliminary_evidence_for_cosmogonic_texts_in_Rapanui%26apos%3Bs_Rongorongo_inscriptions%2C_by_Steven_Roger_Fischer%2C_p_303-322/p1}}">
[ ] https://archive.org/details/akuakusecretofea00heye\|url-access=registration}}'>
[ ] https://archive.org/details/islandatcenterof00seba|
[ ] https://archive.org/details/rockartofeasteri0000leeg\|last=