internetarchive / iari

Import workflows for the Wikipedia Citations Database
GNU General Public License v3.0

Bug: URLs being mis-parsed from wikitext #878

Closed mojomonger closed 1 year ago

mojomonger commented 1 year ago

The following URLs produce status codes of 0 with check-url.

  1. They all have suspicious "outlier" characters in their urls, such as "|" or "}}".
  2. These can be attributed to errors when parsing the URLs from the underlying wikitext.
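A minimal sketch of how such outlier characters could be flagged before a URL is passed on for checking. The helper name and the marker list are assumptions made for illustration, based only on the characters named above; this is not code from IARI.

```python
# Hypothetical marker list: the wikitext-looking fragments named in this issue.
SUSPICIOUS_MARKERS = ("|", "{{", "}}")


def find_suspicious_markers(url):
    """Return the wikitext-looking markers that appear inside a URL string."""
    return [marker for marker in SUSPICIOUS_MARKERS if marker in url]


if __name__ == "__main__":
    # A mis-parsed URL of the kind reported in this issue.
    print(find_suspicious_markers(
        "https://archive.org/details/rockartofeasteri0000leeg|last="
    ))
    # The same URL without the trailing wikitext.
    print(find_suspicious_markers(
        "https://archive.org/details/rockartofeasteri0000leeg"
    ))
```

A non-empty result would suggest the URL picked up part of the surrounding template during extraction.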

dpriskorn commented 1 year ago

Nice find! We need more tests of the URL parsing, it seems 😀

mojomonger commented 1 year ago

Consider this bug part of those tests! I can give you the wikitext that each of these parsing errors derives from, if that helps.

dpriskorn commented 1 year ago

Yes, that helps; my tests are based on wikicode.

mojomonger commented 1 year ago

@dpriskorn Here are some wikitexts for the problem URLs. I omitted the first one, as it does not seem to exhibit the same behavior.

http://www.jps.auckland.ac.nz/document/Volume_104_1995/Volume_104%2C_No._3/Preliminary_evidence_for_cosmogonic_texts_in_Rapanui%26apos%3Bs_Rongorongo_inscriptions%2C_by_Steven_Roger_Fischer%2C_p_303-322/p1}}

* {{cite journal|last= Fischer|first= Steven Roger|year= 1995|title= Preliminary Evidence for Cosmogonic Texts in Rapanui's Rongorongo Inscriptions|journal= Journal of the Polynesian Society |issue=104|pages=303–21|url=http://www.jps.auckland.ac.nz/document/Volume_104_1995/Volume_104%2C_No._3/Preliminary_evidence_for_cosmogonic_texts_in_Rapanui%26apos%3Bs_Rongorongo_inscriptions%2C_by_Steven_Roger_Fischer%2C_p_303-322/p1}}

https://archive.org/details/akuakusecretofea00heye|url-access=registration}}

* {{cite book|last=Heyerdahl|first=Thor |year=1958 |title=Aku-Aku; The 1958 Expedition to Easter Island.|publisher=Chicago, Rand McNally |url=https://archive.org/details/akuakusecretofea00heye|url-access=registration}}

This one is being "doubly parsed", as the correct URL is also extracted. https://archive.org/details/islandatcenterof00seba|

* {{cite book| last=Englert|first=Sebastian F. |year=1970|title=Island at the Center of the World| url=https://archive.org/details/islandatcenterof00seba| url-access=registration|location=New York|publisher=Charles Scribner's Sons}}

https://archive.org/details/rockartofeasteri0000leeg|last=

* {{cite book|url= https://archive.org/details/rockartofeasteri0000leeg|last= Lee|first= Georgia|year= 1992|title= The Rock Art of Easter Island. Symbols of Power, Prayers to the Gods|location= Los Angeles|publisher= The Institute of Archaeology Publications|isbn= 978-0917956744}}
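For comparison with the mis-parsed values above, here is a rough sketch of what a correct extraction would return for templates like these. It reads the `url` parameter by splitting a single, non-nested template on `|`. The function name is hypothetical and this is not how IARI or mwparserfromhell parses templates; it would truncate any URL that itself contains a literal pipe.

```python
def extract_url_param(wikitext):
    """Return the value of the url= parameter of the first template, or None.

    Naive sketch: assumes one template with no nested templates or links.
    """
    start = wikitext.find("{{")
    end = wikitext.rfind("}}")
    if start == -1 or end == -1 or end < start:
        return None
    # Parameters are separated by "|"; the first field is the template name.
    fields = wikitext[start + 2:end].split("|")
    for field in fields[1:]:
        name, sep, value = field.partition("=")
        if sep and name.strip() == "url":
            return value.strip()
    return None


if __name__ == "__main__":
    sample = (
        "* {{cite book|url= https://archive.org/details/rockartofeasteri0000leeg"
        "|last= Lee|first= Georgia|year= 1992}}"
    )
    print(extract_url_param(sample))
```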

mojomonger commented 1 year ago

I was thinking: even though the pipe character is ALLOWED in the HTTP URL specification (https://developers.google.com/maps/url-encoding#:~:text=If%20you%20use%20a%20pipe,properly%20escaped%20for%20your%20platform.), we really should exclude it.

We can tag it as a malformed URL: even if it were "acceptable", we could call it abnormal and warn our editors that if they are using a pipe in a URL, they should change it.

This way we can include "|" (the pipe character) as part of our regex parsing boundary.
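A small sketch of what treating the pipe as a boundary could look like. The pattern below is an assumption written for illustration, not the regex IARI actually uses: it stops a URL at whitespace, `|`, curly braces, angle brackets, and quotes.

```python
import re

# Assumed pattern: "|" and "{", "}" are excluded, so they end the match.
URL_PATTERN = re.compile(r"https?://[^\s|{}<>\"']+")


def find_urls(wikitext):
    """Return all URL-like substrings, stopping at wikitext delimiters."""
    return URL_PATTERN.findall(wikitext)


if __name__ == "__main__":
    sample = (
        "{{cite book| last=Englert|first=Sebastian F. |year=1970"
        "| url=https://archive.org/details/islandatcenterof00seba"
        "| url-access=registration}}"
    )
    print(find_urls(sample))
```

The trade-off is the one described above: a URL that legitimately contains an unencoded pipe would be cut short, so it would need to be flagged rather than silently truncated.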

dpriskorn commented 1 year ago

Have you checked if enforce encoding of this char? It interferes with the template parsing I'm guessing.

mojomonger commented 1 year ago

what do you mean by: "if enforce encoding of this char" ?

dpriskorn commented 1 year ago

Whether Wikipedia gives the user an error when they try to insert URLs with that character in a template, or helps them encode it.

dpriskorn commented 1 year ago

Found the bug causing this 😀

dpriskorn commented 1 year ago

PR is now done. I'm waiting for the CI to complete. I did not modify the regex, as it was not the cause of the bug. We defer to mwparserfromhell to extract the comments, and that does not seem to work right now. More investigation is needed. For now I simply added a limitation to the README for the article endpoint: we do not currently support extracting URLs from comments.
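To illustrate the limitation, here is a rough standard-library sketch of pulling URLs out of HTML comments in wikitext. It does not use mwparserfromhell and is not the approach taken in the PR; both patterns and the function name are assumptions for illustration only.

```python
import re

# Assumed patterns: HTML comments, and URLs bounded by wikitext delimiters.
COMMENT_PATTERN = re.compile(r"<!--(.*?)-->", re.DOTALL)
URL_PATTERN = re.compile(r"https?://[^\s|{}<>\"']+")


def urls_in_comments(wikitext):
    """Return the URLs that appear inside <!-- ... --> comments."""
    urls = []
    for comment in COMMENT_PATTERN.findall(wikitext):
        urls.extend(URL_PATTERN.findall(comment))
    return urls


if __name__ == "__main__":
    sample = "Text <!-- see https://example.org/a --> and https://example.org/b"
    print(urls_in_comments(sample))
```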

dpriskorn commented 1 year ago

I'm missing an article url to test this on. Do you have one?