samvera / questioning_authority

Question your authorities
Other
54 stars 30 forks source link

Replace URI.unescape with CGI.unescape #369

Closed eddierubeiz closed 2 years ago

eddierubeiz commented 2 years ago

(issue description by jrochkind)

URI.unescape does not exist anymore in ruby 3.1, after having been long-deprecated.

There are a variety of other ways in stdlib to do some kind of html-related unescape, but none have the exact same semantics.

It is hard to know which we want -- when we don't understand why this unescape is here in the first place. Why would an in-memory id in the LinkedData::FindTerm authority be in an escaped form, and need unescaping before doing... other things that honestly we're not sure what they're doing, which doesn't make it easier to know what form of escaping, if any, is necessary here.

All tests pass without any unescaping, there are no tests that require escaping, so that doesn't answer the question of why escaping is being done here, and what form it should be in.

My inclination is to think that the escaping was unnecessary or even a bug (or a workaround to a bug somewhere else?), and should simply be removed.

If that causes a problem for someone in a real-world use-case, that use case might help us understand why escaping was there in the first place, and the best way to fix it. But without a report of such a case... remove the unescape call for ruby 3.1 compatibility?

jrochkind commented 2 years ago

@eddierubeiz Can you also actually edit the COMMIT MESSAGE the commit where you remove URI.unescape, to explain the motivation/justificaition for removing it, at just a short summary or something? Something more than a commit message saying "remove URI.unescape".

i want to leave something better in the commit message for someone who comes later and is like "Why did they remove this??" -- having been in that position in the past.

You can use git rebase to rewrite commit messages I think.

dlpierce commented 2 years ago

This comment has some explanation: https://github.com/samvera/questioning_authority/pull/240/files#r282046523

jrochkind commented 2 years ago

Thanks @dlpierce , good find!

I wish I understood what that comment meant, but I don't!

Do you have any advice for what should be done here with URI.unescape no longer being available in the stdlib?

'%20' is changed to ' ', and '%2E' is changed to '.', etc.

We could write a routine that did those specifically... or do we need a routine that does any/all percent-encoding? Or what? There's a lot hiding in that "etc" when you realize how many different variants of escaping/unescaping for urls/html there are.

We can just replace with some alternate stdlib method likeCGI.unescape... it looks like that will do %20 and %2E as in that comment... assume that does the right thing?

Any advice for us would be most welcome @dlpierce

eddierubeiz commented 2 years ago

I replaced URI.unescape with CGI.unescape. I'll see if I can add a test for find_term_spec.rb to shore all of this up a bit.

jrochkind commented 2 years ago

So one way CGI.unescape differs from URI.unescape is for +.

irb(main):003:0> URI.unescape("+")
(irb):3: warning: URI.unescape is obsolete
=> "+"
irb(main):004:0> CGI.unescape("+")
=> " "

So currently, a + will be left alone by the unescaping process, but if we change to CGI.unescape, it'll be converted to a space. I still don't understand the use case of why this is being done enough to know if this might cause a problem. @dlpierce or anyone else, any thoughts? (Not sure if there are other differences too in CGI.unescape vs URI).

It seems from the comments @dlpierce found like this has something to do with LC specifically -- although this authority is not for LC specifically? It would be nice to have an automated test that actually duplicated the LC use case, but I don't think me and @eddierubeiz understand this code (which we don't use) enough to do that, I think any test we came up with would not be very realistic/reliable. I worry that we may break something for a non-LC use case... but it's also not clear to me who if anyone is using this code for what (we don't use it).

Also, I just noticed URI.unescape was already not available in ruby 3.0 -- I don't understand how tests were passing in ruby 3.0 already with URI.unescape in there, which worries me a bit!

eddierubeiz commented 2 years ago

I'm hoping the docs at https://github.com/samvera/questioning_authority/wiki/Using-the-Linked-Data-module-to-access-authorities provide with enough info to be able to figure out a decent use case that can be used in a test. I'm going to tinker with this for a bit.

eddierubeiz commented 2 years ago

I'm going to recommend we replace the three instances of URI.unescape with CGI.unescape unless anyone else has strong feelings about it.

I've spent some time configuring and using this code, and I do understand how it works somewhat better now (at least in the LOC case). I do concur that it's quite possible the first unescape is in fact not needed and could be replaced by a dup. But I don't feel comfortable doing it without spending more time tinkering with the code -- and I'm unfortunately running out of time :) .

Re: CGI.unescape versus some other (non-obsolete) escaping recipe: based on a quick survey of the form of RDF predicates at LOC and OCLC, mapping a + in the id property with a space in this class is unlikely to prevent any matches from occurring. Again it's tough to say for sure without a more robust set of tests for this class -- and, in turn, I really can't write those tests without a better understanding of real-world use-cases.