sporkmonger / addressable

Addressable is an alternative implementation to the URI implementation that is part of Ruby's standard library. It is flexible, offers heuristic parsing, and additionally provides extensive support for IRIs and URI templates.
Apache License 2.0

normalize issue when face special character #455

Closed calvinsugianto closed 1 year ago

calvinsugianto commented 2 years ago

Hello, I ran into an issue when parsing a URL that contains a special character such as é. With Addressable::URI.parse(url).normalize, the é is changed into %C3%A9, and this caused an error.

What I need is for it to be converted into the UTF-8 format e%CC%81.

Is this possible with this gem?

example url: https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Boisé-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg

Instead of the current result:

https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Bois%C3%A9-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg -> wrong

it should become this UTF-8 format: https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Boise%CC%81-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg -> correct

dentarg commented 1 year ago

No, it is not possible.

https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Bois%C3%A9-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg is correct

As an example, that is what I get back from Google Chrome if I enter the url with é in the path: https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Bois%C3%A9-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg (when I copy the URL)

From https://github.com/ruby/uri/issues/40#issuecomment-1436138080

No, a URI path is not allowed to contain arbitrary UTF-8 characters. Non-ASCII UTF-8 characters must be percent encoded, and even some ASCII characters must be percent encoded.

And if you try to parse that URL using Ruby's uri library, it raises an error:

irb(main):017:0> URI("https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Boisé-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg")
/Users/dentarg/.arm64_rubies/3.2.2/lib/ruby/3.2.0/uri/rfc3986_parser.rb:20:in `split': URI must be ascii only "https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Bois\u00E9-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg" (URI::InvalidURIError)
    from /Users/dentarg/.arm64_rubies/3.2.2/lib/ruby/3.2.0/uri/rfc3986_parser.rb:71:in `parse'
    from /Users/dentarg/.arm64_rubies/3.2.2/lib/ruby/3.2.0/uri/common.rb:193:in `parse'
    from /Users/dentarg/.arm64_rubies/3.2.2/lib/ruby/3.2.0/uri/common.rb:722:in `URI'
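
The failure above, and the percent-encoded form that the standard library does accept, can be reproduced with a shorter URL. This is a minimal sketch using only Ruby's standard library; the example.com URL is a hypothetical stand-in for the one in the thread, and the gsub-based encoding is an illustration, not what Addressable does internally.

```ruby
require "uri"
require "erb"

# Hypothetical URL for illustration; "\u00E9" is the precomposed letter é.
raw = "https://example.com/uploads/Bois\u00E9.jpg"

# URI() rejects any string that is not ASCII-only.
error = begin
  URI(raw)
  nil
rescue URI::InvalidURIError => e
  e
end
p error.class # => URI::InvalidURIError

# Percent-encode only the non-ASCII characters, leaving the URL structure alone.
encoded = raw.gsub(/[^[:ascii:]]/) { |char| ERB::Util.url_encode(char) }
p encoded           # => "https://example.com/uploads/Bois%C3%A9.jpg"
p URI(encoded).path # => "/uploads/Bois%C3%A9.jpg"
```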

maxime-carbonneau commented 3 months ago

I would like to re-open this issue. I think there is a misunderstanding.

There are (at least) two ways to represent the letter « é »:

  1. The character itself, « é », which is represented by the code point 233 (U+00E9)
  2. The sequence « e » + a combining acute accent, which is represented by the code points 101 + 769 (U+0065 U+0301)

That is how « http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Côté-2.0-M.jpg » is correctly converted to « http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Co%CC%82te%CC%81-2.0-M.jpg »
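
The two representations can be inspected directly in Ruby. The following is a minimal sketch using only the standard library (String#codepoints, String#unicode_normalize, and ERB::Util.url_encode); the variable names are illustrative. It shows that both percent-encoded forms are valid UTF-8 and differ only in Unicode normalization form (NFC versus NFD).

```ruby
require "erb"

# Two spellings of the same visible letter.
precomposed = "\u00E9"  # a single code point (NFC form)
decomposed  = "e\u0301" # "e" followed by U+0301 COMBINING ACUTE ACCENT (NFD form)

p precomposed.codepoints # => [233]
p decomposed.codepoints  # => [101, 769]

# Percent-encoding works on the UTF-8 bytes, so the two spellings encode differently.
p ERB::Util.url_encode(precomposed) # => "%C3%A9"
p ERB::Util.url_encode(decomposed)  # => "e%CC%81"

# Unicode normalization converts between the two forms.
p precomposed.unicode_normalize(:nfd) == decomposed # => true
p decomposed.unicode_normalize(:nfc) == precomposed # => true
```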

dentarg commented 3 months ago

Google Chrome is converting http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Côté-2.0-M.jpg to http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-C%C3%B4t%C3%A9-2.0-M.jpg for me

maxime-carbonneau commented 3 months ago

The link is an image coming from http://ferrisson.com/pierre-paul-cote-csq/

According to the inspector in Google Chrome, the link should be converted to http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Co%CC%82te%CC%81-2.0-M.jpg

Screenshot, 2024-08-23 at 16 29 07

I am also posting the Safari Inspector view, since the conversion is more obvious there.

Screenshot, 2024-08-23 at 16 27 17

dentarg commented 3 months ago

That how « http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Côté-2.0-M.jpg » is correctly convert ...

I copied from your message here on GitHub when I got http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-C%C3%B4t%C3%A9-2.0-M.jpg

I can also see the source code on http://ferrisson.com/pierre-paul-cote-csq/ referencing http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Côté-2.0-M.jpg, and when I enter that into the address bar in Google Chrome, the image loads. If I then copy the URL from the address bar, it is http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Co%CC%82te%CC%81-2.0-M.jpg

dentarg commented 3 months ago

I copied from your message here on GitHub when I got http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-C%C3%B4t%C3%A9-2.0-M.jpg

To be clearer: right-clicking the URL and choosing "Copy Link Address" gave me http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-C%C3%B4t%C3%A9-2.0-M.jpg
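
One plausible reading of the two different results is that the copied strings look identical but differ in normalization form. The sketch below assumes (the thread does not confirm this) that the text copied from GitHub is precomposed and the text in the page source is decomposed. It uses the standard-library method String#unicode_normalized? to tell them apart; the variable names are illustrative.

```ruby
require "erb"

# Assumed inputs: the same visible word in two normalization forms.
from_github = "C\u00F4t\u00E9"     # precomposed ô and é (assumption)
from_page   = "Co\u0302te\u0301"   # base letters plus combining accents (assumption)

p from_github.unicode_normalized?(:nfc) # => true
p from_page.unicode_normalized?(:nfc)   # => false
p from_page.unicode_normalized?(:nfd)   # => true

# Each form percent-encodes to one of the two URLs seen in the thread.
p ERB::Util.url_encode(from_github) # => "C%C3%B4t%C3%A9"
p ERB::Util.url_encode(from_page)   # => "Co%CC%82te%CC%81"
```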

dentarg commented 3 months ago

Anyway, even if Chrome supports more representations, I'm not sure we can do that in Addressable (see the previous comments in the thread).