ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/
Other
16.15k stars 3.01k forks source link

Including charset in HTTP breaks non UTF-8 sites #2203

Closed Kubuxu closed 4 years ago

Kubuxu commented 8 years ago

Charset returned in HTTP header overrides charset in meta tag but is essential for text files and other content.

Encoding is broken on some sites like: https://ipfs.io/ipfs/QmYJxHZ5MeqKF5vbp8Wei71fNkBPdBSB3CRYSW75fA8yFE/ that use non UTF-8 encoding.

Solution would be to run simple charset meta tag detection. It could be heuristic for sake of speed.

Also good caching would be great.

rht commented 8 years ago

As of https://github.com/ipfs/go-ipfs/pull/2008, content type is already auto-detected (from the first 512 bytes through http.DetectContentType). To detect charset, however, requires 1. importing https://godoc.org/golang.org/x/net/html/charset, 2. rewriting http.ServeContent to use charset.DetermineEncoding.

Looking at the importers list, it appears that charset auto-detect hasn't been used in most of major golang web frameworks, yet is used to format the diff content in gogs: https://github.com/gogits/gogs/blob/c199703e2aa746f10ff2b584513fe3af810b6c1a/models/git_diff.go.

There should not be a major perf hit with checking the first 1kb of each content served in the gateway.

rht commented 8 years ago

I tried the charset.DetermineEncoding, but it detected the charset in https://ipfs.io/ipfs/QmYJxHZ5MeqKF5vbp8Wei71fNkBPdBSB3CRYSW75fA8yFE/ as windows-125223.

Kubuxu commented 8 years ago

ISO-8859 is equivalent to windows-1252 so it might be good.

More precisely windows-1252 is superset of ISO-8859

jbenet commented 8 years ago

any way we can not set the type and leave it up to browsers? or the golang lib?

Kubuxu commented 8 years ago

We could try detecting if there is charset declaration in then not setting charset in the HTTP header. This would solve this problem as browser will then use charset from .

Heuristic would be very simple, and probably fast.

  1. Take 1500 bytes.
  2. Truncate it to .
  3. Check if regex matches: /meta[^>]+charset=/i. Example result: http://regexr.com/3claa, http://regexr.com/3clad

Any cons of this method?

rht commented 8 years ago

Hmm: "Although this parameter is optional, we recommend that it always be present (in the Content-Type)"[1].

note: I have tested #2230 and it does render the lua page properly.

ref: [1] https://www.w3.org/TR/html4/conform.html

rht commented 8 years ago

Also, @Kubuxu, charset.DetermineEncoding does the heuristic which is based on http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding. It does look for the charset attribute as in https://github.com/golang/net/blob/master/html/charset/charset.go#L194.

Kubuxu commented 8 years ago

Awesome, my response was to jbenet's question if we can leave it up to browser.

RichardLitt commented 8 years ago

The HTTP Api seems to arbitrarily include charsets in the headers, at times, and not, at other times. Should it be included all of the time, @Kubuxu?

Kubuxu commented 8 years ago

It shouldn't be included always as it overrides setting in HTML tags and also default HTML charset is different.

RichardLitt commented 8 years ago

I guess I am confused: how would a Header response from the API override HTML tags? Does the go HTTP API ever return HTML?

Kubuxu commented 8 years ago

Ahh, this issue is about charset in Gateway, sorry for confusing you. About API: it should be included almost always AFIAK, (json implicitly defines utf-8 but there are some edge cases in both directions, including it or not, but as it is included in some places is should probably be included all the time).

RichardLitt commented 8 years ago

Makes sense. Thanks! Sorry for redirecting the flow of this issue.

djalilhebal commented 5 years ago

2019 is almost over and the issue is still not resolved.

This should be labeled as a bug since it actually "breaks" some web contents; for example, trying to mirror non-UTF-8-encoded Project Guterberg HTML books like this one: Utilitarianism (this is a just trivial example).


This is the output I got by executing curl --head "https://gateway.ipfs.io/ipfs/QmXjU4HTejtdY5bouHmXny5NMyE4RfsnXNqizcYdVusr46/books/utilitarianism.html"

HTTP/2 200 
server: nginx
date: Fri, 01 Nov 2019 20:00:13 GMT
content-type: text/html; charset=utf-8
content-length: 186678
vary: Accept-Encoding
accept-ranges: bytes
access-control-allow-methods: GET
cache-control: public, max-age=29030400, immutable
etag: "QmSFFoikBNJZmN2muk4CbpAXtCRdJQjv2ZsrKArEP587tt"
last-modified: Thu, 01 Jan 1970 00:00:01 GMT
suborigin: ipfs000bciqixel7d5xij5x4pugxurohowwtxgex2embtxb747dyeucl4el2zjy
x-ipfs-gateway-host: gateway-bank1-mrs1
x-ipfs-path: /ipfs/QmXjU4HTejtdY5bouHmXny5NMyE4RfsnXNqizcYdVusr46/books/utilitarianism.html
access-control-allow-origin: *
access-control-allow-methods: GET, POST, OPTIONS
access-control-allow-headers: X-Requested-With, Range, Content-Range, X-Chunked-Output, X-Stream-Output
access-control-expose-headers: Content-Range, X-Chunked-Output, X-Stream-Output
x-ipfs-pop: gateway-bank1-mrs1
strict-transport-security: max-age=31536000; includeSubDomains; preload
Stebalien commented 5 years ago

The fix should be pretty simple. We need to modify serveFile in github.com/ipfs/go-ipfs/core/corehttp/gateway_handler.go to do the file type detection that ServeContent normally does, then strip the charset if the content type is text/html (or maybe just strip the charset).

Want to try tackling it?

djalilhebal commented 5 years ago

@Stebalien Oh, yes, I'd like to try.