Closed Kubuxu closed 4 years ago
As of https://github.com/ipfs/go-ipfs/pull/2008, content type is already auto-detected (from the first 512 bytes through http.DetectContentType
).
To detect charset, however, requires 1. importing https://godoc.org/golang.org/x/net/html/charset, 2. rewriting http.ServeContent
to use charset.DetermineEncoding
.
Looking at the importers list, it appears that charset auto-detect hasn't been used in most of major golang web frameworks, yet is used to format the diff content in gogs: https://github.com/gogits/gogs/blob/c199703e2aa746f10ff2b584513fe3af810b6c1a/models/git_diff.go.
There should not be a major perf hit with checking the first 1kb of each content served in the gateway.
I tried the charset.DetermineEncoding
, but it detected the charset in https://ipfs.io/ipfs/QmYJxHZ5MeqKF5vbp8Wei71fNkBPdBSB3CRYSW75fA8yFE/ as windows-125223.
ISO-8859 is equivalent to windows-1252 so it might be good.
More precisely windows-1252 is superset of ISO-8859
any way we can not set the type and leave it up to browsers? or the golang lib?
We could try detecting if there is charset declaration in
then not setting charset in the HTTP header. This would solve this problem as browser will then use charset from .Heuristic would be very simple, and probably fast.
/meta[^>]+charset=/i
.
Example result: http://regexr.com/3claa, http://regexr.com/3cladAny cons of this method?
Hmm: "Although this parameter is optional, we recommend that it always be present (in the Content-Type)"[1].
note: I have tested #2230 and it does render the lua page properly.
Also, @Kubuxu, charset.DetermineEncoding
does the heuristic which is based on http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding. It does look for the charset
attribute as in https://github.com/golang/net/blob/master/html/charset/charset.go#L194.
Awesome, my response was to jbenet's question if we can leave it up to browser.
The HTTP Api seems to arbitrarily include charsets in the headers, at times, and not, at other times. Should it be included all of the time, @Kubuxu?
It shouldn't be included always as it overrides setting in HTML tags and also default HTML charset is different.
I guess I am confused: how would a Header response from the API override HTML tags? Does the go HTTP API ever return HTML?
Ahh, this issue is about charset in Gateway, sorry for confusing you. About API: it should be included almost always AFIAK, (json implicitly defines utf-8 but there are some edge cases in both directions, including it or not, but as it is included in some places is should probably be included all the time).
Makes sense. Thanks! Sorry for redirecting the flow of this issue.
2019 is almost over and the issue is still not resolved.
This should be labeled as a bug since it actually "breaks" some web contents; for example, trying to mirror non-UTF-8-encoded Project Guterberg HTML books like this one: Utilitarianism (this is a just trivial example).
This is the output I got by executing curl --head "https://gateway.ipfs.io/ipfs/QmXjU4HTejtdY5bouHmXny5NMyE4RfsnXNqizcYdVusr46/books/utilitarianism.html"
HTTP/2 200
server: nginx
date: Fri, 01 Nov 2019 20:00:13 GMT
content-type: text/html; charset=utf-8
content-length: 186678
vary: Accept-Encoding
accept-ranges: bytes
access-control-allow-methods: GET
cache-control: public, max-age=29030400, immutable
etag: "QmSFFoikBNJZmN2muk4CbpAXtCRdJQjv2ZsrKArEP587tt"
last-modified: Thu, 01 Jan 1970 00:00:01 GMT
suborigin: ipfs000bciqixel7d5xij5x4pugxurohowwtxgex2embtxb747dyeucl4el2zjy
x-ipfs-gateway-host: gateway-bank1-mrs1
x-ipfs-path: /ipfs/QmXjU4HTejtdY5bouHmXny5NMyE4RfsnXNqizcYdVusr46/books/utilitarianism.html
access-control-allow-origin: *
access-control-allow-methods: GET, POST, OPTIONS
access-control-allow-headers: X-Requested-With, Range, Content-Range, X-Chunked-Output, X-Stream-Output
access-control-expose-headers: Content-Range, X-Chunked-Output, X-Stream-Output
x-ipfs-pop: gateway-bank1-mrs1
strict-transport-security: max-age=31536000; includeSubDomains; preload
The fix should be pretty simple. We need to modify serveFile
in github.com/ipfs/go-ipfs/core/corehttp/gateway_handler.go to do the file type detection that ServeContent normally does, then strip the charset if the content type is text/html (or maybe just strip the charset).
Want to try tackling it?
@Stebalien Oh, yes, I'd like to try.
Charset returned in HTTP header overrides charset in meta tag but is essential for text files and other content.
Encoding is broken on some sites like: https://ipfs.io/ipfs/QmYJxHZ5MeqKF5vbp8Wei71fNkBPdBSB3CRYSW75fA8yFE/ that use non UTF-8 encoding.
Solution would be to run simple charset meta tag detection. It could be heuristic for sake of speed.
Also good caching would be great.