metafacture / metafacture-core

Core package of the Metafacture tool suite for metadata processing.
https://metafacture.org
Apache License 2.0

HttpOpener shall be able to handle GZIP encoding #511

dr0i closed this issue 6 months ago

dr0i commented 7 months ago

HttpOpener, aka decode-html when speaking flux, is currently not able to decode gzipped data.

This was discovered when @TobiasNx failed to look up schema.org.
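For illustration, a minimal sketch of what gzip handling could look like: if the response carries `Content-Encoding: gzip`, wrap the body stream in `java.util.zip.GZIPInputStream` before reading it. The class `GzipAwareBody` and its methods are hypothetical names for this example, not existing HttpOpener code, and only a single `gzip` token is handled (no `deflate`, no chained encodings).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of metafacture-core.
final class GzipAwareBody {

    // Wraps the body stream when the Content-Encoding header value announces
    // gzip; otherwise returns the stream unchanged.
    static InputStream decodeBody(InputStream body, String contentEncoding) throws IOException {
        if (contentEncoding != null && "gzip".equalsIgnoreCase(contentEncoding.trim())) {
            return new GZIPInputStream(body);
        }
        return body;
    }

    // Demo support: gzip-compresses a string, standing in for a server response.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Reads the whole stream and decodes it as UTF-8 text.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("{\"@context\": \"https://schema.org\"}");
        InputStream decoded = decodeBody(new ByteArrayInputStream(compressed), "gzip");
        System.out.println(readAll(decoded));
    }
}
```

With `java.net.HttpURLConnection`, the header value would come from `connection.getContentEncoding()` and the body from `connection.getInputStream()`.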

dr0i commented 7 months ago

While working on this, I noticed the line at https://github.com/metafacture/metafacture-core/blob/52e41414fff130c183151dbb64810f259548ddf8/metafacture-io/src/test/java/org/metafacture/io/HttpOpenerTest.java#L259 which seems invalid because:

The content-encoding specifies the data transfer encoding used by the issuer of the content. UTF-8 is not a content encoding, it is a character set. Specifying the character set is done in the content-type header

(https://stackoverflow.com/questions/17154967/is-content-encoding-being-set-to-utf-8-invalid)
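To make the distinction concrete, here is a rough sketch of reading the character set from a `Content-Type` header value such as `text/html; charset=ISO-8859-1`, instead of from `Content-Encoding`. `ContentTypeCharset` and `charsetOf` are made-up names for this example, and the parsing is deliberately simple (it does not cover every corner of the header grammar).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of metafacture-core.
final class ContentTypeCharset {

    // Returns the charset named in the Content-Type parameters, or the
    // fallback if the header is missing, has no charset parameter, or names
    // a charset this JVM does not know.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String[] pair = parts[i].trim().split("=", 2);
            if (pair.length == 2 && "charset".equalsIgnoreCase(pair[0].trim())) {
                String name = pair[1].trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("application/json", StandardCharsets.UTF_8));
    }
}
```

`Content-Encoding` would then only ever carry values like `gzip`, never a charset name.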

[EDIT] There seems to be a fundamental misunderstanding in HttpOpener, which treats encoding as a synonym for charset. I propose renaming the variables and methods to charset, and probably marking setEncoding(String) as deprecated.
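A possible shape for the rename, sketched on a stand-in class rather than the real HttpOpener: the old setter stays for backwards compatibility, is marked `@Deprecated`, and delegates to the new one. `OpenerSketch`, `setCharset` and `getCharset` are assumed names for this example; only `setEncoding(String)` comes from the existing API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Stand-in for illustration only; this is not the actual HttpOpener class.
final class OpenerSketch {

    private Charset charset = StandardCharsets.UTF_8;

    // New, correctly named setter (assumed name).
    public void setCharset(String charsetName) {
        this.charset = Charset.forName(charsetName);
    }

    public Charset getCharset() {
        return charset;
    }

    /**
     * Kept so that existing callers continue to work.
     *
     * @deprecated Use {@link #setCharset(String)} instead.
     */
    @Deprecated
    public void setEncoding(String encoding) {
        setCharset(encoding);
    }
}
```

Existing callers would keep compiling, with a deprecation warning pointing them at the new name.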