remy / inliner

Node utility to inline images, CSS and JavaScript for a web page - useful for mobile sites
MIT License
1.1k stars 165 forks source link

Support for manually specifying source encoding. #60

Closed m000 closed 9 years ago

m000 commented 9 years ago

Cheeriojs seems to have problem with international character sets. E.g. for some older pages in Greek all characters are converted to "�" in the output. This patch adds a -e option which allows specifying the source character set. The content is transcoded from the specified character set to utf-8 in the output.

remy commented 9 years ago

Few things that need fixing:

m000 commented 9 years ago

I will add a test for the new functionality.

Are the broken (existing) tests related to my changes? The log shows that there are some 404 errors when retrieving nodejs to create the test environment.

Minimist was introduced because the custom argument parsing code seems to only support flags. Adding support for options (e.g. "-e windows-1253") while maintaining the readability of the argument parsing doesn't seem to be easy. Moreover, we would be reinventing the wheel.

remy commented 9 years ago

Are the broken (existing) tests related to my changes

Yes. See the test results in the PR.

Re: encoding option - no one in their right mind knows all the 30 or so options that this can be. It should be automatically detected in the inlining process since the output will always be utf-8. Which means that the meta charset should be read before the cherrio parsing, and the user option should be removed and minimist can be removed (since it won't be required).

m000 commented 9 years ago

Can you give me some hints where to look to fix the iojs tests?

npm claims that this is most likely a problem with iconv. But I don't see any information on what's wrong with iconv. The errors start with the failed download of https://nodejs.org/dist/v1.8.4/node-v1.8.4.tar.gz (HTTP 404).

m000 commented 9 years ago

Regarding choosing the proper encoding option. You don't need to know all the codepages supported by iconv. You usually know in advance what to try for the page you want to download. E.g. for pages in greek written in the pre-utf-8 era, the choices were iso-8859-7 or windows-1253.

I agree that automatically picking the charset from the meta header would be cool and more user friendly. But let's leave that for a follow-up patch. I can raise a new issue as a reminder if you want.

Moreover, relying on the meta header is not foolproof. The value in the meta header may be wrong (I remember pages that claim to be latin1 in the header, while they were windows-1253). Or the autodetection code may not work perfectly (e.g. in the presence of malformed html tags). So, being able to override any automatic detection is still desirable. We can call -e an "expert option" if that sounds any better :).

remy commented 9 years ago

Being able to override is fine, but it should default to detection first.

I'm also wary that nearly all the issues on the iconv project are related to installs (which is what you're seeing failing in the pre-existing tests): https://github.com/bnoordhuis/node-iconv/issues

I've not tried it personally, but have you considered using https://github.com/ashtuchkin/iconv-lite instead? There's no node-gyp build step required and apparently covers most of what you'd want too.

m000 commented 9 years ago

I don't like the idea of automatic charset detection in Inliner. IMHO this should have been handled transparently by cheerio js. Inliner should just provide an override to the default cheerio behaviour. And if we add automatic charset detection and conversion, we will need to add a way to both override it with a custom encoding or turn it off completely. Until someone really needs the detection/conversion to happen automatically, providing it only as an override seems better to me.

If you find automatic detection/conversion an absolute "must have", I can implement this fix which uses charset and jschardet modules: http://stackoverflow.com/a/18712021/277172

Implementing character set detection from scratch is something I don't have the time to do (and debug) right now.

remy commented 9 years ago

I agree with everything you've said, but

providing it only as an override seems better to me

By "better" you mean "easier". And that's what I'm avoiding here.

The bigger problem is that I don't have any tests showing this fails right now. Inliner is setup to make it extremely easy to add failing tests. I think you could at least add that showing the greek character encoding failing (that you mentioned earlier IIRC).

My aim for all projects I develop is: simpler > more options.

But yeah, the downside is that it's more work on the developer.

m000 commented 9 years ago

The bigger problem is that I don't have any tests showing this fails right now. Inliner is setup to make it extremely easy to add failing tests. I think you could at least add that showing the greek character encoding failing (that you mentioned earlier IIRC).

I have added the test and the proper result in 7b5b43bf8f4f00ae6381de3fc506f7808d587218. The test result has been updated in a7758244b592f34039c20a62b8710baa8a606c87 so that it doesn't include html entities instead of utf-8 characters. a7758244b592f34039c20a62b8710baa8a606c87 was required because of the switch to iconv-lite - regular iconv worked without the changes introduced by this patch.

If you omit the -e option, all the greek characters are converted to "�" (displayed "�"). Feel free to try with the supplied test case.

remy commented 9 years ago

I've added to the PR allowing for automatic detection of encoding (with tests), and rewritten your commits as per the contributing doc.

Thanks for the PR.

m000 commented 9 years ago

Great. Thanks! I wasn't sure if you were waiting for me to implement this or not. Anyway, now that this feature is in place I'll work on the UA and image placeholder PR.