pr0gramista / charset_converter

Flutter platform charset converter
BSD 3-Clause "New" or "Revised" License

Result of decode http response is null #10

Closed ghost closed 3 years ago

ghost commented 3 years ago

I am trying to decode an HTTP response's bodyBytes as UTF-8 on the iOS simulator (iPhone 12 Pro Max, iOS Deployment Target = 9.0), but the result of decode is null. The same code works on the Android emulator, so I believe this issue is specific to iOS.

Could you please check this code and, if possible, suggest a solution?

import 'dart:async';
import 'dart:typed_data';
import 'package:flutter/material.dart';
import 'package:flutter/services.dart';
import 'package:html/dom.dart' as dom;
import 'package:http/http.dart' as http;
import 'package:charset_converter/charset_converter.dart';
import 'package:flutter_user_agent/flutter_user_agent.dart';
// skip

String userAgent;
try {
  userAgent = await FlutterUserAgent.getPropertyAsync('userAgent');
  print("userAgent: ${userAgent}");
} on PlatformException {
  userAgent = '<error>';
}
var response = await http.Client().get(Uri.parse("http://news4vip.livedoor.biz/archives/52385788.html"), headers: {'User-Agent': userAgent});
print("Response status: ${response.statusCode}");
print("response.headers: ${response.headers['content-type']}");
String decoded_body_byte = await CharsetConverter.decode("UTF-8", response.bodyBytes);
print("decoded_body_byte: ${decoded_body_byte}"); // This result is null. This is issue.
Uint8List encoded = await CharsetConverter.encode("UTF-8", "【画像】中日「かっこいい」今季のユニホーム発表www");
print("encoded.length: ${encoded.length}");
String decoded_body_byte_only_title = await CharsetConverter.decode("UTF-8", response.bodyBytes.sublist(71, 71 + 78));
print("decoded_body_byte_only_title: ${decoded_body_byte_only_title}");

The code above produces the following output:

2021-01-23 17:09:29.964984+0900 Runner[89036:14458916] flutter: userAgent: CFNetwork/1209 Darwin/20.2.0 (iPhone iOS/14.3)
2021-01-23 17:09:30.187131+0900 Runner[89036:14458916] flutter: Response status: 200
2021-01-23 17:09:30.190547+0900 Runner[89036:14458916] flutter: response.headers: text/html; charset=utf-8
2021-01-23 17:09:30.195755+0900 Runner[89036:14458916] flutter: decoded_body_byte: null
2021-01-23 17:09:30.197368+0900 Runner[89036:14458916] flutter: encoded.length: 78
2021-01-23 17:09:30.198128+0900 Runner[89036:14458916] flutter: decoded_body_byte_only_title: 【画像】中日「かっこいい」今季のユニホーム発表www

Below is the output of flutter doctor. I would appreciate any help you can give.

% flutter doctor
Doctor summary (to see all details, run flutter doctor -v):
[✓] Flutter (Channel stable, 1.22.5, on macOS 11.1 20C69 darwin-x64, locale ja-JP)

[!] Android toolchain - develop for Android devices (Android SDK version 29.0.2)
    ! Some Android licenses not accepted.  To resolve this, run: flutter doctor --android-licenses
[✓] Xcode - develop for iOS and macOS (Xcode 12.3)
[!] Android Studio (version 4.1)
    ✗ Flutter plugin not installed; this adds Flutter specific functionality.
    ✗ Dart plugin not installed; this adds Dart specific functionality.
[✓] VS Code (version 1.52.1)
[✓] Connected device (2 available)

! Doctor found issues in 2 categories.

pr0gramista commented 3 years ago

Hi, I think there is an error in how you are using decode. decode expects its first parameter to be the encoding of the input data, and the site you are downloading is encoded in euc-jp. You can see this in the Content-Type header or in the meta tag in the HTML.

The output of decode is a String, which in Dart is a sequence of UTF-16 code units.

I think using it like this gives the expected output.

var response = await http.Client()
    .get(Uri.parse("http://news4vip.livedoor.biz/archives/52385788.html"));

print("Status: ${response.statusCode}");
// Content-Type: text/html; charset=euc-jp
print("Content-Type: ${response.headers['content-type']}");

String decoded_body_byte =
    await CharsetConverter.decode("euc-jp", response.bodyBytes);
print("Decoded: ${decoded_body_byte}");

Output:

flutter: Status: 200
flutter: Content-Type: text/html; charset=euc-jp
flutter: Decoded: <?xml version="1.0" encoding="EUC-JP"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" id="ldblog-standard">
<head>
<meta name="google-site-verification" content="0yUUhZOYqVOrcYBJ-Tw1lYw-D7kiorCn-4kDbnhK-ac" />
<meta http-equiv="Content-Type" content="text/html; charset=euc-jp" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta http-equiv="Content-Script-Type" content="text/javascript" /><link rel="shortcut icon" type="image/vnd.microsoft.icon" href="https://livedoor.blogimg.jp/news4vip2/imgs/2/b/favicon.ico" /><link rel="icon" href="https://livedoor.blogimg.jp/news4vip2/imgs/2/b/2b6a2183.png" />
<link rel="stylesheet" href="https://parts.blog.livedoor.jp/css/template.css?v=20190826" type="text/css" />
<link rel="stylesheet" href="https://parts.blog.livedoor.jp/css/comment2/heart.css?v=20180704" type="text/css" />
<link rel="stylesheet" <…>
pr0gramista commented 3 years ago

I see that Android (with 'utf-8' passed) decodes most of the HTML but gives up on the Japanese characters. This is somewhat expected, as these decoders usually try their best even when given bad input.

<div class="sidebody"><a href="http://5chmm.jp/">5ch�ޤȤ�ΤޤȤ�</a></div>

Passing euc-jp fixes it too.
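
Rather than hard-coding the encoding name, it could be read from the Content-Type header at runtime. The following is only a minimal sketch: charsetFromContentType is a hypothetical helper (not part of charset_converter or http), and the utf-8 fallback is an assumption.

```dart
/// Hypothetical helper: extracts the charset parameter from a Content-Type
/// header value such as "text/html; charset=euc-jp".
/// Returns [fallback] (assumed default: utf-8) when no charset is present.
String charsetFromContentType(String contentType, {String fallback = 'utf-8'}) {
  // Accepts optional quotes around the value and ignores letter case.
  final match =
      RegExp(r'''charset\s*=\s*["']?([^"';\s]+)''', caseSensitive: false)
          .firstMatch(contentType);
  if (match == null) return fallback;
  return (match.group(1) ?? fallback).toLowerCase();
}

void main() {
  print(charsetFromContentType('text/html; charset=euc-jp'));
  print(charsetFromContentType('text/html; charset=UTF-8'));
  print(charsetFromContentType('text/html'));
}
```

The result could then be passed as the first argument of CharsetConverter.decode, with response.headers['content-type'] ?? '' as the input. The header is only as trustworthy as the server that sends it, so the decoded text is still worth checking.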

ghost commented 3 years ago

Sorry, my explanation was incomplete. This web page changes its response depending on the User-Agent. Could you please recheck this issue after setting the User-Agent to CFNetwork/1209 Darwin/20.2.0 (iPhone iOS/14.3)? You should then get response.headers['content-type'] = text/html; charset=utf-8 and be able to reproduce my error.

pr0gramista commented 3 years ago

Yeah, I see the problem. The site has some malformed characters and, as we can see, the iOS decoder does not like them. I'll try to fix that later by adding an option to ignore such characters.

However, if you are only decoding UTF-8, you may not need this package at all. Dart has Utf8Decoder (in dart:convert), which also throws an error on malformed input, unless you pass true to allowMalformed, like this:

final decoded = Utf8Decoder(allowMalformed: true).convert(response.bodyBytes);
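
To illustrate the difference, here is a small self-contained sketch. The three sample bytes are made up for the example (they are not taken from the page in question), and strictDecodeFails and decodeLenient are hypothetical helper names.

```dart
import 'dart:convert';

/// Returns true if strict UTF-8 decoding rejects [bytes].
bool strictDecodeFails(List<int> bytes) {
  try {
    utf8.decode(bytes); // strict by default: throws on malformed input
    return false;
  } on FormatException {
    return true;
  }
}

/// Decodes [bytes] as UTF-8, replacing malformed sequences with U+FFFD.
String decodeLenient(List<int> bytes) {
  return const Utf8Decoder(allowMalformed: true).convert(bytes);
}

void main() {
  // 'A', a stray 0xFF byte (never valid in UTF-8), then 'B'.
  final bytes = <int>[0x41, 0xFF, 0x42];

  print(strictDecodeFails(bytes)); // strict decoding rejects the input
  print(decodeLenient(bytes) == 'A\uFFFDB'); // lenient decoding substitutes U+FFFD
}
```

The lenient decoder does not recover the original characters. It only swaps each malformed sequence for the replacement character so that the rest of the text can be read.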

I also tracked down malformed characters:

<div id="ad" style="display:block !important;">

<!-- �칭��  -->

<script type="text/javascript">
  if (window['header_cd'] && window['showed_header_ad'] != 1) {
    show_ad(header_cd);
    showed_header_ad = 1;
  }
</script>
ghost commented 3 years ago

I was able to decode the target web page using the line you suggested: final decoded = Utf8Decoder(allowMalformed: true).convert(response.bodyBytes); I'm also looking forward to your update to this plugin.

Thank you for your help!