shlomif / perl-XML-LibXML

The XML-LibXML CPAN Distribution for Processing XML using the libxml2 library
https://metacpan.org/release/XML-LibXML
Other
17 stars 35 forks source link

use utf8 instead of Encode #58

Closed atoomic closed 3 years ago

atoomic commented 3 years ago

utf8::encode can be used instead of the Encode::encode function in this case.

This is avoiding loading Encode where unneeded and reduces memory footprint.

Check to confirm using utf8::encode provides the same behavior:

╰─> perl -e 'my $x = "\x{100}"; use Devel::Peek; Dump $x; utf8::encode($x); Dump $x; my $y = "\x{100}"; use Encode; my $z = Encode::encode( utf8 => $y, Encode::FB_DEFAULT); Dump $z'
SV = PV(0x7fd2cf00b650) at 0x7fd2cf8119c8
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x7fd2ced0e490 "\304\200"\0 [UTF8 "\x{100}"]
  CUR = 2
  LEN = 10
  COW_REFCNT = 1
SV = PV(0x7fd2cf00b650) at 0x7fd2cf8119c8
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x7fd2ced0e490 "\304\200"\0
  CUR = 2
  LEN = 10
  COW_REFCNT = 1
SV = PV(0x7fd2cf00b7c0) at 0x7fd2cf81e278
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x7fd2d910f3c0 "\304\200"\0
  CUR = 2
  LEN = 10
Grinnz commented 3 years ago

This will result in invalid UTF-8 output if presented with noncharacters (U+FDD0), surrogates (U+D800), or codepoints outside the Unicode set (U+110000), but that was already true since Encode was being used with 'utf8' and not 'UTF-8', so it is technically true that this will not be a behavior change.

Grinnz commented 3 years ago

Personally, I don't see this as a win - the Encode usage can be improved in the future but utf8::encode cannot, and this is not a bottleneck.

atoomic commented 3 years ago

The main win is on the memory used by the process, here is a dumb/simple check on linux

╰─> perl -e 'use utf8; print qx{grep RSS /proc/$$/status}'
VmRSS:      2300 kB
╰─> perl -e 'use Encode; print qx{grep RSS /proc/$$/status}'
VmRSS:      4256 kB
toddr commented 3 years ago

@Grinnz are you suggesting this is the change we should be making here?

-      my $context = Encode::encode('utf8', $self->{context}, Encode::FB_DEFAULT);
+      my $context = Encode::encode('UTF-8', $self->{context}, Encode::FB_DEFAULT);
Grinnz commented 3 years ago

If anything, I would suggest removal of the Encode::FB_DEFAULT argument, since that is already default (as the name suggests) and specifying it explicitly in this way causes $self->{context} to get modified in place due to Encode's weird API.

I also think using the 'UTF-8' encoding in this case is a good idea though, as in most cases.

toddr commented 3 years ago

@Grinnz: If I read all of your comments correctly, you seem to be saying that this change should have no regressions. You also seem to be saying that the code in question is wrong and we should fix it. Are you up for providing a pull request for it?

Grinnz commented 3 years ago

Sure, opened as #60

atoomic commented 3 years ago

discussion moved to #60