Remove wasteful safer Inet4 lookups

franz1981 commented 9 months ago

@radcortez can you check if the ones happening in clinit are the one you saw in the flamegraphs?

here

i can see many, not just the 3 ones I've fixed.

franz1981 commented 9 months ago

Ah no, the flamegraph is per line of code, so the fix here should take care of it already

radcortez commented 9 months ago

Here is the flamegraph with the current behaviour: alloc-999.html.zip

And with the change: alloc-999-sr-common-276.html.zip

franz1981 commented 9 months ago

Thanks @radcortez

so

The overall saving as "allocation bamdwidth" is ~217KB, due to removing the indy allocation, which "can" translate in more work to the JIT as shown by https://github.com/quarkusio/quarkus/issues/36204.

If you want to give this a shot you could profile with the -e malloc --total argument which should (see https://github.com/async-profiler/async-profiler/blob/8910a7a46241abc7cb5b80ba6717c3eab603aae6/src/perfEvents_linux.cpp#L194) report the size of native allocations to see if something evident has changed around the mentioned hot parts involved in JIT compiler work (see https://github.com/quarkusio/quarkus/issues/36204#issuecomment-1749129015). This will give the overall footprint saving, apart from the 217 KB reported here (which is still a just once cost).

The overall heap allocation pressure reduction between the 2 versions is ~0.8% which is tiny amount, but still measurable; with the native one you can get the global picture of the improvement, if you wish.

dmlloyd commented 9 months ago

The diff as given doesn't seem to do anything to remove an indy. Also as I said on the WF common change, I think that it's better to use new String(bytes, StandardCharsets.ISO_8859_1). @franz1981 referenced a bug (JDK-8295496) as reason why that wasn't done but the bug seemed to relate to a different ctor, relating to copyOfRange? Maybe it was the wrong #. The code path for the ctor I give above should be nearly identical to the deprecated one used in the PR.

radcortez commented 8 months ago

Yes, I confirm an improvement in allocs RSS for native as well with this PR.

It is somehow significant, but I only noticed it because it was in my analysis path, and it wasn't showing up before on previous versions. We should move forward, but we need to clarify @dmlloyd suggestion first.

dmlloyd commented 8 months ago

How are you measuring, and are you comparing against main? I believe that we are talking about a difference of three byte[4] plus six byte[7 <= n <= 15] plus three String instances allocated out of TLAB during class initialization. I would be astonished if that showed up on RSS reports or CPU time reports compared to our (thousands of) other class initializations.

I'm not arguing against the change per se, but if we're justifying any additional complexity by a measurable performance improvement, then the measurements had better show a real-world improvement that is greater than the margin of error.

franz1981 commented 8 months ago

@dmlloyd

@franz1981 referenced a bug (JDK-8295496) as reason why that wasn't done but the bug seemed to relate to a different ctor, relating to copyOfRange? Maybe it was the wrong #

Nope, sadly the JDK issue is correct, just Claes probaby didn't reported in the issue itself the full analysis which brought me to prepare the reproducer for him, but I will later attach my comment to the PR which explain something more about it.

In the original benchmark reported in the issue, you can find https://gist.github.com/franz1981/bf67ed8f328ed112a524c1150833c718#file-asciicopy-java-L73 which is indeed simplifying the life of the JIT during the generation of the array's copy stub, by having separate call-sites which:

allow the JIT to propagate the information that we always pass an array which length is what we use as String's parameter: given that this information is constant and propagated in that version of the method, it would constant-fold some (but not all! that's why the JDK issue is still opened!) the other checks
this information about array's length propagate till the stub generation which would skip these bound checks, making it as minimal as possible (although never as good the indy which is using an hidden String constuctor and not performing any - sigh - vectorized copy)

Which translate in the conclusion that having a properly sized array payoff in the already slower (if compared to indy) String creation, at 2 level:

the String checks itself
the array's copy one

Related using the String constructor which suggest the encoding type, instead, I can quote what Claes said

I also included String(byte[], int, int, ISO_8859_1) which disappointingly lags slightly behind the deprecated String(byte[], int, int, int).

Although on the PR at https://github.com/openjdk/jdk/pull/12453#discussion_r1099236698 I cannot see any mention about it in the main PR comment, but my previous tests report that one as the worst method to use; I can re-verify it, in case.

The changes at https://bugs.openjdk.org/browse/JDK-8301958 are instead providing a "better" copy method, using the same trick I've suggested, but it will improve just the copy, and not the inherent checks in String itself, which will still be performed (and that's why https://bugs.openjdk.org/browse/JDK-8295496 is still opened, despite the copy improvement!).

Said that, these are facts which should be weighted with the perceived (it's personal, but vs the original version is clearly worsen now!) increase of code complexity vs the actual gain: I've no problem to change the PR or close it and let anyone to propose a different change which still improve the RSS and native usage as this one; the runtime performance is sadly already reduced because of not using the indy, but probably is a small drop in the ocean, if is not the hottest path!

dmlloyd commented 8 months ago

I think the loss of the indy is not a problem for the same reason that the non-deprecated ctor usage is not a problem: it is indeed a drop in the ocean. We do not create many of these address instances in this way, normally. Certainly not on any hot paths. The purpose of these methods is to construct addresses for which a RDNS lookup is never performed, and the code that exists does that pretty well.

The primary question is, do we see a measurable improvement with the patch as it stands, in a real-world application (e.g. a Quarkus application), which is greater than the error%? If not, then I don't think the complexity is warranted (in addition to the message "we provide these methods to construct an address, but do not use them because they are too slow").

dmlloyd commented 8 months ago

Addendum: In retrospect, the best approach would have been to create these as static methods with return a dynamic constant that lazily constructs each address. But, the only way we could possibly move to that approach (leaving out the necessary bytecode processing, which is not trivial) would be to deprecate the fields, which would not fix the problem until they are actually removed. So, I'd say it's too late for that.

franz1981 commented 8 months ago

The primary question is, do we see a measurable improvement with the patch as it stands, in a real-world application (e.g. a Quarkus application), which is greater than the error%?

I can share what I've learnt at https://github.com/smallrye/smallrye-common/pull/276#issuecomment-1892210071: the difference exists in quickstart uses cases, which, although are not representative of any "real" usage IMO, are still used by some users (eg some experts which can share RSS random numbers on their social accounts) - which can affect us, somehow. Maybe @radcortez got some different type of tests, but I believe that ~217KB is still small but surprisingly high cost for something like that (and I didn't checked the RSS native footprint yet) and if we can avoid users to complain about it in public, spending even more cycles, later on, it's an (half) win.

dmlloyd commented 8 months ago

The primary question is, do we see a measurable improvement with the patch as it stands, in a real-world application (e.g. a Quarkus application), which is greater than the error%?

I can share what I've learnt at #276 (comment): the difference exists in quickstart uses cases, which, although are not representative of any "real" usage IMO, are still used by some users (eg some experts which can share RSS random numbers on their social accounts) - which can affect us, somehow. Maybe @radcortez got some different type of tests, but I believe that ~217KB is still small but surprisingly high cost for something like that (and I didn't checked the RSS native footprint yet) and if we can avoid users to complain about it in public, spending even more cycles, later on, it's an (half) win.

Where do you get ~217KB from? Upstream already has the indy-removing change AFAICT... this patch only seems to reuse byte[]. Did I miss something?

radcortez commented 8 months ago

Sorry, to clarify, the measurement is with Quarkus main, and with a replace of Inet from Wildfly with this one in the PR from SR Commons. I know that Wildfly will soon delegate to SR Commons, but I believe that one is not released yet.

How are you measuring, and are you comparing against main

I didn't use any fancy measurement; a plain ps -e -o pid, RSS,args shows the difference between both versions. I believe the increased RSS came from a new Wildfly Commons version built with a more recent JDK.

For reference, this is how it looks like in main with no changes (and with WF Commons 1.7.0):

Screenshot-2024-01-12-at-12 25 25

And this is what it looks like with Quarkus 3.5.0, before the WF Common (1.5.4) update:

Screenshot-2024-01-12-at-12 25 36

And with the PR change:

Screenshot 2024-01-22 at 17 40 16

dmlloyd commented 8 months ago

This however all has no bearing on this PR. WF Common is clearly the main cost here.

To maximize the measurement of this change, I'd recommend building wildfly-common from main (this will be 2.0.0.Final-SNAPSHOT), and then retest the quickstart without and with this PR to see if it makes an impact. The reason for that is since wildfly/wildfly-common#100, WildFly Common will redirect to SmallRye Common for this code, so all of the relevant code paths for this class will come through one route. Then it will be possible to test just this PR in isolation to see if there is indeed a relevant cost savings.

franz1981 commented 8 months ago

@dmlloyd

Where do you get ~217KB from?

It was from the allocation flamegraphs which @radcortez shared in https://github.com/smallrye/smallrye-common/pull/276#issuecomment-1892180542

franz1981 commented 8 months ago

@dmlloyd let me apologize, my brain was still stuck at https://github.com/smallrye/smallrye-common/pull/275 which was already merged and for some reason I was thinking this PR to be that one instead 😂

I have probably misread my own comment at https://github.com/smallrye/smallrye-common/pull/276#issuecomment-1892210071 in which I was providing the full analysis of this and the other PR, altogether, compared to without.

This one is just saving to allocate few bytes (and Strings), because we already know the addresses, in string literal form; said that I have no problem to close it, in case it looks weird or useless...

dmlloyd commented 8 months ago

In that case WDYT of #279 as an API-friendly alternative?

radcortez commented 8 months ago

Here are some additional flamegraphs:

With Wildfly Commons 2.0.0-SNAPSHOT, build with J17:

Screenshot 2024-01-23 at 15 04 06

With SR Commons 2.1.2, and using SR Commons Inet directly in InetSocketAddressConverter:

Screenshot 2024-01-23 at 15 04 34

franz1981 commented 8 months ago

Yeah @radcortez in both I see that the strong concat indy is very relevant

dmlloyd commented 8 months ago

Yeah we all agree about the string concat issue. Now we just need to all agree that string concat is not the subject of this PR nor the subject of this discussion. 😉

franz1981 commented 8 months ago

I see @dmlloyd that @radcortez has been stucked with me on the previous pr ihih

dmlloyd commented 6 months ago

@franz1981 do we still want to pursue this further?

franz1981 commented 6 months ago

We can ask @radcortez if https://github.com/dmlloyd/smallrye-common/commit/22747554f505c037dd124975de2d2078f5b5eca5 (which I like) superseded and fixed this same issue; if it does it, we can safely close it!

radcortez commented 6 months ago

Correct.

smallrye / smallrye-common

Remove wasteful safer Inet4 lookups #276