vigna / fastutil

fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues.
Apache License 2.0

NPE while putting into Object2IntOpenHashMap #268

Closed. tenghuanhe closed this issue 2 years ago.

tenghuanhe commented 2 years ago

Hi, I'm getting an NPE while putting into an Object2IntOpenHashMap. Could anyone help?

java.lang.NullPointerException
 at it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap.rehash(Object2IntOpenHashMap.java:1313) ~[fastutil-8.1.0.jar:?]
 at it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap.insert(Object2IntOpenHashMap.java:322) ~[fastutil-8.1.0.jar:?]
 at it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap.put(Object2IntOpenHashMap.java:331) ~[fastutil-8.1.0.jar:?]
tenghuanhe commented 2 years ago

Hi, @vigna could you please have a look? Thanks very much!!

vigna commented 2 years ago

Check for concurrent access. The maps are fine.

incaseoftrouble commented 2 years ago

Please describe how to reproduce the issue on the current version of fastutil.

tenghuanhe commented 2 years ago

> Check for concurrent access. The maps are fine.

Thanks for the reply, I'll have a look.

tenghuanhe commented 2 years ago

> Please describe how to reproduce the issue on the current version of fastutil.

Thanks for the reply @incaseoftrouble. It's hard to reproduce this issue manually, but it occurs every few days in our prod env. I'll attach the code later.

tenghuanhe commented 2 years ago

Hi @vigna @incaseoftrouble, here is the code snippet where Object2IntOpenHashMap is used and where the NPE occurred: https://github.com/apache/druid/blob/c2fca2744a5427a823f6521ccdba2d1548ee85d9/processing/src/main/java/io/druid/segment/StringDimensionIndexer.java#L101-L116

The map valueToId is properly synchronized and is not written to anywhere else.

    public int add(String originalValue)
    {
      String value = Strings.nullToEmpty(originalValue);
      synchronized (lock) {
        int prev = valueToId.getInt(value);
        if (prev >= 0) {
          return prev;
        }
        final int index = size();
        valueToId.put(value, index);
        idToValue.add(value);
        minValue = minValue == null || minValue.compareTo(value) > 0 ? value : minValue;
        maxValue = maxValue == null || maxValue.compareTo(value) < 0 ? value : maxValue;
        return index;
      }
    }
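
Not shown in the snippet: the if (prev >= 0) check relies on getInt returning the map's default return value for a missing key, which is 0 unless configured, so presumably valueToId is constructed with a negative default, roughly like this sketch (not the actual Druid initializer):

    // Sketch of the implied map construction (the actual Druid field initializer
    // is not shown in this thread). defaultReturnValue(-1) makes getInt() return
    // -1 for a missing key instead of the default 0, so the prev >= 0 test works.
    Object2IntOpenHashMap<String> valueToId = new Object2IntOpenHashMap<>();
    valueToId.defaultReturnValue(-1);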

Detailed stack trace:

java.lang.NullPointerException
 at it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap.rehash(Object2IntOpenHashMap.java:1313) ~[fastutil-8.1.0.jar:?]
 at it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap.insert(Object2IntOpenHashMap.java:322) ~[fastutil-8.1.0.jar:?]
 at it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap.put(Object2IntOpenHashMap.java:331) ~[fastutil-8.1.0.jar:?]
 at io.druid.segment.StringDimensionIndexer$DimensionDictionary.add(StringDimensionIndexer.java:110) ~[druid-processing-0.12.3-iap7-failover.jar:0.12.3-iap7]
 at io.druid.segment.StringDimensionIndexer.processRowValsToUnsortedEncodedKeyComponent(StringDimensionIndexer.java:234) ~[druid-processing-0.12.3-iap7-failover.jar:0.12.3-iap7]
 at io.druid.segment.StringDimensionIndexer.processRowValsToUnsortedEncodedKeyComponent(StringDimensionIndexer.java:59) ~[druid-processing-0.12.3-iap7-failover.jar:0.12.3-iap7]
 at io.druid.segment.incremental.IncrementalIndex.toTimeAndDims(IncrementalIndex.java:632) ~[druid-processing-0.12.3-iap7-failover.jar:0.12.3-iap7]
 at io.druid.segment.incremental.IncrementalIndex.add(IncrementalIndex.java:572) ~[druid-processing-0.12.3-iap7-failover.jar:0.12.3-iap7]
 at io.druid.segment.realtime.plumber.Sink.add(Sink.java:158) ~[druid-server-0.12.3-iap7.jar:0.12.3-iap7]
 at io.druid.segment.realtime.plumber.RealtimePlumber.add(RealtimePlumber.java:228) ~[druid-server-0.12.3-iap7.jar:0.12.3-iap7]
 at io.druid.segment.realtime.plumber.Plumbers.addNextRow(Plumbers.java:69) ~[druid-server-0.12.3-iap7.jar:0.12.3-iap7]
 at io.druid.indexing.common.task.RealtimeIndexTask.run(RealtimeIndexTask.java:397) ~[druid-indexing-service-0.12.3-iap7.jar:0.12.3-iap7]
 at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:457) [druid-indexing-service-0.12.3-iap7.jar:0.12.3-iap7]
 at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:429) [druid-indexing-service-0.12.3-iap7.jar:0.12.3-iap7]
 at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_171]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
 at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]

Do you have any ideas? Thanks again.

incaseoftrouble commented 2 years ago

> It's hard to reproduce this issue manually, but it occurs every few days

This smells a lot like a concurrency issue, but the code really does seem to be fine in that regard.

The objects you put into it are immutable (strings / ints) and you never put nulls in there, so I can't really see what is going wrong. For reference, the 8.1.0 rehash code is:

protected void rehash(final int newN) {
    final K key[] = this.key;
    final int value[] = this.value;
    final int mask = newN - 1;
    final K newKey[] = (K[]) new Object[newN + 1];
    final int newValue[] = new int[newN + 1];
    int i = n, pos;

    for (int j = realSize(); j-- != 0;) {
        while (((key[--i]) == null));

        // NPE happens in this line
        if (!((newKey[pos = (it.unimi.dsi.fastutil.HashCommon.mix((key[i]).hashCode())) & mask]) == null))
            while (!((newKey[pos = (pos + 1) & mask]) == null));

        newKey[pos] = key[i];
        newValue[pos] = value[i];
    }

    newValue[newN] = value[n];
    n = newN;
    this.mask = mask;
    maxFill = maxFill(n, f);
    this.key = newKey;
    this.value = newValue;
}

and is effectively the same as the 8.5.6 code.

I can't even see how an NPE could possibly occur there except with concurrent access (key[i] and newKey have to be non-null at that line unless key is modified concurrently). But the reference to the map (valueToId) is not leaked and is always guarded by a lock.
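
For illustration only, here is a minimal, self-contained sketch (not taken from your code) of how unsynchronized concurrent puts can break the map during rehashing. The failure is nondeterministic and may show up as an NPE in rehash, an ArrayIndexOutOfBoundsException, a corrupted size, or not at all on a given run:

    import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;

    public class ConcurrentPutSketch {
        public static void main(String[] args) throws InterruptedException {
            // Shared map with NO external synchronization -- this is the bug being illustrated.
            final Object2IntOpenHashMap<String> map = new Object2IntOpenHashMap<>();

            Runnable writer = () -> {
                for (int i = 0; i < 1_000_000; i++) {
                    // Concurrent puts trigger concurrent rehashes, which can corrupt
                    // the internal key/value arrays.
                    map.put(Thread.currentThread().getName() + "-" + i, i);
                }
            };

            Thread t1 = new Thread(writer, "writer-1");
            Thread t2 = new Thread(writer, "writer-2");
            t1.start();
            t2.start();
            t1.join();
            t2.join();
            System.out.println("size = " + map.size());
        }
    }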

tenghuanhe commented 2 years ago

> I can't even see how an NPE could possibly occur there except with concurrent access (key[i] and newKey have to be non-null at that line unless key is modified concurrently). But the reference to the map (valueToId) is not leaked and is always guarded by a lock.

@incaseoftrouble thanks for your quick reply. Exactly!! I am also very confused as to how this could happen 😕😕.

szarnekow commented 2 years ago

How many entries are in your map? Can you share the sequence of keys that is being used?

vigna commented 2 years ago

OK, I see that this old version has a significant problem: null keys will cause an exception. The current version has a separate test for null keys. I'd suggest trying that first. Still, it is not clear to me why there would be null keys in the map.
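
For example, a quick null-key sanity check against whichever fastutil version is on the classpath could look like this (just a sketch, not part of the fastutil test suite):

    import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;

    public class NullKeyCheck {
        public static void main(String[] args) {
            // On a recent fastutil release this is expected to work; on the old
            // 8.1.0 code path a null key could eventually trigger an exception.
            Object2IntOpenHashMap<String> m = new Object2IntOpenHashMap<>();
            m.defaultReturnValue(-1);
            m.put(null, 42);
            System.out.println(m.getInt(null));      // expected: 42
            System.out.println(m.containsKey(null)); // expected: true
        }
    }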

tenghuanhe commented 2 years ago

@szarnekow Thank you for your reply.

> How many entries are in your map?

The map size can be 200k or more.

> Can you share the sequence of keys that is being used?

Sorry, but the keys were not dumped when the NPE occurred.

@vigna @incaseoftrouble Speaking of the keys being used, there is another confusing thing we encountered. Taking a closer look at the code snippet above, the variable List<String> idToValue should never contain duplicate values, yet it did. At the same time the content of the map valueToId is even weirder: valueToId is a String -> Integer map, but when opening the heap dump file with MAT (Memory Analyzer Tool) we found keys of type byte[], char[], int[], java.io.DataOutputStream and others, as shown in the attached screenshot (the String values in the figure are obscured for reasons of confidentiality).

vigna commented 2 years ago

From the logs, in March 2017 there was a major code cleanup. That was the moment in which the bug was fixed. You should definitely use a more recent version.

vigna commented 2 years ago

Well, you definitely have to look for concurrent access. There is no way the insertion in that piece of code can put keys of the wrong type into the map. So upgrading will not solve your problem.

incaseoftrouble commented 2 years ago

@tenghuanhe This really seems like an issue on your side. Possible culprits might be actual hardware problems, faulty JVM / memory management at the OS level, or the use of reflection (or some other mechanism that bypasses visibility / type safety). These objects cannot be in there, simply because of the type-safety guarantees of the JVM. The only way references get into the map is through the add method, and that is typed as String.

Are you maybe using JNI / native calls or JVM agents? These can also mess with memory, I think.

Since the maps are so large, it could be a memory issue at the OS / JVM level. Which JVM are you using?

tenghuanhe commented 2 years ago

> Well, you definitely have to look for concurrent access. There is no way the insertion in that piece of code can put keys of the wrong type into the map. So upgrading will not solve your problem.

@vigna thank you, I'll check for concurrent access again and consider upgrading if necessary.

tenghuanhe commented 2 years ago

@incaseoftrouble Thanks again for your kindness and your reply.

> Are you maybe using JNI / native calls or JVM agents? These can also mess with memory, I think.

I am not sure whether JNI / native calls are used and will check. JVM agents are not used. The problem started to appear after a feature using file mmap was added to the codebase (first the data is written to a disk file via mmap, then it is read back from the disk file via mmap and added to the Object2IntOpenHashMap). Could that mess with the memory? Or have you encountered a similar situation before?

> Since the maps are so large, it could be a memory issue at the OS / JVM level. Which JVM are you using?

The JVM and kernel version:

~$ java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
~$ uname -r
4.19.95-13

incaseoftrouble commented 2 years ago

MMAP definitely sounds like something that can mess with memory. I guess you are using Java's memory-mapped buffers though? Those should be fine.

Maybe you are (de)serializing objects in this memory mapping and that causes issues (objects of the wrong type being created). Maybe add some type-safety assertions along the way from the mmap reading to the add call (assert foo instanceof String). But I'm not sure whether something could actually go wrong in that way.
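
Something along these lines, where decodeNextValue and dictionary are placeholder names for whatever sits between the mmap read and the add call (a sketch, not Druid code):

    // Sketch only; decodeNextValue() and dictionary are hypothetical placeholders.
    // An explicit check is used instead of an assert statement, since plain
    // asserts are disabled unless the JVM runs with -ea.
    Object raw = decodeNextValue(mappedByteBuffer);
    if (!(raw instanceof String)) {
        throw new IllegalStateException("Decoded " + raw.getClass() + " instead of a String");
    }
    dictionary.add((String) raw);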

vigna commented 2 years ago

I think there are instrumentation frameworks that will let you add asserts at load time. You could use one of those to put type asserts in all put methods. But mmap sounds like the source of your problems.
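
A lighter-weight alternative to load-time instrumentation would be a thin checked subclass (a sketch, not part of fastutil):

    import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;

    // Sketch of a checked subclass; under normal JVM type safety this check can
    // never fire, so if it does, the corruption happens before the map is touched.
    public class CheckedStringKeyMap extends Object2IntOpenHashMap<String> {
        @Override
        public int put(final String key, final int value) {
            if (key != null && key.getClass() != String.class) {
                throw new AssertionError("Unexpected key class: " + key.getClass());
            }
            return super.put(key, value);
        }
    }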

tenghuanhe commented 2 years ago

> I guess you are using Java's memory-mapped buffers though?

@incaseoftrouble yes, we are using java.nio.MappedByteBuffer like this:

FileChannel channel = FileChannel.open(currentWriteFile.toPath(), StandardOpenOption.READ, StandardOpenOption.WRITE);
MappedByteBuffer mappedByteBuffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, fileSize);
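
For context, the write/read round trip looks roughly like the following simplified sketch (made-up sizes and values, not our production code); any mismatch between the positions or lengths used when writing and reading would produce garbage Strings rather than keys of a different class:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MmapRoundTripSketch {
        public static void main(String[] args) throws IOException {
            try (FileChannel channel = FileChannel.open(Paths.get("values.bin"),
                    StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 20);

                // Write: a length-prefixed UTF-8 value.
                byte[] bytes = "some-dimension-value".getBytes(StandardCharsets.UTF_8);
                buffer.putInt(bytes.length);
                buffer.put(bytes);

                // Read back from the start of the mapping.
                buffer.flip();
                int len = buffer.getInt();
                byte[] read = new byte[len];
                buffer.get(read);
                System.out.println(new String(read, StandardCharsets.UTF_8));
            }
        }
    }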

> I think there are instrumentation frameworks that will let you add asserts at load time.

Thanks @vigna I'll search for it.

> But mmap sounds like the source of your problems.

Agreed, I am going to check the mmap usage in our code.

Since it's not an Object2IntOpenHashMap problem, I think it's okay to close this issue?

Thanks again for your help @vigna @incaseoftrouble.