pemistahl / lingua

The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
Apache License 2.0

Compact memory data (#101) #127

Closed fvasco closed 2 years ago

fvasco commented 2 years ago

I changed the runtime memory model: the original JSON is now translated to a dense map. This reduces memory requirements at the cost of speed (frequency lookups should be slower). Frequencies are stored as Float instead of Double, which introduces a 0.001% error in the calculations; the tests are updated accordingly.

fastutil dependency has been removed.

All changes are performed in internal classes, so this request is compatible with the 1.1 version and I hope that the merge will be considered soon.
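
To make the "dense map" idea concrete, here is a minimal sketch of one way such a structure could look. It is not the code from this PR: the class name `DenseNgramTable`, the fixed n-gram length, and the binary-search lookup are all assumptions chosen for illustration. Keys are packed into a single `char[]` and frequencies into a parallel `float[]`, so there is no per-entry `String` or boxed value, at the price of an O(log n) lookup instead of a hash lookup.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: fixed-length n-grams and their frequencies in two flat arrays. */
final class DenseNgramTable {
    private final int ngramLength;
    private final char[] keys;          // all n-grams concatenated, in sorted order
    private final float[] frequencies;  // frequencies[i] belongs to the i-th n-gram

    DenseNgramTable(int ngramLength, Map<String, Double> source) {
        this.ngramLength = ngramLength;
        // TreeMap sorts keys by char value, the same order compareAt() uses below.
        TreeMap<String, Double> sorted = new TreeMap<>(source);
        keys = new char[sorted.size() * ngramLength];
        frequencies = new float[sorted.size()];
        int slot = 0;
        for (Map.Entry<String, Double> entry : sorted.entrySet()) {
            String ngram = entry.getKey();
            if (ngram.length() != ngramLength) {
                throw new IllegalArgumentException("unexpected n-gram length: " + ngram);
            }
            ngram.getChars(0, ngramLength, keys, slot * ngramLength);
            frequencies[slot] = entry.getValue().floatValue(); // narrowing double -> float
            slot++;
        }
    }

    /** Returns the stored frequency, or 0 if the n-gram is unknown. */
    float frequencyOf(String ngram) {
        if (ngram.length() != ngramLength) {
            return 0f;
        }
        int low = 0;
        int high = frequencies.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int cmp = compareAt(mid, ngram);
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                return frequencies[mid];
            }
        }
        return 0f;
    }

    /** Compares the n-gram stored in the given slot with the query, char by char. */
    private int compareAt(int slot, String ngram) {
        int offset = slot * ngramLength;
        for (int i = 0; i < ngramLength; i++) {
            int diff = keys[offset + i] - ngram.charAt(i);
            if (diff != 0) {
                return diff;
            }
        }
        return 0;
    }

    /** Relative error introduced by storing a single double as a float. */
    static double relativeFloatError(double value) {
        return Math.abs((double) (float) value - value) / Math.abs(value);
    }
}
```

The frequencies used in the test values are made up. Note that `relativeFloatError` only measures the rounding of one stored value; the error quoted above refers to the whole calculation, which combines many such values.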

pemistahl commented 2 years ago

Thanks for your PR, Francesco. I will evaluate your changes soon and let you know whether I want to merge them or not. Please be a little patient. My todo list is growing faster than I expected (which is a good thing :).

fvasco commented 2 years ago

Happy to hear that, @pemistahl. Please note that this version keeps the JSON data files, which are really inefficient. It would be much better to use a binary data source, which loads much faster and requires less memory (already described here).
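
As a rough illustration of what a binary data source could mean here, the sketch below writes n-gram frequencies with `DataOutputStream` and reads them back with `DataInputStream`. The layout (an entry count followed by key/float pairs) and the class name `BinaryFrequencyCodec` are assumptions made up for this example; neither lingua nor this PR defines such a format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical binary layout: int entry count, then (UTF string, float) per entry. */
final class BinaryFrequencyCodec {

    static byte[] encode(Map<String, Float> frequencies) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(frequencies.size());
            for (Map.Entry<String, Float> entry : frequencies.entrySet()) {
                out.writeUTF(entry.getKey());     // 2-byte length prefix + modified UTF-8
                out.writeFloat(entry.getValue()); // 4 bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Map<String, Float> decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            Map<String, Float> result = new LinkedHashMap<>();
            for (int i = 0; i < count; i++) {
                String ngram = in.readUTF();
                float frequency = in.readFloat();
                result.put(ngram, frequency);
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading such a stream involves no text parsing and no number parsing, which is where a loading-time advantage over JSON would be expected to come from. A real format would likely also need a version header and could decode directly into flat arrays rather than a `Map`.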

sigpwned commented 2 years ago

Hello! I realize I'm dropping in from nowhere, but I'm working on a project that involves language detection and saw this work was evolving in real time, had a little bit of spare time on a Sunday afternoon, and thought I might be able to add some value. So here I am!

I was a software performance analyst in a past life, and helping think through changes like this was an important part of that role. When optimizing programs, you quickly learn that the first thing you need is data. It turns out that the Pareto principle applies to software performance: 90% of time is spent in 10% of code. It also turns out that humans -- even experts who have made a career out of optimizing programs all day long -- are very bad at guessing which 10% matters! So if we don't measure, it's really hard to know if the work we do is having the desired effect.

To that end, I put together this very simple benchmark of lingua performance using JMH. There's a little bit of test harness to unpack in that repository, but most of the work is here:

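import static java.util.stream.Collectors.toList;

import com.github.pemistahl.lingua.api.Language;
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;
import com.google.common.io.Resources; // assumption: Resources is Guava's helper class
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.zip.GZIPInputStream;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

@Fork(value = 3)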
@OutputTimeUnit(TimeUnit.SECONDS)
@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class DetectBenchmark {
  public LanguageDetector detector;

  /**
   * Contains exactly 1MB of "random" data sampled from Twitter streaming API. Visually confirmed to
   * be multi-language.
   */
  public List<String> lines;

  @Setup
  public void setupDetectBenchmark() throws IOException {
    detector = LanguageDetectorBuilder.fromAllLanguages().withPreloadedLanguageModels().build();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(Resources.getResource("tweets.txt.gz").openStream()),
        StandardCharsets.UTF_8))) {
      lines = in.lines().collect(toList());
    }

  }

  /*
   * @formatter:off
   * 
   * As of 2022-03-27:
   * 
   * Benchmark                         Mode  Cnt    Score   Error  Units
   * EmojiJavaBenchmark.tweets        thrpt   15  105.708 ± 0.401  ops/s
   * 
   * @formatter:on
   */
  @Benchmark
  public void detect(Blackhole blackhole) {
    Set<Language> languages = EnumSet.noneOf(Language.class);
    for (String line : lines)
      languages.add(detector.detectLanguageOf(line));
    blackhole.consume(languages);
  }
}

Essentially, the benchmark initializes by creating a language detector for all languages with preloaded models and loading about 20K tweets pulled from social media. The benchmark itself then detects language for each tweet, over and over, as fast as it can. The JMH framework then reports performance characteristics for the benchmark. I ran the benchmark on the current code, and then on the proposed change. To be fully transparent, I did have to make a couple of (cosmetic) changes to the PR to get the benchmark to work, and you can review those changes here. (@fvasco, if you got any notifications about that work, then I apologize for the distraction, it is entirely because I have fat fingers.)

Here is the performance of the current code:

Benchmark                                                 Mode  Cnt           Score            Error   Units
DetectBenchmark.detect                                   thrpt    9           0.197 ±          0.005   ops/s
DetectBenchmark.detect:·gc.alloc.rate                    thrpt    9        1355.646 ±         43.312  MB/sec
DetectBenchmark.detect:·gc.alloc.rate.norm               thrpt    9  7525093881.037 ±     366615.998    B/op

Here is the performance of the current code with the proposed change:

Benchmark                                                 Mode  Cnt           Score            Error   Units
DetectBenchmark.detect                                   thrpt    9           0.151 ±          0.002   ops/s
DetectBenchmark.detect:·gc.alloc.rate                    thrpt    9         928.336 ±         88.560  MB/sec
DetectBenchmark.detect:·gc.alloc.rate.norm               thrpt    9  6688520282.222 ±  659955119.790    B/op

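The net results here are a throughput drop of about 23% (0.197 to 0.151 ops/s), an allocation rate about 32% lower (1355.646 to 928.336 MB/sec), and about 11% fewer bytes allocated per operation, although the error on that last measurement is large.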

Software optimization is hard. Everything is connected to everything else, often in unpredictable ways. In this case, it looks like the change has made a nice reduction in memory usage as well as an unfortunate reduction in wall clock performance. I suspect that someone could quickly figure out what is going on with some light profiling in a nice, free tool like VisualVM and make some quick performance improvements. I suspect the tool could inform any memory usage planning, too. I would be happy to help, if that's useful.
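
For anyone who wants to check the arithmetic, the relative changes can be computed directly from the two JMH tables above. This is only a small sketch; the class name `BenchmarkDelta` is made up, and the inputs are the scores copied from those tables.

```java
/** Computes relative changes between the two JMH result tables shown above. */
final class BenchmarkDelta {

    /** Percentage change from before to after; a negative value means a decrease. */
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Scores copied from the tables: current code first, proposed change second.
        System.out.printf("throughput (ops/s):    %+.1f%%%n",
                percentChange(0.197, 0.151));
        System.out.printf("gc.alloc.rate (MB/s):  %+.1f%%%n",
                percentChange(1355.646, 928.336));
        System.out.printf("gc.alloc.rate.norm:    %+.1f%%%n",
                percentChange(7525093881.037, 6688520282.222));
    }
}
```

One caveat when reading the last figure: the reported error for `gc.alloc.rate.norm` on the proposed change is ± 659955119.790 B/op, roughly 10% of the score itself, so the per-operation difference is not far outside the measurement noise.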

I do apologize for popping in here from nowhere. I hope it's not rude. I certainly did not pop in to criticize work (to the contrary, this is good stuff, @fvasco!), or to tell anyone what to do or not to do (@pemistahl, it's your show!). However, it's rare that I find my background to be relevant to work going on in real time, and I couldn't resist the urge to poke my head in. If this is unwelcome, then please do tell me to buzz off, and I shall!

fvasco commented 2 years ago

Hi, @sigpwned, thank you for your engagement. I am a random visitor like you.

I suspect that you measured the memory allocation rate for each run. My goal is to reduce the memory requirements: unfortunately, this library requires too much RAM, so we chose another one.

I expected a performance drop, but 23% is a significant drop, so I have decided to rework this PR to try to improve performance.

If your environment is still set up, could you evaluate this PR again?

Thank you in advance.

sigpwned commented 2 years ago

> I suspect that you measured the memory allocation rate for each run. My goal is to reduce the memory requirements: unfortunately, this library requires too much RAM, so we chose another one.

Ah! You are correct, and that is an important difference. I think I can help there too. Hopefully I'll come back in a bit with some more information on that.

> If your environment is still set up, could you evaluate this PR again?

I would be very happy to! If it's useful, I'd also be happy to spend a little time in a profiler to see if I can't identify where we're losing performance.

If performance (whether wall clock or memory) is becoming a major focus, then I think it might also make sense to integrate some basic benchmarks into PR checks. @pemistahl, does that sound interesting or useful? If so, I'd be happy to take a swing at it in a separate PR!

sigpwned commented 2 years ago

OK! Using jol, I was able to pull object graph memory footprints. Short version here, long version below.

For an object:

LanguageDetector detector =
        LanguageDetectorBuilder.fromAllLanguages().withPreloadedLanguageModels().build();

The object graph memory footprints are as follows:

v1.1.1 in maven -- 2,281,277,024 B ≈ 2.28GB

this PR -- 385,479,016 B ≈ 385MB

So substantially reduced memory footprint, roughly 380MB as advertised. Huge improvement, @fvasco!

I've updated the code in the benchmark repo in case anyone is curious. The memory usage is a little ticklish to run because it wants elevated permissions, but it should at least be pretty clear what's going on.

Full dumps are included below, in case anyone is curious.

The original code has the following footprint:

com.github.pemistahl.lingua.api.LanguageDetector@f0f2775d footprint:
     COUNT       AVG       SUM   DESCRIPTION
  31118328        25 795811136   [B
       375   1311825 491934544   [D
        72      3780    272160   [I
         1        16        16   [J
       210        31      6592   [Ljava.lang.Class;
        14       144      2016   [Ljava.lang.ClassValue$Entry;
       986    249494 246001872   [Ljava.lang.Object;
         1        24        24   [Ljava.lang.String;
         3        53       160   [Ljava.lang.Thread;
         1        32        32   [Ljava.lang.ThreadGroup;
        11        80       880   [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
         9        40       360   [Ljava.lang.invoke.BoundMethodHandle$SpeciesData;
       112        47      5264   [Ljava.lang.invoke.LambdaForm$Name;
         4        32       128   [Ljava.lang.invoke.LambdaFormEditor$Transform;
        29       205      5960   [Ljava.lang.invoke.MethodHandle;
       182        76     13832   [Ljava.lang.ref.SoftReference;
         1        24        24   [Ljava.lang.reflect.Constructor;
        40        43      1736   [Ljava.lang.reflect.Field;
         5        41       208   [Ljava.lang.reflect.Method;
         2        24        48   [Ljava.lang.reflect.TypeVariable;
         3        21        64   [Ljava.security.CodeSigner;
        12        16       192   [Ljava.security.Principal;
        11        24       264   [Ljava.security.ProtectionDomain;
         1        32        32   [Ljava.security.cert.Certificate;
        67       102      6888   [Ljava.util.HashMap$Node;
         2       304       608   [Ljava.util.Hashtable$Entry;
         1        32        32   [Ljava.util.Map$Entry;
        45        80      3600   [Ljava.util.WeakHashMap$Entry;
        33       283      9360   [Ljava.util.concurrent.ConcurrentHashMap$Node;
         1        16        16   [Lsun.instrument.TransformerManager$TransformerInfo;
        11        32       352   [Lsun.nio.fs.NativeBuffer;
         2        16        32   [Lsun.reflect.generics.tree.ClassTypeSignature;
         3        24        72   [Lsun.reflect.generics.tree.FieldTypeSignature;
         2        24        48   [Lsun.reflect.generics.tree.FormalTypeParameter;
         5        16        80   [Lsun.reflect.generics.tree.TypeArgument;
        46        24      1104   [Lsun.security.x509.AVA;
        12        32       392   [Lsun.security.x509.RDN;
         5        24       120   [Z
        75        24      1800   com.github.pemistahl.lingua.api.IsoCode639_1
        75        24      1800   com.github.pemistahl.lingua.api.IsoCode639_3
        75        40      3000   com.github.pemistahl.lingua.api.Language
         1        64        64   com.github.pemistahl.lingua.api.LanguageDetector
        18        24       432   com.github.pemistahl.lingua.internal.Alphabet
       375        32     12000   com.github.pemistahl.lingua.internal.TrainingDataLanguageModel
         1        16        16   com.sun.management.internal.PlatformMBeanProviderImpl$$Lambda$31/0x00000008000e1f18
       375        72     27000   it.unimi.dsi.fastutil.objects.Object2DoubleOpenHashMap
        21        32       672   java.io.File
         1        24        24   java.io.File$PathStatus
        22        56      1232   java.io.FileCleanable
        22        40       880   java.io.FileDescriptor
         2        32        64   java.io.FileInputStream
        20        32       640   java.io.RandomAccessFile
         2        16        32   java.lang.Boolean
        18        24       432   java.lang.Character$UnicodeScript
       135       192     25920   java.lang.Class
         1        16        16   java.lang.Class$$Lambda$44/0x00000008000f79c8
        23        64      1472   java.lang.Class$ReflectionData
        14        64       896   java.lang.ClassValue$ClassValueMap
        15        32       480   java.lang.ClassValue$Entry
         1        16        16   java.lang.ClassValue$Identity
         1        24        24   java.lang.ClassValue$Version
        10        16       160   java.lang.Integer
         2        56       112   java.lang.Module
         1        16        16   java.lang.Module$$Lambda$60/0x000000080015dba0
        43        16       688   java.lang.Object
         1        16        16   java.lang.ProcessHandleImpl$$Lambda$42/0x00000008000f5f18
         1        24        24   java.lang.ProcessHandleImpl$$Lambda$43/0x00000008000f6138
         2        32        64   java.lang.Runtime$Version
        11        24       264   java.lang.RuntimePermission
  31118003        24 746832072   java.lang.String
         1        16        16   java.lang.System$LoggerFinder$$Lambda$75/0x000000080018a580
        15       368      5520   java.lang.Thread
         3        48       144   java.lang.ThreadGroup
         2        16        32   java.lang.ThreadLocal
        11        24       264   java.lang.ThreadLocal$ThreadLocalMap
        26        32       832   java.lang.ThreadLocal$ThreadLocalMap$Entry
         1        56        56   java.lang.invoke.BoundMethodHandle$Specializer
         1        48        48   java.lang.invoke.BoundMethodHandle$Specializer$Factory
         9        48       432   java.lang.invoke.BoundMethodHandle$SpeciesData
        39        32      1248   java.lang.invoke.BoundMethodHandle$Species_L
         9        40       360   java.lang.invoke.BoundMethodHandle$Species_LJ
        25        40      1000   java.lang.invoke.BoundMethodHandle$Species_LL
         4        48       192   java.lang.invoke.BoundMethodHandle$Species_LLLL
         1        48        48   java.lang.invoke.BoundMethodHandle$Species_LLLLL
         1        56        56   java.lang.invoke.BoundMethodHandle$Species_LLLLLL
         3        56       168   java.lang.invoke.BoundMethodHandle$Species_LLLLLLL
        77        24      1848   java.lang.invoke.ConstantCallSite
        38        32      1216   java.lang.invoke.DirectMethodHandle
        30        40      1200   java.lang.invoke.DirectMethodHandle$Accessor
        20        40       800   java.lang.invoke.DirectMethodHandle$Constructor
        20        24       480   java.lang.invoke.Invokers
       112        48      5376   java.lang.invoke.LambdaForm
         4        32       128   java.lang.invoke.LambdaForm$BasicType
        12        32       384   java.lang.invoke.LambdaForm$Kind
       440        32     14080   java.lang.invoke.LambdaForm$Name
       140        24      3360   java.lang.invoke.LambdaForm$NamedFunction
        36        48      1728   java.lang.invoke.LambdaFormEditor$Transform
       294        48     14112   java.lang.invoke.MemberName
        77        32      2464   java.lang.invoke.MethodHandleNatives$CallSiteContext
       243        40      9720   java.lang.invoke.MethodType
        93        32      2976   java.lang.invoke.MethodTypeForm
       241        24      5784   java.lang.invoke.ResolvedMethodName
         1        16        16   java.lang.management.DefaultPlatformMBeanProvider$5$$Lambda$66/0x0000000800180818
         1        16        16   java.lang.management.ManagementFactory$$Lambda$30/0x00000008000deb28
         1        16        16   java.lang.management.ManagementFactory$$Lambda$53/0x000000080017d2a8
         1        16        16   java.lang.management.ManagementFactory$$Lambda$54/0x000000080017d4f8
         1        16        16   java.lang.management.ManagementFactory$$Lambda$55/0x000000080017dc30
        62        64      3968   java.lang.module.ModuleDescriptor
         1        16        16   java.lang.module.ModuleDescriptor$Builder$$Lambda$59/0x000000080015d970
       362        24      8688   java.lang.module.ModuleDescriptor$Exports
         4        24        96   java.lang.module.ModuleDescriptor$Opens
        60        24      1440   java.lang.module.ModuleDescriptor$Provides
       132        32      4224   java.lang.module.ModuleDescriptor$Requires
         2        24        48   java.lang.module.ModuleDescriptor$Requires$Modifier
         1        32        32   java.lang.module.ModuleDescriptor$Version
         1        16        16   java.lang.ref.Cleaner
         1       376       376   java.lang.ref.Finalizer$FinalizerThread
         1       368       368   java.lang.ref.Reference$ReferenceHandler
        46        32      1472   java.lang.ref.ReferenceQueue
        47        16       752   java.lang.ref.ReferenceQueue$Lock
         1        32        32   java.lang.ref.ReferenceQueue$Null
       126        40      5040   java.lang.ref.SoftReference
         1        72        72   java.lang.reflect.Constructor
       258        72     18576   java.lang.reflect.Field
        20        88      1760   java.lang.reflect.Method
         1        16        16   java.lang.reflect.Proxy$$Lambda$57/0x000000080015d0f0
         1        16        16   java.lang.reflect.Proxy$ProxyBuilder$$Lambda$58/0x000000080015d530
         1        16        16   java.lang.reflect.ProxyGenerator$$Lambda$63/0x000000080015e768
         1        16        16   java.lang.reflect.ProxyGenerator$$Lambda$64/0x000000080015e9a8
        15        40       600   java.math.BigInteger
        75        80      6000   java.net.URI
        45        64      2880   java.net.URL
         2        64       128   java.nio.DirectByteBuffer
         1        56        56   java.nio.HeapByteBuffer
        18        40       720   java.security.AccessControlContext
        11        24       264   java.security.BasicPermissionCollection
         1        24        24   java.security.CodeSigner
        11        40       440   java.security.CodeSource
        11        24       264   java.security.Permissions
        12        40       480   java.security.ProtectionDomain
        12        16       192   java.security.ProtectionDomain$Key
        11        16       176   java.security.SecureClassLoader$CodeSourceKey
         1        24        24   java.security.Timestamp
         6        32       192   java.security.cert.PolicyQualifierInfo
        32        24       768   java.util.ArrayDeque
        34        24       816   java.util.ArrayList
         1        16        16   java.util.Collections$EmptyList
         1        16        16   java.util.Collections$EmptySet
        42        24      1008   java.util.Collections$SetFromMap
        74        16      1184   java.util.Collections$SingletonSet
         5        32       160   java.util.Collections$SynchronizedMap
         1        32        32   java.util.Collections$UnmodifiableMap
         2        24        48   java.util.Collections$UnmodifiableRandomAccessList
         1        16        16   java.util.Collections$UnmodifiableSet
        11        24       264   java.util.Date
        13        48       624   java.util.HashMap
         2        16        32   java.util.HashMap$KeySet
       141        32      4512   java.util.HashMap$Node
         2        16        32   java.util.HashSet
         2        48        96   java.util.Hashtable
        44        32      1408   java.util.Hashtable$Entry
        11        40       440   java.util.IdentityHashMap
        11        16       176   java.util.IdentityHashMap$KeySet
        68        24      1632   java.util.ImmutableCollections$List12
        32        24       768   java.util.ImmutableCollections$ListN
       201        24      4824   java.util.ImmutableCollections$Set12
       114        24      2736   java.util.ImmutableCollections$SetN
        62        56      3472   java.util.LinkedHashMap
       551        40     22040   java.util.LinkedHashMap$Entry
        32        16       512   java.util.LinkedHashMap$LinkedEntrySet
         1        16        16   java.util.LinkedHashMap$LinkedKeySet
         6        16        96   java.util.LinkedHashSet
         1        16        16   java.util.Optional
         5        48       240   java.util.TreeMap
        38        40      1520   java.util.TreeMap$Entry
         4        32       128   java.util.Vector
        31        48      1488   java.util.WeakHashMap
        14        40       560   java.util.WeakHashMap$Entry
        31        16       496   java.util.WeakHashMap$KeySet
        40        64      2560   java.util.concurrent.ConcurrentHashMap
      1306        32     41792   java.util.concurrent.ConcurrentHashMap$Node
         3        16        48   java.util.concurrent.ConcurrentHashMap$ValuesView
         1        24        24   java.util.concurrent.Executors$DefaultThreadFactory
         1        48        48   java.util.concurrent.LinkedBlockingQueue
         1        24        24   java.util.concurrent.LinkedBlockingQueue$Node
         1        32        32   java.util.concurrent.SynchronousQueue
         1        16        16   java.util.concurrent.SynchronousQueue$TransferStack
         1        32        32   java.util.concurrent.SynchronousQueue$TransferStack$SNode
         2        72       144   java.util.concurrent.ThreadPoolExecutor
         1        16        16   java.util.concurrent.ThreadPoolExecutor$AbortPolicy
        11        48       528   java.util.concurrent.ThreadPoolExecutor$Worker
         4        16        64   java.util.concurrent.atomic.AtomicInteger
        10        32       320   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode
         4        24        96   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject
         4        16        64   java.util.concurrent.locks.ReentrantLock
         4        32       128   java.util.concurrent.locks.ReentrantLock$NonfairSync
         1        16        16   java.util.function.Function$$Lambda$67/0x0000000800180a58
        50        16       800   java.util.jar.Attributes
        26        24       624   java.util.jar.Attributes$Name
        20        56      1120   java.util.jar.JarFile
        31       104      3224   java.util.jar.JarFile$JarFileEntry
         1        88        88   java.util.jar.JarVerifier
         7        24       168   java.util.jar.Manifest
         1        16        16   java.util.logging.Level$$Lambda$73/0x00000008001893f0
         1        16        16   java.util.logging.Level$KnownLevel$$Lambda$71/0x0000000800184468
         1        16        16   java.util.logging.Level$KnownLevel$$Lambda$72/0x00000008001846a8
         1        16        16   java.util.logging.Level$KnownLevel$$Lambda$74/0x0000000800189630
         1        16        16   java.util.regex.CharPredicates$$Lambda$27/0x00000008000da528
         1        16        16   java.util.regex.CharPredicates$$Lambda$39/0x00000008000f3328
         1        16        16   java.util.stream.Collectors$$Lambda$36/0x00000008000e92c0
         1        16        16   java.util.stream.Collectors$$Lambda$37/0x00000008000e94e0
         1        16        16   java.util.stream.Collectors$$Lambda$38/0x00000008000e9710
         1        16        16   java.util.stream.Collectors$$Lambda$46/0x00000008000f7e30
         1        16        16   java.util.stream.Collectors$$Lambda$47/0x00000008000f8060
         1        16        16   java.util.stream.Collectors$$Lambda$48/0x00000008000f82a8
         1        16        16   java.util.stream.Collectors$$Lambda$68/0x0000000800180c98
         1        16        16   java.util.stream.Collectors$$Lambda$70/0x00000008001810f0
        37        64      2368   java.util.zip.Inflater
        37        24       888   java.util.zip.Inflater$InflaterZStreamRef
         1        32        32   java.util.zip.ZipCoder$UTF8ZipCoder
        31        32       992   java.util.zip.ZipFile$CleanableResource
        20        80      1600   java.util.zip.ZipFile$Source
        20        24       480   java.util.zip.ZipFile$Source$Key
        10        16       160   javax.security.auth.x500.X500Principal
         1        16        16   jdk.internal.jimage.ImageBufferCache$1
         1       104       104   jdk.internal.loader.ClassLoaders$AppClassLoader
         1       104       104   jdk.internal.loader.ClassLoaders$BootClassLoader
         1       104       104   jdk.internal.loader.ClassLoaders$PlatformClassLoader
         1        40        40   jdk.internal.loader.URLClassPath
         1        24        24   jdk.internal.loader.URLClassPath$FileLoader
        20        48       960   jdk.internal.loader.URLClassPath$JarLoader
         1       376       376   jdk.internal.misc.InnocuousThread
         1        16        16   jdk.internal.misc.TerminatingThreadLocal$1
         1        24        24   jdk.internal.module.ModuleHashes
        62        56      3472   jdk.internal.module.ModuleReferenceImpl
         1        16        16   jdk.internal.module.ModuleTarget
        62        24      1488   jdk.internal.module.SystemModuleFinders$2
        60        16       960   jdk.internal.module.SystemModuleFinders$3
        62        24      1488   jdk.internal.module.SystemModuleFinders$SystemModuleReader
         1        16        16   jdk.internal.perf.Perf
         1        24        24   jdk.internal.perf.Perf$CleanerAction
         1        24        24   jdk.internal.ref.CleanerImpl
         1        40        40   jdk.internal.ref.CleanerImpl$CleanerCleanable
       158        48      7584   jdk.internal.ref.CleanerImpl$PhantomCleanableRef
         1        16        16   jdk.jfr.internal.dcmd.DCmdCheck$$Lambda$79/0x0000000800191308
         1        16        16   jdk.jfr.internal.dcmd.DCmdDump$$Lambda$78/0x00000008001910e8
         1        16        16   jdk.jfr.internal.dcmd.DCmdStart$$Lambda$77/0x0000000800190ec8
         1        16        16   jdk.jfr.internal.dcmd.DCmdStop$$Lambda$76/0x000000080018e4b8
         1        16        16   kotlin.collections.CollectionsKt___CollectionsKt$asSequence$$inlined$Sequence$1
         1        16        16   kotlin.collections.EmptyMap
         1        24        24   org.openjdk.jol.info.AbstractGraphWalker$ReferenceFieldsClassValue
         1        96        96   org.openjdk.jol.vm.HotspotUnsafe
         1        24        24   org.openjdk.jol.vm.HotspotUnsafe$1
         1        48        48   org.openjdk.jol.vm.HotspotUnsafe$Sizes
         1        32        32   sun.instrument.InstrumentationImpl
         1        24        24   sun.instrument.TransformerManager
         4        56       224   sun.invoke.util.Wrapper
         1        16        16   sun.management.ManagementFactoryHelper$$Lambda$80/0x0000000800194e58
         1        16        16   sun.misc.Unsafe
         1        16        16   sun.net.www.protocol.file.Handler
         1        16        16   sun.net.www.protocol.jar.Handler
         1        16        16   sun.net.www.protocol.jar.JarFileFactory
        11        72       792   sun.net.www.protocol.jar.URLJarFile
         1        16        16   sun.net.www.protocol.jrt.Handler
         1        24        24   sun.nio.cs.UTF_8
         1        32        32   sun.nio.fs.MacOSXFileSystem
         1        16        16   sun.nio.fs.MacOSXFileSystemProvider
        11        32       352   sun.nio.fs.NativeBuffer
        11        24       264   sun.nio.fs.NativeBuffer$Deallocator
         1        16        16   sun.nio.fs.NativeBuffers$1
        20       128      2560   sun.nio.fs.UnixFileAttributes
        20        16       320   sun.nio.fs.UnixFileAttributes$UnixAsBasicFileAttributes
         2        32        64   sun.nio.fs.UnixFileKey
        32        32      1024   sun.nio.fs.UnixPath
         2        24        48   sun.reflect.generics.factory.CoreReflectionFactory
         3        32        96   sun.reflect.generics.reflectiveObjects.TypeVariableImpl
         2        32        64   sun.reflect.generics.repository.ClassRepository
         2        24        48   sun.reflect.generics.scope.ClassScope
         2        24        48   sun.reflect.generics.tree.ClassSignature
         5        16        80   sun.reflect.generics.tree.ClassTypeSignature
         3        24        72   sun.reflect.generics.tree.FormalTypeParameter
         5        24       120   sun.reflect.generics.tree.SimpleClassTypeSignature
         2        24        48   sun.security.provider.certpath.X509CertPath
         5        48       240   sun.security.rsa.RSAPublicKeyImpl
         1        32        32   sun.security.rsa.RSAUtil$KeyType
         5        24       120   sun.security.util.BitArray
        46        40      1840   sun.security.util.DerInputStream
        46        32      1472   sun.security.util.DerValue
        11        24       264   sun.security.util.LazyCodeSourcePermissionCollection
        88        32      2816   sun.security.util.ObjectIdentifier
        46        24      1104   sun.security.x509.AVA
         7        24       168   sun.security.x509.AccessDescription
        15        24       360   sun.security.x509.AlgorithmId
         4        32       128   sun.security.x509.AuthorityInfoAccessExtension
         5        40       200   sun.security.x509.AuthorityKeyIdentifierExtension
         5        32       160   sun.security.x509.BasicConstraintsExtension
         4        32       128   sun.security.x509.CRLDistributionPointsExtension
         5        16        80   sun.security.x509.CertificateAlgorithmId
         5        24       120   sun.security.x509.CertificateExtensions
         4        32       128   sun.security.x509.CertificatePoliciesExtension
         6        16        96   sun.security.x509.CertificatePolicyId
         5        16        80   sun.security.x509.CertificateSerialNumber
         5        24       120   sun.security.x509.CertificateValidity
         5        16        80   sun.security.x509.CertificateVersion
         5        16        80   sun.security.x509.CertificateX509Key
        13        16       208   sun.security.x509.DNSName
         6        32       192   sun.security.x509.DistributionPoint
         4        32       128   sun.security.x509.ExtendedKeyUsageExtension
        15        16       240   sun.security.x509.GeneralName
         8        16       128   sun.security.x509.GeneralNames
        10        16       160   sun.security.x509.KeyIdentifier
         5        32       160   sun.security.x509.KeyUsageExtension
         6        24       144   sun.security.x509.PolicyInformation
        46        24      1104   sun.security.x509.RDN
         5        16        80   sun.security.x509.SerialNumber
         2        32        64   sun.security.x509.SubjectAlternativeNameExtension
         5        32       160   sun.security.x509.SubjectKeyIdentifierExtension
        13        32       416   sun.security.x509.URIName
        12        48       576   sun.security.x509.X500Name
         5        80       400   sun.security.x509.X509CertImpl
         5        56       280   sun.security.x509.X509CertInfo
         1        16        16   sun.tools.attach.HotSpotVirtualMachine$$Lambda$41/0x0000000800142d50
  62247818           2281277024   (total)
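
A quick way to read the dump above is to look at the four largest rows. The sketch below does that arithmetic; the class name `FootprintBreakdown` is made up, the constants are copied from the table, and the per-key figure assumes that nearly every `String` in the graph is an n-gram key, which the dump itself does not prove.

```java
/** Back-of-the-envelope breakdown of the v1.1.1 footprint dump shown above. */
final class FootprintBreakdown {
    // SUM and COUNT values copied from the dump.
    static final long TOTAL_BYTES = 2_281_277_024L;        // (total)
    static final long BYTE_ARRAY_BYTES = 795_811_136L;     // [B
    static final long DOUBLE_ARRAY_BYTES = 491_934_544L;   // [D
    static final long OBJECT_ARRAY_BYTES = 246_001_872L;   // [Ljava.lang.Object;
    static final long STRING_BYTES = 746_832_072L;         // java.lang.String
    static final long STRING_COUNT = 31_118_003L;          // java.lang.String instances

    /** Share of the whole object graph, in percent. */
    static double shareOfTotal(long bytes) {
        return 100.0 * bytes / TOTAL_BYTES;
    }

    /** Bytes of the four dominant rows divided by the number of String instances. */
    static double bytesPerString() {
        long dominantRows =
                BYTE_ARRAY_BYTES + DOUBLE_ARRAY_BYTES + OBJECT_ARRAY_BYTES + STRING_BYTES;
        return (double) dominantRows / STRING_COUNT;
    }

    public static void main(String[] args) {
        System.out.printf("String + byte[] share: %.1f%%%n",
                shareOfTotal(STRING_BYTES + BYTE_ARRAY_BYTES));
        System.out.printf("four largest rows:     %.2f%%%n",
                shareOfTotal(BYTE_ARRAY_BYTES + DOUBLE_ARRAY_BYTES
                        + OBJECT_ARRAY_BYTES + STRING_BYTES));
        System.out.printf("bytes per String:      %.1f%n", bytesPerString());
    }
}
```

Under that assumption, the `String` objects and their backing `byte[]` arrays alone make up about two thirds of the footprint, which suggests that the key representation, more than the `double` values, is what makes the hash-map layout expensive.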

The proposed code change has the following footprint:

com.github.pemistahl.lingua.api.LanguageDetector@f0f2775d footprint:
     COUNT       AVG       SUM   DESCRIPTION
      2909       766   2230680   [B
      1875    137561 257928736   [C
      1875     66395 124492328   [F
        72      3780    272216   [I
         1        16        16   [J
       375        40     15000   [Lcom.github.pemistahl.lingua.internal.TrainingDataLanguageModel$RelativeFrequencies$Entries;
       210        31      6592   [Ljava.lang.Class;
        14       144      2016   [Ljava.lang.ClassValue$Entry;
       611        49     30120   [Ljava.lang.Object;
         1        24        24   [Ljava.lang.String;
         3        53       160   [Ljava.lang.Thread;
         1        32        32   [Ljava.lang.ThreadGroup;
        11        80       880   [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
         9        40       360   [Ljava.lang.invoke.BoundMethodHandle$SpeciesData;
       112        47      5264   [Ljava.lang.invoke.LambdaForm$Name;
         4        32       128   [Ljava.lang.invoke.LambdaFormEditor$Transform;
        29       205      5960   [Ljava.lang.invoke.MethodHandle;
       182        76     13832   [Ljava.lang.ref.SoftReference;
         1        24        24   [Ljava.lang.reflect.Constructor;
        40        43      1736   [Ljava.lang.reflect.Field;
         5        41       208   [Ljava.lang.reflect.Method;
         2        24        48   [Ljava.lang.reflect.TypeVariable;
         3        21        64   [Ljava.security.CodeSigner;
        11        16       176   [Ljava.security.Principal;
        11        24       264   [Ljava.security.ProtectionDomain;
         1        32        32   [Ljava.security.cert.Certificate;
        66       103      6808   [Ljava.util.HashMap$Node;
         2       304       608   [Ljava.util.Hashtable$Entry;
         1        32        32   [Ljava.util.Map$Entry;
        45        80      3600   [Ljava.util.WeakHashMap$Entry;
        31       296      9200   [Ljava.util.concurrent.ConcurrentHashMap$Node;
         1        16        16   [Lsun.instrument.TransformerManager$TransformerInfo;
        11        32       352   [Lsun.nio.fs.NativeBuffer;
         2        16        32   [Lsun.reflect.generics.tree.ClassTypeSignature;
         3        24        72   [Lsun.reflect.generics.tree.FieldTypeSignature;
         2        24        48   [Lsun.reflect.generics.tree.FormalTypeParameter;
         5        16        80   [Lsun.reflect.generics.tree.TypeArgument;
        46        24      1104   [Lsun.security.x509.AVA;
        12        32       392   [Lsun.security.x509.RDN;
         5        24       120   [Z
        75        24      1800   com.github.pemistahl.lingua.api.IsoCode639_1
        75        24      1800   com.github.pemistahl.lingua.api.IsoCode639_3
        75        40      3000   com.github.pemistahl.lingua.api.Language
         1        64        64   com.github.pemistahl.lingua.api.LanguageDetector
        18        24       432   com.github.pemistahl.lingua.internal.Alphabet
       375        32     12000   com.github.pemistahl.lingua.internal.TrainingDataLanguageModel
       375        16      6000   com.github.pemistahl.lingua.internal.TrainingDataLanguageModel$RelativeFrequencies
      1875        24     45000   com.github.pemistahl.lingua.internal.TrainingDataLanguageModel$RelativeFrequencies$Entries
         1        16        16   com.sun.management.internal.PlatformMBeanProviderImpl$$Lambda$31/0x00000008000e1c90
        21        32       672   java.io.File
         1        24        24   java.io.File$PathStatus
        22        56      1232   java.io.FileCleanable
        22        40       880   java.io.FileDescriptor
         2        32        64   java.io.FileInputStream
        20        32       640   java.io.RandomAccessFile
         2        16        32   java.lang.Boolean
        18        24       432   java.lang.Character$UnicodeScript
       135       192     25920   java.lang.Class
         1        16        16   java.lang.Class$$Lambda$44/0x00000008000f7370
        23        64      1472   java.lang.Class$ReflectionData
        14        64       896   java.lang.ClassValue$ClassValueMap
        15        32       480   java.lang.ClassValue$Entry
         1        16        16   java.lang.ClassValue$Identity
         1        24        24   java.lang.ClassValue$Version
        10        16       160   java.lang.Integer
         2        56       112   java.lang.Module
         1        16        16   java.lang.Module$$Lambda$60/0x000000080015d918
        43        16       688   java.lang.Object
         1        16        16   java.lang.ProcessHandleImpl$$Lambda$42/0x00000008000f58c0
         1        24        24   java.lang.ProcessHandleImpl$$Lambda$43/0x00000008000f5ae0
         2        32        64   java.lang.Runtime$Version
        10        24       240   java.lang.RuntimePermission
      2584        24     62016   java.lang.String
         1        16        16   java.lang.System$LoggerFinder$$Lambda$75/0x0000000800186750
        15       368      5520   java.lang.Thread
         3        48       144   java.lang.ThreadGroup
         2        16        32   java.lang.ThreadLocal
        11        24       264   java.lang.ThreadLocal$ThreadLocalMap
        26        32       832   java.lang.ThreadLocal$ThreadLocalMap$Entry
         1        56        56   java.lang.invoke.BoundMethodHandle$Specializer
         1        48        48   java.lang.invoke.BoundMethodHandle$Specializer$Factory
         9        48       432   java.lang.invoke.BoundMethodHandle$SpeciesData
        39        32      1248   java.lang.invoke.BoundMethodHandle$Species_L
         9        40       360   java.lang.invoke.BoundMethodHandle$Species_LJ
        25        40      1000   java.lang.invoke.BoundMethodHandle$Species_LL
         4        48       192   java.lang.invoke.BoundMethodHandle$Species_LLLL
         1        48        48   java.lang.invoke.BoundMethodHandle$Species_LLLLL
         1        56        56   java.lang.invoke.BoundMethodHandle$Species_LLLLLL
         3        56       168   java.lang.invoke.BoundMethodHandle$Species_LLLLLLL
        77        24      1848   java.lang.invoke.ConstantCallSite
        38        32      1216   java.lang.invoke.DirectMethodHandle
        30        40      1200   java.lang.invoke.DirectMethodHandle$Accessor
        20        40       800   java.lang.invoke.DirectMethodHandle$Constructor
        20        24       480   java.lang.invoke.Invokers
       112        48      5376   java.lang.invoke.LambdaForm
         4        32       128   java.lang.invoke.LambdaForm$BasicType
        12        32       384   java.lang.invoke.LambdaForm$Kind
       440        32     14080   java.lang.invoke.LambdaForm$Name
       140        24      3360   java.lang.invoke.LambdaForm$NamedFunction
        36        48      1728   java.lang.invoke.LambdaFormEditor$Transform
       294        48     14112   java.lang.invoke.MemberName
        77        32      2464   java.lang.invoke.MethodHandleNatives$CallSiteContext
       243        40      9720   java.lang.invoke.MethodType
        93        32      2976   java.lang.invoke.MethodTypeForm
       241        24      5784   java.lang.invoke.ResolvedMethodName
         1        16        16   java.lang.management.DefaultPlatformMBeanProvider$5$$Lambda$66/0x000000080014d618
         1        16        16   java.lang.management.ManagementFactory$$Lambda$30/0x00000008000de8a0
         1        16        16   java.lang.management.ManagementFactory$$Lambda$53/0x000000080017cec8
         1        16        16   java.lang.management.ManagementFactory$$Lambda$54/0x000000080017d118
         1        16        16   java.lang.management.ManagementFactory$$Lambda$55/0x000000080017d850
        62        64      3968   java.lang.module.ModuleDescriptor
         1        16        16   java.lang.module.ModuleDescriptor$Builder$$Lambda$59/0x000000080015d6e8
       362        24      8688   java.lang.module.ModuleDescriptor$Exports
         4        24        96   java.lang.module.ModuleDescriptor$Opens
        60        24      1440   java.lang.module.ModuleDescriptor$Provides
       132        32      4224   java.lang.module.ModuleDescriptor$Requires
         2        24        48   java.lang.module.ModuleDescriptor$Requires$Modifier
         1        32        32   java.lang.module.ModuleDescriptor$Version
         1        16        16   java.lang.ref.Cleaner
         1       376       376   java.lang.ref.Finalizer$FinalizerThread
         1       368       368   java.lang.ref.Reference$ReferenceHandler
        46        32      1472   java.lang.ref.ReferenceQueue
        47        16       752   java.lang.ref.ReferenceQueue$Lock
         1        32        32   java.lang.ref.ReferenceQueue$Null
       125        40      5000   java.lang.ref.SoftReference
         1        72        72   java.lang.reflect.Constructor
       258        72     18576   java.lang.reflect.Field
        20        88      1760   java.lang.reflect.Method
         1        16        16   java.lang.reflect.Proxy$$Lambda$57/0x000000080015ce68
         1        16        16   java.lang.reflect.Proxy$ProxyBuilder$$Lambda$58/0x000000080015d2a8
         1        16        16   java.lang.reflect.ProxyGenerator$$Lambda$63/0x000000080015e4e0
         1        16        16   java.lang.reflect.ProxyGenerator$$Lambda$64/0x000000080015e720
        15        40       600   java.math.BigInteger
        75        80      6000   java.net.URI
        45        64      2880   java.net.URL
         2        64       128   java.nio.DirectByteBuffer
         1        56        56   java.nio.HeapByteBuffer
        18        40       720   java.security.AccessControlContext
        10        24       240   java.security.BasicPermissionCollection
         1        24        24   java.security.CodeSigner
        10        40       400   java.security.CodeSource
        10        24       240   java.security.Permissions
        11        40       440   java.security.ProtectionDomain
        11        16       176   java.security.ProtectionDomain$Key
        10        16       160   java.security.SecureClassLoader$CodeSourceKey
         1        24        24   java.security.Timestamp
         6        32       192   java.security.cert.PolicyQualifierInfo
        32        24       768   java.util.ArrayDeque
        34        24       816   java.util.ArrayList
         1        16        16   java.util.Collections$EmptyList
         1        16        16   java.util.Collections$EmptySet
        42        24      1008   java.util.Collections$SetFromMap
        74        16      1184   java.util.Collections$SingletonSet
         5        32       160   java.util.Collections$SynchronizedMap
         1        32        32   java.util.Collections$UnmodifiableMap
         2        24        48   java.util.Collections$UnmodifiableRandomAccessList
         1        16        16   java.util.Collections$UnmodifiableSet
        11        24       264   java.util.Date
        12        48       576   java.util.HashMap
         2        16        32   java.util.HashMap$KeySet
       141        32      4512   java.util.HashMap$Node
         2        16        32   java.util.HashSet
         2        48        96   java.util.Hashtable
        44        32      1408   java.util.Hashtable$Entry
        11        40       440   java.util.IdentityHashMap
        11        16       176   java.util.IdentityHashMap$KeySet
        68        24      1632   java.util.ImmutableCollections$List12
        32        24       768   java.util.ImmutableCollections$ListN
       201        24      4824   java.util.ImmutableCollections$Set12
       114        24      2736   java.util.ImmutableCollections$SetN
        61        56      3416   java.util.LinkedHashMap
       539        40     21560   java.util.LinkedHashMap$Entry
        32        16       512   java.util.LinkedHashMap$LinkedEntrySet
         1        16        16   java.util.LinkedHashMap$LinkedKeySet
         6        16        96   java.util.LinkedHashSet
         1        16        16   java.util.Optional
         5        48       240   java.util.TreeMap
        38        40      1520   java.util.TreeMap$Entry
         4        32       128   java.util.Vector
        31        48      1488   java.util.WeakHashMap
        14        40       560   java.util.WeakHashMap$Entry
        31        16       496   java.util.WeakHashMap$KeySet
        38        64      2432   java.util.concurrent.ConcurrentHashMap
      1302        32     41664   java.util.concurrent.ConcurrentHashMap$Node
         3        16        48   java.util.concurrent.ConcurrentHashMap$ValuesView
         1        24        24   java.util.concurrent.Executors$DefaultThreadFactory
         1        48        48   java.util.concurrent.LinkedBlockingQueue
         1        24        24   java.util.concurrent.LinkedBlockingQueue$Node
         1        32        32   java.util.concurrent.SynchronousQueue
         1        16        16   java.util.concurrent.SynchronousQueue$TransferStack
         1        32        32   java.util.concurrent.SynchronousQueue$TransferStack$SNode
         2        72       144   java.util.concurrent.ThreadPoolExecutor
         1        16        16   java.util.concurrent.ThreadPoolExecutor$AbortPolicy
        11        48       528   java.util.concurrent.ThreadPoolExecutor$Worker
         4        16        64   java.util.concurrent.atomic.AtomicInteger
        10        32       320   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode
         4        24        96   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject
         1        32        32   java.util.concurrent.locks.AbstractQueuedSynchronizer$ExclusiveNode
         4        16        64   java.util.concurrent.locks.ReentrantLock
         4        32       128   java.util.concurrent.locks.ReentrantLock$NonfairSync
         1        16        16   java.util.function.Function$$Lambda$67/0x000000080014d858
        49        16       784   java.util.jar.Attributes
        16        24       384   java.util.jar.Attributes$Name
        20        56      1120   java.util.jar.JarFile
        31       104      3224   java.util.jar.JarFile$JarFileEntry
         1        88        88   java.util.jar.JarVerifier
         6        24       144   java.util.jar.Manifest
         1        16        16   java.util.logging.Level$$Lambda$73/0x00000008001855c0
         1        16        16   java.util.logging.Level$KnownLevel$$Lambda$71/0x0000000800180638
         1        16        16   java.util.logging.Level$KnownLevel$$Lambda$72/0x0000000800180878
         1        16        16   java.util.logging.Level$KnownLevel$$Lambda$74/0x0000000800185800
         1        16        16   java.util.regex.CharPredicates$$Lambda$27/0x00000008000da2a0
         1        16        16   java.util.regex.CharPredicates$$Lambda$39/0x00000008000f32a8
         1        16        16   java.util.stream.Collectors$$Lambda$36/0x00000008000e9038
         1        16        16   java.util.stream.Collectors$$Lambda$37/0x00000008000e9258
         1        16        16   java.util.stream.Collectors$$Lambda$38/0x00000008000e9488
         1        16        16   java.util.stream.Collectors$$Lambda$46/0x00000008000f77d8
         1        16        16   java.util.stream.Collectors$$Lambda$47/0x00000008000f7a08
         1        16        16   java.util.stream.Collectors$$Lambda$48/0x00000008000f7c50
         1        16        16   java.util.stream.Collectors$$Lambda$68/0x000000080014da98
         1        16        16   java.util.stream.Collectors$$Lambda$70/0x000000080014def0
        39        64      2496   java.util.zip.Inflater
        39        24       936   java.util.zip.Inflater$InflaterZStreamRef
         1        32        32   java.util.zip.ZipCoder$UTF8ZipCoder
        31        32       992   java.util.zip.ZipFile$CleanableResource
        20        80      1600   java.util.zip.ZipFile$Source
        20        24       480   java.util.zip.ZipFile$Source$Key
        10        16       160   javax.security.auth.x500.X500Principal
         1        16        16   jdk.internal.jimage.ImageBufferCache$1
         1       104       104   jdk.internal.loader.ClassLoaders$AppClassLoader
         1       104       104   jdk.internal.loader.ClassLoaders$BootClassLoader
         1       104       104   jdk.internal.loader.ClassLoaders$PlatformClassLoader
         1        40        40   jdk.internal.loader.URLClassPath
         1        24        24   jdk.internal.loader.URLClassPath$FileLoader
        20        48       960   jdk.internal.loader.URLClassPath$JarLoader
         1       376       376   jdk.internal.misc.InnocuousThread
         1        16        16   jdk.internal.misc.TerminatingThreadLocal$1
         1        24        24   jdk.internal.module.ModuleHashes
        62        56      3472   jdk.internal.module.ModuleReferenceImpl
         1        16        16   jdk.internal.module.ModuleTarget
        62        24      1488   jdk.internal.module.SystemModuleFinders$2
        60        16       960   jdk.internal.module.SystemModuleFinders$3
        62        24      1488   jdk.internal.module.SystemModuleFinders$SystemModuleReader
         1        16        16   jdk.internal.perf.Perf
         1        24        24   jdk.internal.perf.Perf$CleanerAction
         1        24        24   jdk.internal.ref.CleanerImpl
         1        40        40   jdk.internal.ref.CleanerImpl$CleanerCleanable
       160        48      7680   jdk.internal.ref.CleanerImpl$PhantomCleanableRef
         1        16        16   jdk.jfr.internal.dcmd.DCmdCheck$$Lambda$79/0x000000080018d4d8
         1        16        16   jdk.jfr.internal.dcmd.DCmdDump$$Lambda$78/0x000000080018d2b8
         1        16        16   jdk.jfr.internal.dcmd.DCmdStart$$Lambda$77/0x000000080018d098
         1        16        16   jdk.jfr.internal.dcmd.DCmdStop$$Lambda$76/0x000000080018a688
         1        16        16   kotlin.collections.CollectionsKt___CollectionsKt$asSequence$$inlined$Sequence$1
         1        16        16   kotlin.collections.EmptyMap
         1        24        24   org.openjdk.jol.info.AbstractGraphWalker$ReferenceFieldsClassValue
         1        96        96   org.openjdk.jol.vm.HotspotUnsafe
         1        24        24   org.openjdk.jol.vm.HotspotUnsafe$1
         1        48        48   org.openjdk.jol.vm.HotspotUnsafe$Sizes
         1        32        32   sun.instrument.InstrumentationImpl
         1        24        24   sun.instrument.TransformerManager
         4        56       224   sun.invoke.util.Wrapper
         1        16        16   sun.management.ManagementFactoryHelper$$Lambda$80/0x0000000800191028
         1        16        16   sun.misc.Unsafe
         1        16        16   sun.net.www.protocol.file.Handler
         1        16        16   sun.net.www.protocol.jar.Handler
         1        16        16   sun.net.www.protocol.jar.JarFileFactory
        11        72       792   sun.net.www.protocol.jar.URLJarFile
         1        16        16   sun.net.www.protocol.jrt.Handler
         1        24        24   sun.nio.cs.UTF_8
         1        32        32   sun.nio.fs.MacOSXFileSystem
         1        16        16   sun.nio.fs.MacOSXFileSystemProvider
        11        32       352   sun.nio.fs.NativeBuffer
        11        24       264   sun.nio.fs.NativeBuffer$Deallocator
         1        16        16   sun.nio.fs.NativeBuffers$1
        20       128      2560   sun.nio.fs.UnixFileAttributes
        20        16       320   sun.nio.fs.UnixFileAttributes$UnixAsBasicFileAttributes
         2        32        64   sun.nio.fs.UnixFileKey
        32        32      1024   sun.nio.fs.UnixPath
         2        24        48   sun.reflect.generics.factory.CoreReflectionFactory
         3        32        96   sun.reflect.generics.reflectiveObjects.TypeVariableImpl
         2        32        64   sun.reflect.generics.repository.ClassRepository
         2        24        48   sun.reflect.generics.scope.ClassScope
         2        24        48   sun.reflect.generics.tree.ClassSignature
         5        16        80   sun.reflect.generics.tree.ClassTypeSignature
         3        24        72   sun.reflect.generics.tree.FormalTypeParameter
         5        24       120   sun.reflect.generics.tree.SimpleClassTypeSignature
         2        24        48   sun.security.provider.certpath.X509CertPath
         5        48       240   sun.security.rsa.RSAPublicKeyImpl
         1        32        32   sun.security.rsa.RSAUtil$KeyType
         5        24       120   sun.security.util.BitArray
        46        40      1840   sun.security.util.DerInputStream
        46        32      1472   sun.security.util.DerValue
        10        24       240   sun.security.util.LazyCodeSourcePermissionCollection
        88        32      2816   sun.security.util.ObjectIdentifier
        46        24      1104   sun.security.x509.AVA
         7        24       168   sun.security.x509.AccessDescription
        15        24       360   sun.security.x509.AlgorithmId
         4        32       128   sun.security.x509.AuthorityInfoAccessExtension
         5        40       200   sun.security.x509.AuthorityKeyIdentifierExtension
         5        32       160   sun.security.x509.BasicConstraintsExtension
         4        32       128   sun.security.x509.CRLDistributionPointsExtension
         5        16        80   sun.security.x509.CertificateAlgorithmId
         5        24       120   sun.security.x509.CertificateExtensions
         4        32       128   sun.security.x509.CertificatePoliciesExtension
         6        16        96   sun.security.x509.CertificatePolicyId
         5        16        80   sun.security.x509.CertificateSerialNumber
         5        24       120   sun.security.x509.CertificateValidity
         5        16        80   sun.security.x509.CertificateVersion
         5        16        80   sun.security.x509.CertificateX509Key
        13        16       208   sun.security.x509.DNSName
         6        32       192   sun.security.x509.DistributionPoint
         4        32       128   sun.security.x509.ExtendedKeyUsageExtension
        15        16       240   sun.security.x509.GeneralName
         8        16       128   sun.security.x509.GeneralNames
        10        16       160   sun.security.x509.KeyIdentifier
         5        32       160   sun.security.x509.KeyUsageExtension
         6        24       144   sun.security.x509.PolicyInformation
        46        24      1104   sun.security.x509.RDN
         5        16        80   sun.security.x509.SerialNumber
         2        32        64   sun.security.x509.SubjectAlternativeNameExtension
         5        32       160   sun.security.x509.SubjectKeyIdentifierExtension
        13        32       416   sun.security.x509.URIName
        12        48       576   sun.security.x509.X500Name
         5        80       400   sun.security.x509.X509CertImpl
         5        56       280   sun.security.x509.X509CertInfo
         1        16        16   sun.tools.attach.HotSpotVirtualMachine$$Lambda$41/0x0000000800140720
     22192           385479016   (total)
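
To put the two `(total)` rows side by side, here is a small arithmetic sketch. It assumes the first dump is the baseline and the second is the proposed change; the class and method names are made up for illustration, and only the four constants come from the dumps above.

```java
import java.util.Locale;

class FootprintComparison {
    // Totals copied from the "(total)" rows of the two footprint dumps.
    static final long BASELINE_BYTES = 2_281_277_024L;
    static final long PROPOSED_BYTES = 385_479_016L;
    static final long BASELINE_OBJECTS = 62_247_818L;
    static final long PROPOSED_OBJECTS = 22_192L;

    /** Percentage by which a quantity shrinks when going from before to after. */
    static double reductionPercent(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    /** Converts a byte count to mebibytes. */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT,
                "retained heap: %.1f MiB -> %.1f MiB (%.1f%% smaller)",
                toMiB(BASELINE_BYTES), toMiB(PROPOSED_BYTES),
                reductionPercent(BASELINE_BYTES, PROPOSED_BYTES)));
        System.out.println(String.format(Locale.ROOT,
                "object count:  %d -> %d (%.2f%% fewer)",
                BASELINE_OBJECTS, PROPOSED_OBJECTS,
                reductionPercent(BASELINE_OBJECTS, PROPOSED_OBJECTS)));
    }
}
```

In the proposed footprint, the 1,875 `[C` and 1,875 `[F` arrays together account for roughly 99% of the retained bytes, so almost all of the remaining memory is the packed n-gram keys and their frequencies.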

fvasco commented 2 years ago

@sigpwned, @Marcono1234, thank you for your feedback; it is really appreciated.

I changed my code to try to gain some performance, but I fear that this version requires a bit more memory. Please let me know what you think.

fvasco commented 2 years ago

I redesigned the frequency data as an n-ary search tree; the memory requirement is now roughly 1.2GB. In my tests, this version is faster than 1.1.1 and introduces no breaking changes.

Version v1.1.1

Benchmark                Mode  Cnt  Score   Error  Units
DetectBenchmark.detect  thrpt    4  0.051 ± 0.002  ops/s

Version v1.1.1-5-g2d80e26

Benchmark                Mode  Cnt  Score   Error  Units
DetectBenchmark.detect  thrpt    4  0.063 ± 0.007  ops/s
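
As a rough reading of the two result rows, the sketch below computes the relative speedup and checks whether the reported `Score ± Error` ranges overlap. The class and method names are hypothetical; only the four numbers come from the output above, and with `Cnt` = 4 the result should be treated as indicative rather than conclusive.

```java
import java.util.Locale;

class ThroughputComparison {
    /** Ratio of candidate throughput to baseline throughput (above 1.0 means faster). */
    static double speedup(double baselineOps, double candidateOps) {
        return candidateOps / baselineOps;
    }

    /** True if the ranges [scoreA - errA, scoreA + errA] and [scoreB - errB, scoreB + errB] intersect. */
    static boolean intervalsOverlap(double scoreA, double errA, double scoreB, double errB) {
        return scoreA - errA <= scoreB + errB && scoreB - errB <= scoreA + errA;
    }

    public static void main(String[] args) {
        double baseline = 0.051, baselineErr = 0.002;   // v1.1.1
        double candidate = 0.063, candidateErr = 0.007; // v1.1.1-5-g2d80e26

        System.out.println(String.format(Locale.ROOT,
                "speedup: %.2fx, seconds per op: %.1f -> %.1f, ranges overlap: %b",
                speedup(baseline, candidate),
                1.0 / baseline, 1.0 / candidate,
                intervalsOverlap(baseline, baselineErr, candidate, candidateErr)));
    }
}
```
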

fvasco commented 2 years ago

I reduced the memory requirement to 440MB. This version should be faster than the original in both preload and execution time.

fvasco commented 2 years ago

Thank you for your consideration, @pemistahl. I continued the work on the main branch of my fork: https://github.com/fvasco/lingua

I reinstated the language cache as a thread-safe one and removed the thread pool, because it does not really improve performance but requires special care, such as an explicit destructor. In my experience, a thread pool like that brings little benefit on the server side.
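
A minimal sketch of what a thread-safe, lazily filled cache can look like on the JVM, using `ConcurrentHashMap.computeIfAbsent`. This is not the code from the fork; `LazyModelCache` and the loader function are hypothetical names used only for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache: loads each value at most once, on first access, from any thread. */
class LazyModelCache<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    LazyModelCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, invoking the loader only if the key is not present yet. */
    V get(K key) {
        // computeIfAbsent is atomic per key: concurrent callers block until the
        // first load finishes and then all observe the same instance.
        return cache.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        // Stand-in loader; a real one would read a language model from the classpath.
        LazyModelCache<String, float[]> models =
                new LazyModelCache<>(language -> new float[] {0.5f, 0.25f});
        System.out.println(models.get("en") == models.get("en")); // same cached instance
    }
}
```

Because loading happens on the caller's thread, nothing needs to be shut down explicitly, which is the maintenance cost of the thread pool mentioned above.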

The big open problem remains data set loading. I suppose that using the (ugly) Java serialization format instead of JSON could noticeably improve startup time without any downside, since all memory optimization can be performed at training time. In addition, Lingua could drop the kotlinx.serialization dependency.
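
To illustrate the idea, here is a sketch of a pre-optimized model written and read back with standard Java serialization. `CompactModel` and its layout (sorted n-grams with a parallel `float[]`) are assumptions made for the example, not Lingua's actual data format, and whether this loads faster than the JSON files would still need to be measured.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;

class ModelSerializationSketch {
    /** Hypothetical compact model: n-grams sorted ascending, frequencies at the same index. */
    static final class CompactModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final String[] ngrams;
        final float[] frequencies;

        CompactModel(String[] sortedNgrams, float[] frequencies) {
            this.ngrams = sortedNgrams;
            this.frequencies = frequencies;
        }

        /** Binary search over the sorted n-grams; unknown n-grams yield 0. */
        float frequencyOf(String ngram) {
            int index = Arrays.binarySearch(ngrams, ngram);
            return index >= 0 ? frequencies[index] : 0f;
        }
    }

    /** Would run once at training time, producing the binary resource. */
    static byte[] write(CompactModel model) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Would run at startup: the arrays come back already in their final in-memory form. */
    static CompactModel read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (CompactModel) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CompactModel model = new CompactModel(new String[] {"a", "ab", "b"},
                new float[] {0.5f, 0.25f, 0.125f});
        CompactModel restored = read(write(model));
        System.out.println(restored.frequencyOf("ab"));
    }
}
```
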

Please contact me if you need further info.

pemistahl commented 2 years ago

Hi @fvasco and @sigpwned, I'm very sorry for my late response. Thanks a lot for your hard work to reduce the memory footprint of the library. I've been busy doing other things recently, so I could not evaluate your changes yet.

Despite all your efforts, I want to be honest with you. In particular, your code in the file TrainingDataLanguageModel.kt is hard to grasp. I'm sure that I don't want to maintain this code in the future. So I most probably won't merge it, at least not in its current state. However, as soon as I find the time, I will evaluate your changes in more detail and perhaps I will change my mind.

The JVM has always been very memory-hungry; this is no surprise. If your requirements are so special that memory is an issue for you, you should perhaps switch to a more efficient language such as Go or Rust. You've certainly discovered my implementations of Lingua in these two other languages. But if you are bound to the JVM, perhaps it's better to use your modified version of the library exclusively in your own projects.

If you plan to continue working on this PR, could you please work against the branch v1.2.0-wip instead of main? This branch contains some other important changes that are not yet included in main.

Thanks again for your work. I really appreciate it.

fvasco commented 2 years ago

Thank you for the suggestions, @pemistahl. I merged your commits and @Marcono1234's proposals, improved initialization time, and reduced the memory requirement for all models to below 335MB.

Accuracy is comparable to your version, so I will consider using my fork.

By the way, TrainingDataLanguageModel implements a simple search tree. The root is the empty ngram, the first level contains the unigrams, the second level the bigrams, and so on. I generated some specialized implementations (nodes with 1, 2, 3, ..., 7 children) to improve performance. MutableNode is a temporary class that holds the data before the tree is optimized.
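
The structure described above can be sketched as follows. This is a simplified illustration, not the code in TrainingDataLanguageModel.kt: it uses one generic immutable node type with sorted child keys instead of the specialized 1-to-7-children variants, and all names except `MutableNode` are made up.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

class NgramTreeSketch {
    /** Temporary node used while loading data; converted to a compact Node afterwards. */
    static final class MutableNode {
        float frequency;
        final TreeMap<Character, MutableNode> children = new TreeMap<>();

        /** Walks or creates one level per character and stores the frequency at the end. */
        void put(String ngram, float freq) {
            MutableNode node = this;
            for (int i = 0; i < ngram.length(); i++) {
                node = node.children.computeIfAbsent(ngram.charAt(i), c -> new MutableNode());
            }
            node.frequency = freq;
        }

        /** Produces the immutable tree; TreeMap iteration keeps the keys sorted. */
        Node freeze() {
            char[] keys = new char[children.size()];
            Node[] kids = new Node[children.size()];
            int i = 0;
            for (Map.Entry<Character, MutableNode> entry : children.entrySet()) {
                keys[i] = entry.getKey();
                kids[i] = entry.getValue().freeze();
                i++;
            }
            return new Node(frequency, keys, kids);
        }
    }

    /** Compact node: the root is the empty ngram, depth 1 holds unigrams, depth 2 bigrams, ... */
    static final class Node {
        final float frequency;
        final char[] keys;      // sorted characters leading to the children
        final Node[] children;  // children[i] is reached via keys[i]

        Node(float frequency, char[] keys, Node[] children) {
            this.frequency = frequency;
            this.keys = keys;
            this.children = children;
        }

        /** Descends one level per character; returns 0 for ngrams that are not stored. */
        float frequencyOf(String ngram) {
            Node node = this;
            for (int i = 0; i < ngram.length(); i++) {
                int index = Arrays.binarySearch(node.keys, ngram.charAt(i));
                if (index < 0) {
                    return 0f;
                }
                node = node.children[index];
            }
            return node.frequency;
        }
    }

    public static void main(String[] args) {
        MutableNode builder = new MutableNode();
        builder.put("a", 0.5f);
        builder.put("ab", 0.25f);
        builder.put("b", 0.125f);
        Node root = builder.freeze();
        System.out.println(root.frequencyOf("ab"));
    }
}
```

A lookup touches at most one node per character of the ngram, and ngrams sharing a prefix share the nodes for that prefix, which is where the memory saving over one map entry per ngram comes from.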

How much memory can we recover by using the Rust or Go implementations? What about their performance?