Closed fvasco closed 2 years ago
Thanks for your PR, Francesco. I will evaluate your changes soon and let you know whether I want to merge them or not. Please be a little patient. My todo list is growing faster than I expected (which is a good thing :).
Happy to hear that, @pemistahl. Please consider that this version preserves the JSON data files, which are really inefficient. It would be much better to use a binary data source, which loads much faster and requires less memory (as already described here).
Hello! I realize I'm dropping in from nowhere, but I'm working on a project that involves language detection and saw this work was evolving in real time, had a little bit of spare time on a Sunday afternoon, and thought I might be able to add some value. So here I am!
I was a software performance analyst in a past life, and helping think through changes like this was an important part of that role. When optimizing programs, you quickly learn that the first thing you need is data. It turns out that the Pareto principle applies to software performance: 90% of time is spent in 10% of code. It also turns out that humans -- even experts who have made a career out of optimizing programs all day long -- are very bad at guessing which 10% matters! So if we don't measure, it's really hard to know if the work we do is having the desired effect.
To that end, I put together this very simple benchmark of lingua performance using JMH. There's a little bit of test harness to unpack in that repository, but most of the work is here:
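import static java.util.stream.Collectors.toList;

import com.github.pemistahl.lingua.api.Language;
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;
// Assumption: Resources is Guava's com.google.common.io.Resources.
import com.google.common.io.Resources;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.zip.GZIPInputStream;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@Fork(value = 3)
@OutputTimeUnit(TimeUnit.SECONDS)
@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class DetectBenchmark {
    public LanguageDetector detector;

    /**
     * Contains exactly 1MB of "random" data sampled from Twitter streaming API. Visually confirmed to
     * be multi-language.
     */
    public List<String> lines;

    @Setup
    public void setupDetectBenchmark() throws IOException {
        // Build the detector once, with all models loaded up front, so model loading is not timed.
        detector = LanguageDetectorBuilder.fromAllLanguages().withPreloadedLanguageModels().build();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Resources.getResource("tweets.txt.gz").openStream()),
                StandardCharsets.UTF_8))) {
            lines = in.lines().collect(toList());
        }
    }

    /*
     * @formatter:off
     *
     * As of 2022-03-27:
     *
     * Benchmark                  Mode   Cnt    Score   Error  Units
     * EmojiJavaBenchmark.tweets  thrpt   15  105.708 ± 0.401  ops/s
     *
     * @formatter:on
     */
    @Benchmark
    public void detect(Blackhole blackhole) {
        // One operation = detecting the language of every line in the sample.
        Set<Language> languages = EnumSet.noneOf(Language.class);
        for (String line : lines)
            languages.add(detector.detectLanguageOf(line));
        // Hand the result to JMH so the JIT cannot eliminate the loop as dead code.
        blackhole.consume(languages);
    }
}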
Essentially, the benchmark initializes by creating a language detector for all languages with preloaded models and loading about 20K tweets pulled from social media. The benchmark itself then detects language for each tweet, over and over, as fast as it can. The JMH framework then reports performance characteristics for the benchmark. I ran the benchmark on the current code, and then on the proposed change. To be fully transparent, I did have to make a couple of (cosmetic) changes to the PR to get the benchmark to work, and you can review those changes here. (@fvasco, if you got any notifications about that work, then I apologize for the distraction, it is entirely because I have fat fingers.)
Here is the performance of the current code:
Benchmark                                    Mode  Cnt           Score         Error   Units
DetectBenchmark.detect                      thrpt    9           0.197 ±       0.005   ops/s
DetectBenchmark.detect:·gc.alloc.rate       thrpt    9        1355.646 ±      43.312  MB/sec
DetectBenchmark.detect:·gc.alloc.rate.norm  thrpt    9  7525093881.037 ±  366615.998    B/op
Here is the performance of the current code with the proposed change:
Benchmark                                    Mode  Cnt           Score            Error   Units
DetectBenchmark.detect                      thrpt    9           0.151 ±          0.002   ops/s
DetectBenchmark.detect:·gc.alloc.rate       thrpt    9         928.336 ±         88.560  MB/sec
DetectBenchmark.detect:·gc.alloc.rate.norm  thrpt    9  6688520282.222 ±  659955119.790    B/op
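The net results here are: throughput dropped from 0.197 to 0.151 ops/s (about 23% lower), the allocation rate dropped from 1355.646 to 928.336 MB/sec (about 32% lower), and the normalized allocation dropped from about 7.53 GB to about 6.69 GB per operation (about 11% lower, though with a wide error margin on the second run).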
Software optimization is hard. Everything is connected to everything else, often in unpredictable ways. In this case, it looks like the change has made a nice reduction in memory usage as well as an unfortunate reduction in wall clock performance. I suspect that someone could quickly figure out what is going on with some light profiling in a nice, free tool like visualvm and make some quick performance improvements. I suspect the tool could also inform any memory usage planning, too. I would be happy to help, if that's useful.
I do apologize for popping in here from nowhere. I hope it's not rude. I certainly did not pop in to criticize work (to the contrary, this is good stuff, @fvasco!), or to tell anyone what to do or not to do (@pemistahl, it's your show!). However, it's rare that I find my background to be relevant to work going on in real time, and I couldn't resist the urge to poke my head in. If this is unwelcome, then please do tell me to buzz off, and I shall!
Hi, @sigpwned, thank you for your engagement. I am a random visitor like you.
I suspect that you measured the memory allocation rate for each run. My goal is to reduce the memory requirements: unfortunately, this library requires too much RAM, so we chose another one.
I expected a performance drop, but 23% is a significant one, so I have decided to rework this PR to try to improve performance.
If your environment is still set up, could you evaluate this PR again?
Thank you in advance.
> I suspect that you measured the memory allocation rate for each run. My goal is to reduce the memory requirements: unfortunately, this library requires too much RAM, so we chose another one.
Ah! You are correct, and that is an important difference. I think I can help there too. Hopefully I'll come back in a bit with some more information on that.
> If your environment is still set up, could you evaluate this PR again?
I would be very happy to! If it's useful, I'd also be happy to spend a little time in a profiler to see if I can't identify where we're losing performance.
If performance (whether wall clock or memory) is becoming a major focus, then I think it might also make sense to integrate some basic benchmarks into PR checks. @pemistahl, does that sound interesting or useful? If so, I'd be happy to take a swing at it in a separate PR!
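As a rough illustration of what such a PR check might compare, here is a minimal sketch that flags a throughput regression only when the candidate's error interval lies entirely below the baseline's and the relative drop exceeds a tolerance. The class name `RegressionGate`, the `Measurement` type, and the 5% tolerance are all hypothetical choices made for this example; the numbers are the throughput scores and errors from the two JMH tables earlier in this thread.

```java
public class RegressionGate {
    /** A JMH-style result: a score plus the half-width of its error interval. */
    static final class Measurement {
        final double score;
        final double error;

        Measurement(double score, double error) {
            this.score = score;
            this.error = error;
        }

        double low() { return score - error; }
        double high() { return score + error; }
    }

    /** Relative change from baseline to candidate, as a fraction of the baseline score. */
    static double relativeChange(Measurement baseline, Measurement candidate) {
        return (candidate.score - baseline.score) / baseline.score;
    }

    /**
     * Returns true when the candidate's interval is entirely below the baseline's
     * interval AND the relative drop is larger than the allowed tolerance.
     */
    static boolean isThroughputRegression(Measurement baseline, Measurement candidate,
                                          double tolerance) {
        boolean intervalsSeparated = candidate.high() < baseline.low();
        boolean dropTooLarge = -relativeChange(baseline, candidate) > tolerance;
        return intervalsSeparated && dropTooLarge;
    }

    public static void main(String[] args) {
        // Scores and errors taken from the two benchmark runs reported above.
        Measurement current = new Measurement(0.197, 0.005);
        Measurement proposed = new Measurement(0.151, 0.002);
        // Hypothetical policy: tolerate up to a 5% drop in throughput.
        System.out.println(isThroughputRegression(current, proposed, 0.05));
    }
}
```

Requiring both conditions is a design choice: the interval test guards against flagging noise, and the tolerance keeps small but statistically clean differences from failing a build.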
OK! Using jol, I was able to pull object graph memory footprints. Short version here, long version below.
For an object:
LanguageDetector detector =
LanguageDetectorBuilder.fromAllLanguages().withPreloadedLanguageModels().build();
The object graph memory footprints are as follows:
v1.1.1 in maven -- 2,281,277,024 B ≈ 2.28GB
this PR -- 385,479,016 B ≈ 385MB
So substantially reduced memory footprint, roughly 380MB as advertised. Huge improvement, @fvasco!
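To give a feel for why a change like this can shrink the footprint, note that the dump of the original code below is dominated by roughly 31 million `java.lang.String` objects and their `[B` byte arrays, while the dump of the proposed change is dominated by a small number of large `[C` (char) and `[F` (float) arrays. The following is a minimal, hypothetical sketch of that general idea, not the PR's actual implementation: fixed-length n-grams are packed back to back into one sorted `char[]`, their values sit in a parallel `float[]`, and lookups use binary search. The class name `PackedNgramTable` and its methods are invented for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: n-grams of one fixed length stored without per-key objects. */
public class PackedNgramTable {
    private final int ngramLength;
    private final char[] keys;    // sorted n-grams, concatenated back to back
    private final float[] values; // values[i] belongs to the i-th n-gram in keys

    private PackedNgramTable(int ngramLength, char[] keys, float[] values) {
        this.ngramLength = ngramLength;
        this.keys = keys;
        this.values = values;
    }

    /** Builds the packed table from an ordinary map; every key must have the same length. */
    static PackedNgramTable fromMap(int ngramLength, Map<String, Float> frequencies) {
        String[] sorted = frequencies.keySet().toArray(new String[0]);
        Arrays.sort(sorted); // lexicographic by UTF-16 code unit, matching compareAt below
        char[] keys = new char[sorted.length * ngramLength];
        float[] values = new float[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i].length() != ngramLength) {
                throw new IllegalArgumentException("unexpected n-gram length: " + sorted[i]);
            }
            sorted[i].getChars(0, ngramLength, keys, i * ngramLength);
            values[i] = frequencies.get(sorted[i]);
        }
        return new PackedNgramTable(ngramLength, keys, values);
    }

    /** Returns the stored value, or 0 if the n-gram is absent. */
    float get(String ngram) {
        if (ngram.length() != ngramLength) return 0f;
        int lo = 0;
        int hi = values.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareAt(mid, ngram);
            if (cmp == 0) return values[mid];
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return 0f;
    }

    /** Compares the n-gram stored at the given slot with the probe, char by char. */
    private int compareAt(int index, String ngram) {
        int base = index * ngramLength;
        for (int i = 0; i < ngramLength; i++) {
            int diff = keys[base + i] - ngram.charAt(i);
            if (diff != 0) return diff;
        }
        return 0;
    }

    public static void main(String[] args) {
        Map<String, Float> bigrams = new HashMap<>();
        bigrams.put("th", 0.5f);
        bigrams.put("he", 0.25f);
        bigrams.put("an", 0.125f);
        PackedNgramTable table = PackedNgramTable.fromMap(2, bigrams);
        System.out.println(table.get("he"));
        System.out.println(table.get("zz"));
    }
}
```

A layout like this trades a hash lookup for a binary search, which could be one contributor to a throughput drop; only profiling would show whether that is what happens in this PR.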
I've updated the code in the benchmark repo in case anyone is curious. The memory usage measurement is a little ticklish to run because it wants elevated permissions, but it should at least be pretty clear what's going on.
Full dumps are included below, in case anyone is curious.
The original code has the following footprint:
com.github.pemistahl.lingua.api.LanguageDetector@f0f2775d footprint:
COUNT AVG SUM DESCRIPTION
31118328 25 795811136 [B
375 1311825 491934544 [D
72 3780 272160 [I
1 16 16 [J
210 31 6592 [Ljava.lang.Class;
14 144 2016 [Ljava.lang.ClassValue$Entry;
986 249494 246001872 [Ljava.lang.Object;
1 24 24 [Ljava.lang.String;
3 53 160 [Ljava.lang.Thread;
1 32 32 [Ljava.lang.ThreadGroup;
11 80 880 [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
9 40 360 [Ljava.lang.invoke.BoundMethodHandle$SpeciesData;
112 47 5264 [Ljava.lang.invoke.LambdaForm$Name;
4 32 128 [Ljava.lang.invoke.LambdaFormEditor$Transform;
29 205 5960 [Ljava.lang.invoke.MethodHandle;
182 76 13832 [Ljava.lang.ref.SoftReference;
1 24 24 [Ljava.lang.reflect.Constructor;
40 43 1736 [Ljava.lang.reflect.Field;
5 41 208 [Ljava.lang.reflect.Method;
2 24 48 [Ljava.lang.reflect.TypeVariable;
3 21 64 [Ljava.security.CodeSigner;
12 16 192 [Ljava.security.Principal;
11 24 264 [Ljava.security.ProtectionDomain;
1 32 32 [Ljava.security.cert.Certificate;
67 102 6888 [Ljava.util.HashMap$Node;
2 304 608 [Ljava.util.Hashtable$Entry;
1 32 32 [Ljava.util.Map$Entry;
45 80 3600 [Ljava.util.WeakHashMap$Entry;
33 283 9360 [Ljava.util.concurrent.ConcurrentHashMap$Node;
1 16 16 [Lsun.instrument.TransformerManager$TransformerInfo;
11 32 352 [Lsun.nio.fs.NativeBuffer;
2 16 32 [Lsun.reflect.generics.tree.ClassTypeSignature;
3 24 72 [Lsun.reflect.generics.tree.FieldTypeSignature;
2 24 48 [Lsun.reflect.generics.tree.FormalTypeParameter;
5 16 80 [Lsun.reflect.generics.tree.TypeArgument;
46 24 1104 [Lsun.security.x509.AVA;
12 32 392 [Lsun.security.x509.RDN;
5 24 120 [Z
75 24 1800 com.github.pemistahl.lingua.api.IsoCode639_1
75 24 1800 com.github.pemistahl.lingua.api.IsoCode639_3
75 40 3000 com.github.pemistahl.lingua.api.Language
1 64 64 com.github.pemistahl.lingua.api.LanguageDetector
18 24 432 com.github.pemistahl.lingua.internal.Alphabet
375 32 12000 com.github.pemistahl.lingua.internal.TrainingDataLanguageModel
1 16 16 com.sun.management.internal.PlatformMBeanProviderImpl$$Lambda$31/0x00000008000e1f18
375 72 27000 it.unimi.dsi.fastutil.objects.Object2DoubleOpenHashMap
21 32 672 java.io.File
1 24 24 java.io.File$PathStatus
22 56 1232 java.io.FileCleanable
22 40 880 java.io.FileDescriptor
2 32 64 java.io.FileInputStream
20 32 640 java.io.RandomAccessFile
2 16 32 java.lang.Boolean
18 24 432 java.lang.Character$UnicodeScript
135 192 25920 java.lang.Class
1 16 16 java.lang.Class$$Lambda$44/0x00000008000f79c8
23 64 1472 java.lang.Class$ReflectionData
14 64 896 java.lang.ClassValue$ClassValueMap
15 32 480 java.lang.ClassValue$Entry
1 16 16 java.lang.ClassValue$Identity
1 24 24 java.lang.ClassValue$Version
10 16 160 java.lang.Integer
2 56 112 java.lang.Module
1 16 16 java.lang.Module$$Lambda$60/0x000000080015dba0
43 16 688 java.lang.Object
1 16 16 java.lang.ProcessHandleImpl$$Lambda$42/0x00000008000f5f18
1 24 24 java.lang.ProcessHandleImpl$$Lambda$43/0x00000008000f6138
2 32 64 java.lang.Runtime$Version
11 24 264 java.lang.RuntimePermission
31118003 24 746832072 java.lang.String
1 16 16 java.lang.System$LoggerFinder$$Lambda$75/0x000000080018a580
15 368 5520 java.lang.Thread
3 48 144 java.lang.ThreadGroup
2 16 32 java.lang.ThreadLocal
11 24 264 java.lang.ThreadLocal$ThreadLocalMap
26 32 832 java.lang.ThreadLocal$ThreadLocalMap$Entry
1 56 56 java.lang.invoke.BoundMethodHandle$Specializer
1 48 48 java.lang.invoke.BoundMethodHandle$Specializer$Factory
9 48 432 java.lang.invoke.BoundMethodHandle$SpeciesData
39 32 1248 java.lang.invoke.BoundMethodHandle$Species_L
9 40 360 java.lang.invoke.BoundMethodHandle$Species_LJ
25 40 1000 java.lang.invoke.BoundMethodHandle$Species_LL
4 48 192 java.lang.invoke.BoundMethodHandle$Species_LLLL
1 48 48 java.lang.invoke.BoundMethodHandle$Species_LLLLL
1 56 56 java.lang.invoke.BoundMethodHandle$Species_LLLLLL
3 56 168 java.lang.invoke.BoundMethodHandle$Species_LLLLLLL
77 24 1848 java.lang.invoke.ConstantCallSite
38 32 1216 java.lang.invoke.DirectMethodHandle
30 40 1200 java.lang.invoke.DirectMethodHandle$Accessor
20 40 800 java.lang.invoke.DirectMethodHandle$Constructor
20 24 480 java.lang.invoke.Invokers
112 48 5376 java.lang.invoke.LambdaForm
4 32 128 java.lang.invoke.LambdaForm$BasicType
12 32 384 java.lang.invoke.LambdaForm$Kind
440 32 14080 java.lang.invoke.LambdaForm$Name
140 24 3360 java.lang.invoke.LambdaForm$NamedFunction
36 48 1728 java.lang.invoke.LambdaFormEditor$Transform
294 48 14112 java.lang.invoke.MemberName
77 32 2464 java.lang.invoke.MethodHandleNatives$CallSiteContext
243 40 9720 java.lang.invoke.MethodType
93 32 2976 java.lang.invoke.MethodTypeForm
241 24 5784 java.lang.invoke.ResolvedMethodName
1 16 16 java.lang.management.DefaultPlatformMBeanProvider$5$$Lambda$66/0x0000000800180818
1 16 16 java.lang.management.ManagementFactory$$Lambda$30/0x00000008000deb28
1 16 16 java.lang.management.ManagementFactory$$Lambda$53/0x000000080017d2a8
1 16 16 java.lang.management.ManagementFactory$$Lambda$54/0x000000080017d4f8
1 16 16 java.lang.management.ManagementFactory$$Lambda$55/0x000000080017dc30
62 64 3968 java.lang.module.ModuleDescriptor
1 16 16 java.lang.module.ModuleDescriptor$Builder$$Lambda$59/0x000000080015d970
362 24 8688 java.lang.module.ModuleDescriptor$Exports
4 24 96 java.lang.module.ModuleDescriptor$Opens
60 24 1440 java.lang.module.ModuleDescriptor$Provides
132 32 4224 java.lang.module.ModuleDescriptor$Requires
2 24 48 java.lang.module.ModuleDescriptor$Requires$Modifier
1 32 32 java.lang.module.ModuleDescriptor$Version
1 16 16 java.lang.ref.Cleaner
1 376 376 java.lang.ref.Finalizer$FinalizerThread
1 368 368 java.lang.ref.Reference$ReferenceHandler
46 32 1472 java.lang.ref.ReferenceQueue
47 16 752 java.lang.ref.ReferenceQueue$Lock
1 32 32 java.lang.ref.ReferenceQueue$Null
126 40 5040 java.lang.ref.SoftReference
1 72 72 java.lang.reflect.Constructor
258 72 18576 java.lang.reflect.Field
20 88 1760 java.lang.reflect.Method
1 16 16 java.lang.reflect.Proxy$$Lambda$57/0x000000080015d0f0
1 16 16 java.lang.reflect.Proxy$ProxyBuilder$$Lambda$58/0x000000080015d530
1 16 16 java.lang.reflect.ProxyGenerator$$Lambda$63/0x000000080015e768
1 16 16 java.lang.reflect.ProxyGenerator$$Lambda$64/0x000000080015e9a8
15 40 600 java.math.BigInteger
75 80 6000 java.net.URI
45 64 2880 java.net.URL
2 64 128 java.nio.DirectByteBuffer
1 56 56 java.nio.HeapByteBuffer
18 40 720 java.security.AccessControlContext
11 24 264 java.security.BasicPermissionCollection
1 24 24 java.security.CodeSigner
11 40 440 java.security.CodeSource
11 24 264 java.security.Permissions
12 40 480 java.security.ProtectionDomain
12 16 192 java.security.ProtectionDomain$Key
11 16 176 java.security.SecureClassLoader$CodeSourceKey
1 24 24 java.security.Timestamp
6 32 192 java.security.cert.PolicyQualifierInfo
32 24 768 java.util.ArrayDeque
34 24 816 java.util.ArrayList
1 16 16 java.util.Collections$EmptyList
1 16 16 java.util.Collections$EmptySet
42 24 1008 java.util.Collections$SetFromMap
74 16 1184 java.util.Collections$SingletonSet
5 32 160 java.util.Collections$SynchronizedMap
1 32 32 java.util.Collections$UnmodifiableMap
2 24 48 java.util.Collections$UnmodifiableRandomAccessList
1 16 16 java.util.Collections$UnmodifiableSet
11 24 264 java.util.Date
13 48 624 java.util.HashMap
2 16 32 java.util.HashMap$KeySet
141 32 4512 java.util.HashMap$Node
2 16 32 java.util.HashSet
2 48 96 java.util.Hashtable
44 32 1408 java.util.Hashtable$Entry
11 40 440 java.util.IdentityHashMap
11 16 176 java.util.IdentityHashMap$KeySet
68 24 1632 java.util.ImmutableCollections$List12
32 24 768 java.util.ImmutableCollections$ListN
201 24 4824 java.util.ImmutableCollections$Set12
114 24 2736 java.util.ImmutableCollections$SetN
62 56 3472 java.util.LinkedHashMap
551 40 22040 java.util.LinkedHashMap$Entry
32 16 512 java.util.LinkedHashMap$LinkedEntrySet
1 16 16 java.util.LinkedHashMap$LinkedKeySet
6 16 96 java.util.LinkedHashSet
1 16 16 java.util.Optional
5 48 240 java.util.TreeMap
38 40 1520 java.util.TreeMap$Entry
4 32 128 java.util.Vector
31 48 1488 java.util.WeakHashMap
14 40 560 java.util.WeakHashMap$Entry
31 16 496 java.util.WeakHashMap$KeySet
40 64 2560 java.util.concurrent.ConcurrentHashMap
1306 32 41792 java.util.concurrent.ConcurrentHashMap$Node
3 16 48 java.util.concurrent.ConcurrentHashMap$ValuesView
1 24 24 java.util.concurrent.Executors$DefaultThreadFactory
1 48 48 java.util.concurrent.LinkedBlockingQueue
1 24 24 java.util.concurrent.LinkedBlockingQueue$Node
1 32 32 java.util.concurrent.SynchronousQueue
1 16 16 java.util.concurrent.SynchronousQueue$TransferStack
1 32 32 java.util.concurrent.SynchronousQueue$TransferStack$SNode
2 72 144 java.util.concurrent.ThreadPoolExecutor
1 16 16 java.util.concurrent.ThreadPoolExecutor$AbortPolicy
11 48 528 java.util.concurrent.ThreadPoolExecutor$Worker
4 16 64 java.util.concurrent.atomic.AtomicInteger
10 32 320 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode
4 24 96 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject
4 16 64 java.util.concurrent.locks.ReentrantLock
4 32 128 java.util.concurrent.locks.ReentrantLock$NonfairSync
1 16 16 java.util.function.Function$$Lambda$67/0x0000000800180a58
50 16 800 java.util.jar.Attributes
26 24 624 java.util.jar.Attributes$Name
20 56 1120 java.util.jar.JarFile
31 104 3224 java.util.jar.JarFile$JarFileEntry
1 88 88 java.util.jar.JarVerifier
7 24 168 java.util.jar.Manifest
1 16 16 java.util.logging.Level$$Lambda$73/0x00000008001893f0
1 16 16 java.util.logging.Level$KnownLevel$$Lambda$71/0x0000000800184468
1 16 16 java.util.logging.Level$KnownLevel$$Lambda$72/0x00000008001846a8
1 16 16 java.util.logging.Level$KnownLevel$$Lambda$74/0x0000000800189630
1 16 16 java.util.regex.CharPredicates$$Lambda$27/0x00000008000da528
1 16 16 java.util.regex.CharPredicates$$Lambda$39/0x00000008000f3328
1 16 16 java.util.stream.Collectors$$Lambda$36/0x00000008000e92c0
1 16 16 java.util.stream.Collectors$$Lambda$37/0x00000008000e94e0
1 16 16 java.util.stream.Collectors$$Lambda$38/0x00000008000e9710
1 16 16 java.util.stream.Collectors$$Lambda$46/0x00000008000f7e30
1 16 16 java.util.stream.Collectors$$Lambda$47/0x00000008000f8060
1 16 16 java.util.stream.Collectors$$Lambda$48/0x00000008000f82a8
1 16 16 java.util.stream.Collectors$$Lambda$68/0x0000000800180c98
1 16 16 java.util.stream.Collectors$$Lambda$70/0x00000008001810f0
37 64 2368 java.util.zip.Inflater
37 24 888 java.util.zip.Inflater$InflaterZStreamRef
1 32 32 java.util.zip.ZipCoder$UTF8ZipCoder
31 32 992 java.util.zip.ZipFile$CleanableResource
20 80 1600 java.util.zip.ZipFile$Source
20 24 480 java.util.zip.ZipFile$Source$Key
10 16 160 javax.security.auth.x500.X500Principal
1 16 16 jdk.internal.jimage.ImageBufferCache$1
1 104 104 jdk.internal.loader.ClassLoaders$AppClassLoader
1 104 104 jdk.internal.loader.ClassLoaders$BootClassLoader
1 104 104 jdk.internal.loader.ClassLoaders$PlatformClassLoader
1 40 40 jdk.internal.loader.URLClassPath
1 24 24 jdk.internal.loader.URLClassPath$FileLoader
20 48 960 jdk.internal.loader.URLClassPath$JarLoader
1 376 376 jdk.internal.misc.InnocuousThread
1 16 16 jdk.internal.misc.TerminatingThreadLocal$1
1 24 24 jdk.internal.module.ModuleHashes
62 56 3472 jdk.internal.module.ModuleReferenceImpl
1 16 16 jdk.internal.module.ModuleTarget
62 24 1488 jdk.internal.module.SystemModuleFinders$2
60 16 960 jdk.internal.module.SystemModuleFinders$3
62 24 1488 jdk.internal.module.SystemModuleFinders$SystemModuleReader
1 16 16 jdk.internal.perf.Perf
1 24 24 jdk.internal.perf.Perf$CleanerAction
1 24 24 jdk.internal.ref.CleanerImpl
1 40 40 jdk.internal.ref.CleanerImpl$CleanerCleanable
158 48 7584 jdk.internal.ref.CleanerImpl$PhantomCleanableRef
1 16 16 jdk.jfr.internal.dcmd.DCmdCheck$$Lambda$79/0x0000000800191308
1 16 16 jdk.jfr.internal.dcmd.DCmdDump$$Lambda$78/0x00000008001910e8
1 16 16 jdk.jfr.internal.dcmd.DCmdStart$$Lambda$77/0x0000000800190ec8
1 16 16 jdk.jfr.internal.dcmd.DCmdStop$$Lambda$76/0x000000080018e4b8
1 16 16 kotlin.collections.CollectionsKt___CollectionsKt$asSequence$$inlined$Sequence$1
1 16 16 kotlin.collections.EmptyMap
1 24 24 org.openjdk.jol.info.AbstractGraphWalker$ReferenceFieldsClassValue
1 96 96 org.openjdk.jol.vm.HotspotUnsafe
1 24 24 org.openjdk.jol.vm.HotspotUnsafe$1
1 48 48 org.openjdk.jol.vm.HotspotUnsafe$Sizes
1 32 32 sun.instrument.InstrumentationImpl
1 24 24 sun.instrument.TransformerManager
4 56 224 sun.invoke.util.Wrapper
1 16 16 sun.management.ManagementFactoryHelper$$Lambda$80/0x0000000800194e58
1 16 16 sun.misc.Unsafe
1 16 16 sun.net.www.protocol.file.Handler
1 16 16 sun.net.www.protocol.jar.Handler
1 16 16 sun.net.www.protocol.jar.JarFileFactory
11 72 792 sun.net.www.protocol.jar.URLJarFile
1 16 16 sun.net.www.protocol.jrt.Handler
1 24 24 sun.nio.cs.UTF_8
1 32 32 sun.nio.fs.MacOSXFileSystem
1 16 16 sun.nio.fs.MacOSXFileSystemProvider
11 32 352 sun.nio.fs.NativeBuffer
11 24 264 sun.nio.fs.NativeBuffer$Deallocator
1 16 16 sun.nio.fs.NativeBuffers$1
20 128 2560 sun.nio.fs.UnixFileAttributes
20 16 320 sun.nio.fs.UnixFileAttributes$UnixAsBasicFileAttributes
2 32 64 sun.nio.fs.UnixFileKey
32 32 1024 sun.nio.fs.UnixPath
2 24 48 sun.reflect.generics.factory.CoreReflectionFactory
3 32 96 sun.reflect.generics.reflectiveObjects.TypeVariableImpl
2 32 64 sun.reflect.generics.repository.ClassRepository
2 24 48 sun.reflect.generics.scope.ClassScope
2 24 48 sun.reflect.generics.tree.ClassSignature
5 16 80 sun.reflect.generics.tree.ClassTypeSignature
3 24 72 sun.reflect.generics.tree.FormalTypeParameter
5 24 120 sun.reflect.generics.tree.SimpleClassTypeSignature
2 24 48 sun.security.provider.certpath.X509CertPath
5 48 240 sun.security.rsa.RSAPublicKeyImpl
1 32 32 sun.security.rsa.RSAUtil$KeyType
5 24 120 sun.security.util.BitArray
46 40 1840 sun.security.util.DerInputStream
46 32 1472 sun.security.util.DerValue
11 24 264 sun.security.util.LazyCodeSourcePermissionCollection
88 32 2816 sun.security.util.ObjectIdentifier
46 24 1104 sun.security.x509.AVA
7 24 168 sun.security.x509.AccessDescription
15 24 360 sun.security.x509.AlgorithmId
4 32 128 sun.security.x509.AuthorityInfoAccessExtension
5 40 200 sun.security.x509.AuthorityKeyIdentifierExtension
5 32 160 sun.security.x509.BasicConstraintsExtension
4 32 128 sun.security.x509.CRLDistributionPointsExtension
5 16 80 sun.security.x509.CertificateAlgorithmId
5 24 120 sun.security.x509.CertificateExtensions
4 32 128 sun.security.x509.CertificatePoliciesExtension
6 16 96 sun.security.x509.CertificatePolicyId
5 16 80 sun.security.x509.CertificateSerialNumber
5 24 120 sun.security.x509.CertificateValidity
5 16 80 sun.security.x509.CertificateVersion
5 16 80 sun.security.x509.CertificateX509Key
13 16 208 sun.security.x509.DNSName
6 32 192 sun.security.x509.DistributionPoint
4 32 128 sun.security.x509.ExtendedKeyUsageExtension
15 16 240 sun.security.x509.GeneralName
8 16 128 sun.security.x509.GeneralNames
10 16 160 sun.security.x509.KeyIdentifier
5 32 160 sun.security.x509.KeyUsageExtension
6 24 144 sun.security.x509.PolicyInformation
46 24 1104 sun.security.x509.RDN
5 16 80 sun.security.x509.SerialNumber
2 32 64 sun.security.x509.SubjectAlternativeNameExtension
5 32 160 sun.security.x509.SubjectKeyIdentifierExtension
13 32 416 sun.security.x509.URIName
12 48 576 sun.security.x509.X500Name
5 80 400 sun.security.x509.X509CertImpl
5 56 280 sun.security.x509.X509CertInfo
1 16 16 sun.tools.attach.HotSpotVirtualMachine$$Lambda$41/0x0000000800142d50
62247818 2281277024 (total)
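A dump this long is easier to read once it is ranked by the SUM column. Here is a small sketch that parses rows in the `COUNT AVG SUM DESCRIPTION` layout shown above and prints the largest consumers with their share of the total. The class `FootprintSummary` and its helpers are hypothetical names for this example; the sample rows and the total are copied from the dump above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FootprintSummary {
    /** One parsed row of a footprint table. */
    static final class Row {
        final long count;
        final long sum;
        final String description;

        Row(long count, long sum, String description) {
            this.count = count;
            this.sum = sum;
            this.description = description;
        }
    }

    /** Parses a "COUNT AVG SUM DESCRIPTION" row; returns null for headers and the total line. */
    static Row parseRow(String line) {
        String[] parts = line.trim().split("\\s+", 4);
        if (parts.length < 4) return null; // e.g. the "(total)" line has only three fields
        try {
            return new Row(Long.parseLong(parts[0]), Long.parseLong(parts[2]), parts[3]);
        } catch (NumberFormatException e) {
            return null; // the column header row
        }
    }

    /** Returns up to {@code limit} rows, largest SUM first. */
    static List<Row> topBySum(List<String> lines, int limit) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            Row row = parseRow(line);
            if (row != null) rows.add(row);
        }
        rows.sort((a, b) -> Long.compare(b.sum, a.sum));
        return rows.subList(0, Math.min(limit, rows.size()));
    }

    public static void main(String[] args) {
        long total = 2281277024L; // from the "(total)" line of the dump
        List<String> sample = Arrays.asList(
                "COUNT AVG SUM DESCRIPTION",
                "31118328 25 795811136 [B",
                "375 1311825 491934544 [D",
                "72 3780 272160 [I",
                "986 249494 246001872 [Ljava.lang.Object;",
                "31118003 24 746832072 java.lang.String",
                "62247818 2281277024 (total)");
        for (Row row : topBySum(sample, 4)) {
            System.out.printf("%-22s %,15d B  %5.1f%%%n",
                    row.description, row.sum, 100.0 * row.sum / total);
        }
    }
}
```

Four entries (`[B`, `java.lang.String`, `[D`, and `[Ljava.lang.Object;`) add up to 2,280,579,624 B, which is over 99% of the total, so almost all of the footprint is model data rather than JVM or library overhead.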
The proposed code change has the following footprint:
com.github.pemistahl.lingua.api.LanguageDetector@f0f2775d footprint:
COUNT AVG SUM DESCRIPTION
2909 766 2230680 [B
1875 137561 257928736 [C
1875 66395 124492328 [F
72 3780 272216 [I
1 16 16 [J
375 40 15000 [Lcom.github.pemistahl.lingua.internal.TrainingDataLanguageModel$RelativeFrequencies$Entries;
210 31 6592 [Ljava.lang.Class;
14 144 2016 [Ljava.lang.ClassValue$Entry;
611 49 30120 [Ljava.lang.Object;
1 24 24 [Ljava.lang.String;
3 53 160 [Ljava.lang.Thread;
1 32 32 [Ljava.lang.ThreadGroup;
11 80 880 [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
9 40 360 [Ljava.lang.invoke.BoundMethodHandle$SpeciesData;
112 47 5264 [Ljava.lang.invoke.LambdaForm$Name;
4 32 128 [Ljava.lang.invoke.LambdaFormEditor$Transform;
29 205 5960 [Ljava.lang.invoke.MethodHandle;
182 76 13832 [Ljava.lang.ref.SoftReference;
1 24 24 [Ljava.lang.reflect.Constructor;
40 43 1736 [Ljava.lang.reflect.Field;
5 41 208 [Ljava.lang.reflect.Method;
2 24 48 [Ljava.lang.reflect.TypeVariable;
3 21 64 [Ljava.security.CodeSigner;
11 16 176 [Ljava.security.Principal;
11 24 264 [Ljava.security.ProtectionDomain;
1 32 32 [Ljava.security.cert.Certificate;
66 103 6808 [Ljava.util.HashMap$Node;
2 304 608 [Ljava.util.Hashtable$Entry;
1 32 32 [Ljava.util.Map$Entry;
45 80 3600 [Ljava.util.WeakHashMap$Entry;
31 296 9200 [Ljava.util.concurrent.ConcurrentHashMap$Node;
1 16 16 [Lsun.instrument.TransformerManager$TransformerInfo;
11 32 352 [Lsun.nio.fs.NativeBuffer;
2 16 32 [Lsun.reflect.generics.tree.ClassTypeSignature;
3 24 72 [Lsun.reflect.generics.tree.FieldTypeSignature;
2 24 48 [Lsun.reflect.generics.tree.FormalTypeParameter;
5 16 80 [Lsun.reflect.generics.tree.TypeArgument;
46 24 1104 [Lsun.security.x509.AVA;
12 32 392 [Lsun.security.x509.RDN;
5 24 120 [Z
75 24 1800 com.github.pemistahl.lingua.api.IsoCode639_1
75 24 1800 com.github.pemistahl.lingua.api.IsoCode639_3
75 40 3000 com.github.pemistahl.lingua.api.Language
1 64 64 com.github.pemistahl.lingua.api.LanguageDetector
18 24 432 com.github.pemistahl.lingua.internal.Alphabet
375 32 12000 com.github.pemistahl.lingua.internal.TrainingDataLanguageModel
375 16 6000 com.github.pemistahl.lingua.internal.TrainingDataLanguageModel$RelativeFrequencies
1875 24 45000 com.github.pemistahl.lingua.internal.TrainingDataLanguageModel$RelativeFrequencies$Entries
1 16 16 com.sun.management.internal.PlatformMBeanProviderImpl$$Lambda$31/0x00000008000e1c90
21 32 672 java.io.File
1 24 24 java.io.File$PathStatus
22 56 1232 java.io.FileCleanable
22 40 880 java.io.FileDescriptor
2 32 64 java.io.FileInputStream
20 32 640 java.io.RandomAccessFile
2 16 32 java.lang.Boolean
18 24 432 java.lang.Character$UnicodeScript
135 192 25920 java.lang.Class
1 16 16 java.lang.Class$$Lambda$44/0x00000008000f7370
23 64 1472 java.lang.Class$ReflectionData
14 64 896 java.lang.ClassValue$ClassValueMap
15 32 480 java.lang.ClassValue$Entry
1 16 16 java.lang.ClassValue$Identity
1 24 24 java.lang.ClassValue$Version
10 16 160 java.lang.Integer
2 56 112 java.lang.Module
1 16 16 java.lang.Module$$Lambda$60/0x000000080015d918
43 16 688 java.lang.Object
1 16 16 java.lang.ProcessHandleImpl$$Lambda$42/0x00000008000f58c0
1 24 24 java.lang.ProcessHandleImpl$$Lambda$43/0x00000008000f5ae0
2 32 64 java.lang.Runtime$Version
10 24 240 java.lang.RuntimePermission
2584 24 62016 java.lang.String
1 16 16 java.lang.System$LoggerFinder$$Lambda$75/0x0000000800186750
15 368 5520 java.lang.Thread
3 48 144 java.lang.ThreadGroup
2 16 32 java.lang.ThreadLocal
11 24 264 java.lang.ThreadLocal$ThreadLocalMap
26 32 832 java.lang.ThreadLocal$ThreadLocalMap$Entry
1 56 56 java.lang.invoke.BoundMethodHandle$Specializer
1 48 48 java.lang.invoke.BoundMethodHandle$Specializer$Factory
9 48 432 java.lang.invoke.BoundMethodHandle$SpeciesData
39 32 1248 java.lang.invoke.BoundMethodHandle$Species_L
9 40 360 java.lang.invoke.BoundMethodHandle$Species_LJ
25 40 1000 java.lang.invoke.BoundMethodHandle$Species_LL
4 48 192 java.lang.invoke.BoundMethodHandle$Species_LLLL
1 48 48 java.lang.invoke.BoundMethodHandle$Species_LLLLL
1 56 56 java.lang.invoke.BoundMethodHandle$Species_LLLLLL
3 56 168 java.lang.invoke.BoundMethodHandle$Species_LLLLLLL
77 24 1848 java.lang.invoke.ConstantCallSite
38 32 1216 java.lang.invoke.DirectMethodHandle
30 40 1200 java.lang.invoke.DirectMethodHandle$Accessor
20 40 800 java.lang.invoke.DirectMethodHandle$Constructor
20 24 480 java.lang.invoke.Invokers
112 48 5376 java.lang.invoke.LambdaForm
4 32 128 java.lang.invoke.LambdaForm$BasicType
12 32 384 java.lang.invoke.LambdaForm$Kind
440 32 14080 java.lang.invoke.LambdaForm$Name
140 24 3360 java.lang.invoke.LambdaForm$NamedFunction
36 48 1728 java.lang.invoke.LambdaFormEditor$Transform
294 48 14112 java.lang.invoke.MemberName
77 32 2464 java.lang.invoke.MethodHandleNatives$CallSiteContext
243 40 9720 java.lang.invoke.MethodType
93 32 2976 java.lang.invoke.MethodTypeForm
241 24 5784 java.lang.invoke.ResolvedMethodName
1 16 16 java.lang.management.DefaultPlatformMBeanProvider$5$$Lambda$66/0x000000080014d618
1 16 16 java.lang.management.ManagementFactory$$Lambda$30/0x00000008000de8a0
1 16 16 java.lang.management.ManagementFactory$$Lambda$53/0x000000080017cec8
1 16 16 java.lang.management.ManagementFactory$$Lambda$54/0x000000080017d118
1 16 16 java.lang.management.ManagementFactory$$Lambda$55/0x000000080017d850
62 64 3968 java.lang.module.ModuleDescriptor
1 16 16 java.lang.module.ModuleDescriptor$Builder$$Lambda$59/0x000000080015d6e8
362 24 8688 java.lang.module.ModuleDescriptor$Exports
4 24 96 java.lang.module.ModuleDescriptor$Opens
60 24 1440 java.lang.module.ModuleDescriptor$Provides
132 32 4224 java.lang.module.ModuleDescriptor$Requires
2 24 48 java.lang.module.ModuleDescriptor$Requires$Modifier
1 32 32 java.lang.module.ModuleDescriptor$Version
1 16 16 java.lang.ref.Cleaner
1 376 376 java.lang.ref.Finalizer$FinalizerThread
1 368 368 java.lang.ref.Reference$ReferenceHandler
46 32 1472 java.lang.ref.ReferenceQueue
47 16 752 java.lang.ref.ReferenceQueue$Lock
1 32 32 java.lang.ref.ReferenceQueue$Null
125 40 5000 java.lang.ref.SoftReference
1 72 72 java.lang.reflect.Constructor
258 72 18576 java.lang.reflect.Field
20 88 1760 java.lang.reflect.Method
1 16 16 java.lang.reflect.Proxy$$Lambda$57/0x000000080015ce68
1 16 16 java.lang.reflect.Proxy$ProxyBuilder$$Lambda$58/0x000000080015d2a8
1 16 16 java.lang.reflect.ProxyGenerator$$Lambda$63/0x000000080015e4e0
1 16 16 java.lang.reflect.ProxyGenerator$$Lambda$64/0x000000080015e720
15 40 600 java.math.BigInteger
75 80 6000 java.net.URI
45 64 2880 java.net.URL
2 64 128 java.nio.DirectByteBuffer
1 56 56 java.nio.HeapByteBuffer
18 40 720 java.security.AccessControlContext
10 24 240 java.security.BasicPermissionCollection
1 24 24 java.security.CodeSigner
10 40 400 java.security.CodeSource
10 24 240 java.security.Permissions
11 40 440 java.security.ProtectionDomain
11 16 176 java.security.ProtectionDomain$Key
10 16 160 java.security.SecureClassLoader$CodeSourceKey
1 24 24 java.security.Timestamp
6 32 192 java.security.cert.PolicyQualifierInfo
32 24 768 java.util.ArrayDeque
34 24 816 java.util.ArrayList
1 16 16 java.util.Collections$EmptyList
1 16 16 java.util.Collections$EmptySet
42 24 1008 java.util.Collections$SetFromMap
74 16 1184 java.util.Collections$SingletonSet
5 32 160 java.util.Collections$SynchronizedMap
1 32 32 java.util.Collections$UnmodifiableMap
2 24 48 java.util.Collections$UnmodifiableRandomAccessList
1 16 16 java.util.Collections$UnmodifiableSet
11 24 264 java.util.Date
12 48 576 java.util.HashMap
2 16 32 java.util.HashMap$KeySet
141 32 4512 java.util.HashMap$Node
2 16 32 java.util.HashSet
2 48 96 java.util.Hashtable
44 32 1408 java.util.Hashtable$Entry
11 40 440 java.util.IdentityHashMap
11 16 176 java.util.IdentityHashMap$KeySet
68 24 1632 java.util.ImmutableCollections$List12
32 24 768 java.util.ImmutableCollections$ListN
201 24 4824 java.util.ImmutableCollections$Set12
114 24 2736 java.util.ImmutableCollections$SetN
61 56 3416 java.util.LinkedHashMap
539 40 21560 java.util.LinkedHashMap$Entry
32 16 512 java.util.LinkedHashMap$LinkedEntrySet
1 16 16 java.util.LinkedHashMap$LinkedKeySet
6 16 96 java.util.LinkedHashSet
1 16 16 java.util.Optional
5 48 240 java.util.TreeMap
38 40 1520 java.util.TreeMap$Entry
4 32 128 java.util.Vector
31 48 1488 java.util.WeakHashMap
14 40 560 java.util.WeakHashMap$Entry
31 16 496 java.util.WeakHashMap$KeySet
38 64 2432 java.util.concurrent.ConcurrentHashMap
1302 32 41664 java.util.concurrent.ConcurrentHashMap$Node
3 16 48 java.util.concurrent.ConcurrentHashMap$ValuesView
1 24 24 java.util.concurrent.Executors$DefaultThreadFactory
1 48 48 java.util.concurrent.LinkedBlockingQueue
1 24 24 java.util.concurrent.LinkedBlockingQueue$Node
1 32 32 java.util.concurrent.SynchronousQueue
1 16 16 java.util.concurrent.SynchronousQueue$TransferStack
1 32 32 java.util.concurrent.SynchronousQueue$TransferStack$SNode
2 72 144 java.util.concurrent.ThreadPoolExecutor
1 16 16 java.util.concurrent.ThreadPoolExecutor$AbortPolicy
11 48 528 java.util.concurrent.ThreadPoolExecutor$Worker
4 16 64 java.util.concurrent.atomic.AtomicInteger
10 32 320 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode
4 24 96 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject
1 32 32 java.util.concurrent.locks.AbstractQueuedSynchronizer$ExclusiveNode
4 16 64 java.util.concurrent.locks.ReentrantLock
4 32 128 java.util.concurrent.locks.ReentrantLock$NonfairSync
1 16 16 java.util.function.Function$$Lambda$67/0x000000080014d858
49 16 784 java.util.jar.Attributes
16 24 384 java.util.jar.Attributes$Name
20 56 1120 java.util.jar.JarFile
31 104 3224 java.util.jar.JarFile$JarFileEntry
1 88 88 java.util.jar.JarVerifier
6 24 144 java.util.jar.Manifest
1 16 16 java.util.logging.Level$$Lambda$73/0x00000008001855c0
1 16 16 java.util.logging.Level$KnownLevel$$Lambda$71/0x0000000800180638
1 16 16 java.util.logging.Level$KnownLevel$$Lambda$72/0x0000000800180878
1 16 16 java.util.logging.Level$KnownLevel$$Lambda$74/0x0000000800185800
1 16 16 java.util.regex.CharPredicates$$Lambda$27/0x00000008000da2a0
1 16 16 java.util.regex.CharPredicates$$Lambda$39/0x00000008000f32a8
1 16 16 java.util.stream.Collectors$$Lambda$36/0x00000008000e9038
1 16 16 java.util.stream.Collectors$$Lambda$37/0x00000008000e9258
1 16 16 java.util.stream.Collectors$$Lambda$38/0x00000008000e9488
1 16 16 java.util.stream.Collectors$$Lambda$46/0x00000008000f77d8
1 16 16 java.util.stream.Collectors$$Lambda$47/0x00000008000f7a08
1 16 16 java.util.stream.Collectors$$Lambda$48/0x00000008000f7c50
1 16 16 java.util.stream.Collectors$$Lambda$68/0x000000080014da98
1 16 16 java.util.stream.Collectors$$Lambda$70/0x000000080014def0
39 64 2496 java.util.zip.Inflater
39 24 936 java.util.zip.Inflater$InflaterZStreamRef
1 32 32 java.util.zip.ZipCoder$UTF8ZipCoder
31 32 992 java.util.zip.ZipFile$CleanableResource
20 80 1600 java.util.zip.ZipFile$Source
20 24 480 java.util.zip.ZipFile$Source$Key
10 16 160 javax.security.auth.x500.X500Principal
1 16 16 jdk.internal.jimage.ImageBufferCache$1
1 104 104 jdk.internal.loader.ClassLoaders$AppClassLoader
1 104 104 jdk.internal.loader.ClassLoaders$BootClassLoader
1 104 104 jdk.internal.loader.ClassLoaders$PlatformClassLoader
1 40 40 jdk.internal.loader.URLClassPath
1 24 24 jdk.internal.loader.URLClassPath$FileLoader
20 48 960 jdk.internal.loader.URLClassPath$JarLoader
1 376 376 jdk.internal.misc.InnocuousThread
1 16 16 jdk.internal.misc.TerminatingThreadLocal$1
1 24 24 jdk.internal.module.ModuleHashes
62 56 3472 jdk.internal.module.ModuleReferenceImpl
1 16 16 jdk.internal.module.ModuleTarget
62 24 1488 jdk.internal.module.SystemModuleFinders$2
60 16 960 jdk.internal.module.SystemModuleFinders$3
62 24 1488 jdk.internal.module.SystemModuleFinders$SystemModuleReader
1 16 16 jdk.internal.perf.Perf
1 24 24 jdk.internal.perf.Perf$CleanerAction
1 24 24 jdk.internal.ref.CleanerImpl
1 40 40 jdk.internal.ref.CleanerImpl$CleanerCleanable
160 48 7680 jdk.internal.ref.CleanerImpl$PhantomCleanableRef
1 16 16 jdk.jfr.internal.dcmd.DCmdCheck$$Lambda$79/0x000000080018d4d8
1 16 16 jdk.jfr.internal.dcmd.DCmdDump$$Lambda$78/0x000000080018d2b8
1 16 16 jdk.jfr.internal.dcmd.DCmdStart$$Lambda$77/0x000000080018d098
1 16 16 jdk.jfr.internal.dcmd.DCmdStop$$Lambda$76/0x000000080018a688
1 16 16 kotlin.collections.CollectionsKt___CollectionsKt$asSequence$$inlined$Sequence$1
1 16 16 kotlin.collections.EmptyMap
1 24 24 org.openjdk.jol.info.AbstractGraphWalker$ReferenceFieldsClassValue
1 96 96 org.openjdk.jol.vm.HotspotUnsafe
1 24 24 org.openjdk.jol.vm.HotspotUnsafe$1
1 48 48 org.openjdk.jol.vm.HotspotUnsafe$Sizes
1 32 32 sun.instrument.InstrumentationImpl
1 24 24 sun.instrument.TransformerManager
4 56 224 sun.invoke.util.Wrapper
1 16 16 sun.management.ManagementFactoryHelper$$Lambda$80/0x0000000800191028
1 16 16 sun.misc.Unsafe
1 16 16 sun.net.www.protocol.file.Handler
1 16 16 sun.net.www.protocol.jar.Handler
1 16 16 sun.net.www.protocol.jar.JarFileFactory
11 72 792 sun.net.www.protocol.jar.URLJarFile
1 16 16 sun.net.www.protocol.jrt.Handler
1 24 24 sun.nio.cs.UTF_8
1 32 32 sun.nio.fs.MacOSXFileSystem
1 16 16 sun.nio.fs.MacOSXFileSystemProvider
11 32 352 sun.nio.fs.NativeBuffer
11 24 264 sun.nio.fs.NativeBuffer$Deallocator
1 16 16 sun.nio.fs.NativeBuffers$1
20 128 2560 sun.nio.fs.UnixFileAttributes
20 16 320 sun.nio.fs.UnixFileAttributes$UnixAsBasicFileAttributes
2 32 64 sun.nio.fs.UnixFileKey
32 32 1024 sun.nio.fs.UnixPath
2 24 48 sun.reflect.generics.factory.CoreReflectionFactory
3 32 96 sun.reflect.generics.reflectiveObjects.TypeVariableImpl
2 32 64 sun.reflect.generics.repository.ClassRepository
2 24 48 sun.reflect.generics.scope.ClassScope
2 24 48 sun.reflect.generics.tree.ClassSignature
5 16 80 sun.reflect.generics.tree.ClassTypeSignature
3 24 72 sun.reflect.generics.tree.FormalTypeParameter
5 24 120 sun.reflect.generics.tree.SimpleClassTypeSignature
2 24 48 sun.security.provider.certpath.X509CertPath
5 48 240 sun.security.rsa.RSAPublicKeyImpl
1 32 32 sun.security.rsa.RSAUtil$KeyType
5 24 120 sun.security.util.BitArray
46 40 1840 sun.security.util.DerInputStream
46 32 1472 sun.security.util.DerValue
10 24 240 sun.security.util.LazyCodeSourcePermissionCollection
88 32 2816 sun.security.util.ObjectIdentifier
46 24 1104 sun.security.x509.AVA
7 24 168 sun.security.x509.AccessDescription
15 24 360 sun.security.x509.AlgorithmId
4 32 128 sun.security.x509.AuthorityInfoAccessExtension
5 40 200 sun.security.x509.AuthorityKeyIdentifierExtension
5 32 160 sun.security.x509.BasicConstraintsExtension
4 32 128 sun.security.x509.CRLDistributionPointsExtension
5 16 80 sun.security.x509.CertificateAlgorithmId
5 24 120 sun.security.x509.CertificateExtensions
4 32 128 sun.security.x509.CertificatePoliciesExtension
6 16 96 sun.security.x509.CertificatePolicyId
5 16 80 sun.security.x509.CertificateSerialNumber
5 24 120 sun.security.x509.CertificateValidity
5 16 80 sun.security.x509.CertificateVersion
5 16 80 sun.security.x509.CertificateX509Key
13 16 208 sun.security.x509.DNSName
6 32 192 sun.security.x509.DistributionPoint
4 32 128 sun.security.x509.ExtendedKeyUsageExtension
15 16 240 sun.security.x509.GeneralName
8 16 128 sun.security.x509.GeneralNames
10 16 160 sun.security.x509.KeyIdentifier
5 32 160 sun.security.x509.KeyUsageExtension
6 24 144 sun.security.x509.PolicyInformation
46 24 1104 sun.security.x509.RDN
5 16 80 sun.security.x509.SerialNumber
2 32 64 sun.security.x509.SubjectAlternativeNameExtension
5 32 160 sun.security.x509.SubjectKeyIdentifierExtension
13 32 416 sun.security.x509.URIName
12 48 576 sun.security.x509.X500Name
5 80 400 sun.security.x509.X509CertImpl
5 56 280 sun.security.x509.X509CertInfo
1 16 16 sun.tools.attach.HotSpotVirtualMachine$$Lambda$41/0x0000000800140720
22192 385479016 (total)
@sigpwned, @Marcono1234, thank you for your feedback; it is really appreciated.
I changed my code to try to gain some performance, though I fear that this version requires a bit more memory. Please let me know what you think.
I redesigned the frequency data as an n-ary search tree; the memory requirement is roughly 1.2 GB.
In my tests, this version is faster than 1.1.1, without any breaking changes.
Version v1.1.1
Benchmark Mode Cnt Score Error Units
DetectBenchmark.detect thrpt 4 0.051 ± 0.002 ops/s
Version v1.1.1-5-g2d80e26
Benchmark Mode Cnt Score Error Units
DetectBenchmark.detect thrpt 4 0.063 ± 0.007 ops/s
I reduced the memory requirement to 440 MB. This version should be faster than the original in both preload and execution time.
Thank you for your consideration, @pemistahl. I have continued working on the main branch: https://github.com/fvasco/lingua
I reinstated the language cache as a thread-safe one and removed the thread pool, because it does not really improve performance yet requires special care, such as an explicit destructor. In my experience, a thread pool like that does not really help on the server side.
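A minimal sketch of what a thread-safe lazy cache without a dedicated thread pool could look like. `ModelCache` and its loader function are hypothetical names used for illustration only; this is not the code from the fork.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical lazy model cache; an illustration, not Lingua's actual class. */
final class ModelCache<K, V> {
    private final Map<K, V> models = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    /** The loader must not return null and must not touch this cache itself. */
    ModelCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached model, loading it at most once per key even under concurrent calls. */
    V get(K key) {
        return models.computeIfAbsent(key, loader);
    }

    /** Number of models loaded so far. */
    int size() {
        return models.size();
    }
}
```

Because `ConcurrentHashMap.computeIfAbsent` runs the loader atomically per key, callers on any thread can share one cache instance, and there is no executor that has to be shut down afterwards.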
The big open problem remains loading the data set. I suppose that using the (ugly) Java serialization format instead of JSON could substantially improve startup time without any downside, since all memory optimizations can be performed at training time. In addition, Lingua could drop the kotlinx.serialization dependency.
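As a rough illustration of the idea, the sketch below round-trips a frequency map through Java's built-in serialization. `FrequencyStore` and the `Map<String, Double>` layout are assumptions made for this example; the real model files and classes differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that stores ngram frequencies in Java's binary serialization format. */
final class FrequencyStore {

    /** Serializes the map once, e.g. at training time. */
    static byte[] write(Map<String, Double> frequencies) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new HashMap<>(frequencies)); // HashMap is Serializable
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Restores the map at startup without any JSON parsing. */
    @SuppressWarnings("unchecked")
    static Map<String, Double> read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Map<String, Double>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Whether this actually loads faster than the JSON files would have to be measured with the benchmark. Java deserialization should also only be applied to trusted data, such as files bundled with the library.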
Please contact me if you need further info.
Hi @fvasco and @sigpwned, I'm very sorry for my late response. Thanks a lot for your hard work to reduce the memory footprint of the library. I've been busy doing other things recently, so I could not evaluate your changes yet.
Despite all your efforts, I want to be honest with you. Your code in the file TrainingDataLanguageModel.kt, especially, is hard to grasp. I'm sure that I don't want to maintain this code in the future. So I most probably won't merge it, at least not in its current state. However, as soon as I find the time, I will evaluate your changes in more detail and perhaps I will change my mind.
The JVM has always been very memory-hungry; this is no surprise. If your requirements are so special that memory is an issue for you, you should perhaps switch to a more efficient language such as Go or Rust. You've certainly discovered my implementations of Lingua in these two other languages. But if you are bound to the JVM, perhaps it's better to use your modified version of the library exclusively in your own projects.
If you plan to continue working on this PR, can you please work against the branch v1.2.0-wip instead of main? This branch contains some other important changes that are not yet included in main.
Thanks again for your work. I really appreciate it.
Thank you for the suggestions, @pemistahl. I merged your commits and @Marcono1234's proposals, improved the initialization time, and reduced the memory requirement for all models to below 335 MB.
Accuracy is comparable to your version, so I will consider using my fork.
By the way, TrainingDataLanguageModel implements a simple search tree.
The root is the empty ngram, the first level contains unigrams, and so on.
I generated some specialized implementations (a node with 1, 2, 3, ..., 7 children) to improve performance.
MutableNode is a temporary class that holds the data before the tree is optimized.
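For readers who have not opened the file, here is a minimal sketch of such a search tree. It is not the code from the PR: `CompactNode`, `freeze` and `frequencyOf` are hypothetical names, and a sorted key array with binary search stands in for the generated fixed-size node classes.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Build-time node (hypothetical layout): easy to fill, wasteful to keep in memory. */
final class MutableNode {
    float frequency; // 0 means "no frequency stored for this ngram"
    final TreeMap<Character, MutableNode> children = new TreeMap<>();

    /** Stores the frequency of an ngram, creating one level per character. */
    void put(String ngram, float value) {
        MutableNode node = this;
        for (int i = 0; i < ngram.length(); i++) {
            node = node.children.computeIfAbsent(ngram.charAt(i), c -> new MutableNode());
        }
        node.frequency = value;
    }

    /** Converts this subtree into the compact, read-only form. */
    CompactNode freeze() {
        char[] keys = new char[children.size()];
        CompactNode[] nodes = new CompactNode[children.size()];
        int i = 0;
        for (Map.Entry<Character, MutableNode> entry : children.entrySet()) {
            keys[i] = entry.getKey(); // TreeMap iterates in ascending key order
            nodes[i] = entry.getValue().freeze();
            i++;
        }
        return new CompactNode(frequency, keys, nodes);
    }
}

/** Read-only node (hypothetical): sorted keys plus a parallel array of children. */
final class CompactNode {
    private final float frequency;
    private final char[] keys;
    private final CompactNode[] children;

    CompactNode(float frequency, char[] keys, CompactNode[] children) {
        this.frequency = frequency;
        this.keys = keys;
        this.children = children;
    }

    /** Walks one level per character; returns 0 if the ngram is unknown. */
    float frequencyOf(String ngram) {
        CompactNode node = this;
        for (int i = 0; i < ngram.length(); i++) {
            int index = Arrays.binarySearch(node.keys, ngram.charAt(i));
            if (index < 0) {
                return 0f;
            }
            node = node.children[index];
        }
        return node.frequency;
    }
}
```

The root returned by `freeze()` plays the role of the empty ngram: `frequencyOf("a")` descends one level and `frequencyOf("ab")` two. Specialized nodes for small child counts, as described above, would avoid the per-node arrays at the price of more classes.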
How much memory can we save by using the Rust or Go implementations? What about their performance?
I changed the runtime memory model: the original JSON is translated to a dense map. This reduces memory requirements at the cost of speed (frequency lookups should be slower).
Frequencies are stored as Float instead of Double; this introduces a 0.001% error in the calculations, and the tests are updated accordingly.
The fastutil dependency has been removed.
All changes are made in internal classes, so this request is compatible with version 1.1, and I hope that the merge will be considered soon.
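To give a feel for the Float versus Double trade-off, the sketch below measures the rounding error of a single stored value. `FloatPrecision` is a hypothetical helper, not part of the PR, and the 0.001% figure above refers to the library's overall calculations rather than to this per-value bound.

```java
/** Hypothetical helper showing the cost of narrowing one frequency from double to float. */
final class FloatPrecision {

    /** Relative error introduced by storing the given non-zero value as a float. */
    static double relativeError(double value) {
        float stored = (float) value; // rounds to the nearest representable float
        return Math.abs(stored - value) / Math.abs(value);
    }
}
```

A primitive float takes 4 bytes instead of 8. For a single value in float's normal range the relative rounding error is at most 2^-24 (about 6e-8), and values that are exactly representable, such as 0.5, lose nothing. Errors can accumulate when many values are combined, which is why the tests needed adjusting.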