microsoft / FASTER

Fast persistent recoverable log and key-value store + cache, in C# and C++.
https://aka.ms/FASTER
MIT License

using fasterkv with string key and string value produces byte[] of humongous size in heap memory #868

Closed krishnakrrish closed 1 year ago

krishnakrrish commented 1 year ago

I have been trying to use the FasterKV RMW operation on a large set of strings (more than 10 million), but it causes my RAM usage to shoot up. When I checked with a memory profiling tool, I saw that a large number of string and byte[] objects are being created. I checked whether my own code allocates them, but it does not. Below is the code I use:

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading;
    using System.Threading.Tasks;
    using FASTER.core;

    public class program
    {
        // processorCount, ProcessorFilePath, executablePath and currentFileCount are defined elsewhere (not shown).
        private FasterKV<string, string> store;
        private LogSettings logSettings;
        private readonly SimpleFunctions<string, string> funcs =
            new SimpleFunctions<string, string>((a, b) => a + "," + b);

        public void main()
        {
            ConfigureFasterKv();
            store.Log.EmptyPageCount = 512;
            IEnumerable<string> files = File.ReadLines(ProcessorFilePath); // contains 10 million file paths in a file
            var partitioner = Partitioner.Create(files, EnumerablePartitionerOptions.NoBuffering).GetPartitions(5);
            SemaphoreSlim throttler = new SemaphoreSlim(processorCount);
            ParallelOptions parallelOptions = new ParallelOptions
            {
                MaxDegreeOfParallelism = processorCount // Limit parallelism to number of processors
            };
            Parallel.ForEach(partitioner, parallelOptions, filepart =>
            {
                throttler.Wait(); // Wait until a slot is available in the throttler
                try
                {
                    using (var session = store.NewSession(funcs)) // Create a single session outside the loop
                    {
                        string file, hash;
                        while (filepart.MoveNext())
                        {
                            file = filepart.Current;
                            hash = CalculateFileMD5Hash(file);
                            if (hash.Equals("Exception"))
                            {
                                /* Console.WriteLine($"Cannot get hash for file {file}"); */
                            }
                            else
                            {
                                string output = null;
                                var status = session.RMW(ref hash, ref output);
                                Interlocked.Increment(ref currentFileCount);
                            }
                        }
                        session.CompletePending(wait: true);
                    }
                }
                finally
                {
                    throttler.Release(); // Release the slot
                }
            });
            store.Reset();
            store.Dispose();
            logSettings.ObjectLogDevice.Dispose();
            logSettings.LogDevice.Dispose();
        }

        // Not static: it assigns the instance fields logSettings and store.
        private void ConfigureFasterKv()
        {
            logSettings = new LogSettings
            {
                /* ObjectLogDevice = Devices.CreateLogDevice(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "hlog.obj.log"), preallocateFile: true, deleteOnClose: true, useIoCompletionPort: true),
                   LogDevice = Devices.CreateLogDevice(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "hlog.log"), preallocateFile: true, deleteOnClose: true, useIoCompletionPort: true), */
                ObjectLogDevice = Devices.CreateLogDevice(Path.Combine(executablePath, "hlog.obj.log"), deleteOnClose: true),
                LogDevice = Devices.CreateLogDevice(Path.Combine(executablePath, "hlog.log"), deleteOnClose: true),
                MemorySizeBits = 25,
                SegmentSizeBits = 30,
                PageSizeBits = 15,
                MutableFraction = 0.5,
            };
            store = new FasterKV<string, string>(1L << 20, logSettings);
        }
    }

[image]

TedHartMS commented 1 year ago

Thanks for the report. I ran it with simple strings in 'files' and a stubbed CalculateFileMD5Hash, and I did not see a large number of byte[]. I did see a large number of String, most of which were attributed to the Partitioner. Maybe this will help narrow it down.

krishnakrrish commented 1 year ago

Hi, thanks for the insight. I actually ran the code against a large set of files: one folder with 5 million files of unique content, and another folder with 5 million files whose content duplicates the first folder. So if 10 million files share 5 million hashes, there will be 5 million key-value pairs, each with the hash as the key and the matching file paths joined with ',' as the value. I also checked whether the issue is with the Partitioner by commenting out the line: var status = session.RMW(ref hash, ref file); Below is the memory snapshot with that line commented out:

[image]

I will also attach the CalculateFileMD5Hash method below:

    using System;
    using System.IO;
    using System.Security.Cryptography;

    public static string CalculateFileMD5Hash(string filePath)
    {
        try
        {
            using (var md5 = MD5.Create())
            using (var stream = File.OpenRead(filePath))
            {
                byte[] hashBytes = md5.ComputeHash(stream);
                // Hex string without the '-' separators that BitConverter inserts
                return BitConverter.ToString(hashBytes).Replace("-", string.Empty);
            }
        }
        catch (Exception)
        {
            // Sentinel value that the caller checks for
            return "Exception";
        }
    }

krishnakrrish commented 1 year ago

Apologies, I also made an error in the previous code. The corrected RMW call is marked in the code below:

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading;
    using System.Threading.Tasks;
    using FASTER.core;

    public class program
    {
        // processorCount, ProcessorFilePath, executablePath and currentFileCount are defined elsewhere (not shown).
        private FasterKV<string, string> store;
        private LogSettings logSettings;
        private readonly SimpleFunctions<string, string> funcs =
            new SimpleFunctions<string, string>((a, b) => a + "," + b);

        public void main()
        {
            ConfigureFasterKv();
            store.Log.EmptyPageCount = 512;
            IEnumerable<string> files = File.ReadLines(ProcessorFilePath); // contains 10 million file paths in a file
            var partitioner = Partitioner.Create(files, EnumerablePartitionerOptions.NoBuffering).GetPartitions(5);
            SemaphoreSlim throttler = new SemaphoreSlim(processorCount);
            ParallelOptions parallelOptions = new ParallelOptions
            {
                MaxDegreeOfParallelism = processorCount // Limit parallelism to number of processors
            };
            Parallel.ForEach(partitioner, parallelOptions, filepart =>
            {
                throttler.Wait(); // Wait until a slot is available in the throttler
                try
                {
                    using (var session = store.NewSession(funcs)) // Create a single session outside the loop
                    {
                        string file, hash;
                        while (filepart.MoveNext())
                        {
                            file = filepart.Current;
                            hash = CalculateFileMD5Hash(file);
                            if (hash.Equals("Exception"))
                            {
                                /* Console.WriteLine($"Cannot get hash for file {file}"); */
                            }
                            else
                            {
                                // Corrected line: the file path is passed as the RMW input
                                var status = session.RMW(ref hash, ref file);
                                Interlocked.Increment(ref currentFileCount);
                            }
                        }
                        session.CompletePending(wait: true);
                    }
                }
                finally
                {
                    throttler.Release(); // Release the slot
                }
            });
            store.Reset();
            store.Dispose();
            logSettings.ObjectLogDevice.Dispose();
            logSettings.LogDevice.Dispose();
        }

        // Not static: it assigns the instance fields logSettings and store.
        private void ConfigureFasterKv()
        {
            logSettings = new LogSettings
            {
                /* ObjectLogDevice = Devices.CreateLogDevice(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "hlog.obj.log"), preallocateFile: true, deleteOnClose: true, useIoCompletionPort: true),
                   LogDevice = Devices.CreateLogDevice(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "hlog.log"), preallocateFile: true, deleteOnClose: true, useIoCompletionPort: true), */
                ObjectLogDevice = Devices.CreateLogDevice(Path.Combine(executablePath, "hlog.obj.log"), deleteOnClose: true),
                LogDevice = Devices.CreateLogDevice(Path.Combine(executablePath, "hlog.log"), deleteOnClose: true),
                MemorySizeBits = 25,
                SegmentSizeBits = 30,
                PageSizeBits = 15,
                MutableFraction = 0.5,
            };
            store = new FasterKV<string, string>(1L << 20, logSettings);
        }
    }

badrishc commented 1 year ago

It does not look like a problem in FASTER. If you can reproduce it stand-alone feel free to re-open and share the repro.

badrishc commented 1 year ago

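When using FASTER with objects, you have to provision the page size and memory size very carefully, because the log holds only references to the objects while the objects themselves live on the C# heap. See here for details: https://microsoft.github.io/FASTER/docs/fasterkv-tuning/#log-memory-size-with-c-heap-objects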
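
As a rough illustration of what this means for the settings posted in this thread, here is a back-of-the-envelope sketch (not a measurement). It assumes a 24-byte log record for object keys and values (an 8-byte header plus two 8-byte references), which is the figure I recall from the tuning guide. The symbols $s_k$ and $s_v$ are hypothetical average heap sizes of one key string and one value string; they are not numbers from the report.

```latex
\begin{align*}
\text{log memory} &= 2^{\texttt{MemorySizeBits}} = 2^{25}\ \text{B} = 32\ \text{MiB} \\
\text{page size}  &= 2^{\texttt{PageSizeBits}} = 2^{15}\ \text{B} = 32\ \text{KiB} \\
\text{pages in memory} &= 2^{25-15} = 1024 \\
\text{records in memory} &\approx \frac{2^{25}\ \text{B}}{24\ \text{B}} \approx 1.4 \times 10^{6} \\
\text{heap held by those records} &\approx 1.4 \times 10^{6} \cdot (s_k + s_v)
\end{align*}
```

Under these assumptions the 32 MiB log budget covers only the references. The strings those records point to are extra heap memory, and because each RMW appends another file path to the value, $s_v$ keeps growing for hashes shared by many files. If EmptyPageCount = 512 keeps half of the 1024 pages empty, as the setting's name suggests, the record count above would be roughly halved.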