michaelp opened 5 years ago
* I would recommend reverting the commit [de52b97](https://github.com/ironfede/openmcdf/commit/de52b9783864dceea3ca1c65334863ba63ab65b3). In general, validation should be an external piece of functionality that allows users to check the integrity of the storage layer
One of the reasons for those changes was to avoid problems with the reader getting stuck in an infinite loop and/or going round in circles before throwing an OutOfMemoryException, which doesn't seem like something that should be left to callers?
Possibly it can be done in a better way, though (it seems a shame that HashSet didn't have a means of reserving capacity until the latest .NET runtime versions)
In essence, we are facing the standard FAT filesystem consistency check. I don't remember the details by heart, but I am confident that we can come up with a solution ( :) finding a loop in a linked list is a standard interview question)
Benchmark project committed. @michaelp do you have the time to try something different from the HashSet container to check for loops AND keep performance at optimal levels?
@ironfede yes I will try it this weekend. I will try Floyd's algorithm for loop detection
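For reference, Floyd's tortoise-and-hare detection can be sketched over a FAT-style next-sector table like this. This is a language-agnostic Python sketch under simplifying assumptions (a plain array of next indices and an `ENDOFCHAIN` sentinel), not OpenMcdf's actual structures:

```python
ENDOFCHAIN = -2  # sentinel for "no next sector", analogous to the compound-file marker

def has_cycle(fat, start):
    """Floyd's tortoise-and-hare over a next-sector table.

    fat[i] is the index of the sector following sector i, or ENDOFCHAIN.
    Uses O(1) extra memory instead of a HashSet of visited indices.
    """
    def step(i):
        return ENDOFCHAIN if i == ENDOFCHAIN else fat[i]

    slow = fast = start
    while fast != ENDOFCHAIN and step(fast) != ENDOFCHAIN:
        slow = step(slow)         # advance one sector
        fast = step(step(fast))   # advance two sectors
        if slow == fast:
            return True           # the pointers met inside a loop
    return False                  # reached end of chain: no loop

# A healthy chain 0 -> 1 -> 2 -> end, and one where sector 2 points back to 0
print(has_cycle([1, 2, ENDOFCHAIN], 0))  # False
print(has_cycle([1, 2, 0], 0))           # True
```

The trade-off versus the HashSet approach is no allocation at all, at the cost of traversing the chain roughly 1.5x (the hare moves twice per step).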
@ironfede
Federico, it would be nice if we could discuss a few general topics like
As you're looking in this area, and #40 is related to the checks anyway: If I debug the Test_CorruptedSectorChain_Doc test, then it gets to
With header.DIFATSectorsNumber set to 268435456, which in #40 was causing it to run out of memory before the existing 'validationCount' check failed. Is it possible in such cases to detect that the value is invalid before even entering the parse loop (e.g. because there aren't that many sectors available)? (My knowledge of the format is too basic here to be sure of the best way to validate.)
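One cheap sanity check along those lines: a sector count in the header can't exceed the number of sectors the file could physically contain, which follows from the file length. A sketch of the idea (the function name, the 512-byte header, and the default sector size are assumptions here, not OpenMcdf's API):

```python
def difat_count_plausible(difat_sectors_number, file_length, sector_size=512):
    """Reject header sector counts that exceed what the file can hold.

    Assumes a 512-byte header followed by fixed-size sectors; a real check
    would read the sector size from the header's sector-shift field.
    """
    max_sectors = max(0, (file_length - 512) // sector_size)
    return 0 <= difat_sectors_number <= max_sectors

# The corrupt value from #40 against a ~10 MB file is obviously impossible
print(difat_count_plausible(268435456, 10_000_000))  # False
print(difat_count_plausible(2, 10_000_000))          # True
```

This would only catch corrupt header fields, not a corrupt sector chain, but it fails fast before any large allocation happens.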
@michaelp sorry for the delay in responses, but I'm currently really busy with my main work activities. I'm really interested in your points to improve performance and architecture, especially the last one. Do you think that a separate branch could work to explore some ideas? @Numpsy, your idea could be useful, but it would only work for local header corruption, not for sector chain corruption, where you can't predict block linking without traversing the linked list. This is the more subtle type of corruption and probably the hardest to detect without a lot of allocation in a lookup data structure (=> I'm really interested in a Floyd algorithm implementation as suggested by @michaelp)
@ironfede I am writing some tests and I need a few more evenings to come up with a solution to the current task. (Just keeping you in the loop.) I want to deliver what I've started and then we can talk.
I do realize it has been 5 years, but did anything ever come out of this?
We are currently looking into a performance issue when creating lots of Storages/Streams, and it seems to point to the EnsureUniqueSectorIndex function.
I have used this in the OpenMcdf.PerfTest project:
```cs
using System;
using System.Diagnostics;
using System.IO;
using OpenMcdf;

class Program
{
    static int MAX_STORAGE_COUNT = 500;
    static int MAX_STREAM_COUNT = 50;
    static String fileName = "PerfLoad.cfs";

    static void Main(string[] args)
    {
        var stopWatch = Stopwatch.StartNew();
        File.Delete(fileName);
        if (!File.Exists(fileName))
        {
            CreateFile(fileName);
        }
        Console.WriteLine($"Create took {stopWatch.Elapsed}");

        CompoundFile cf = new CompoundFile(fileName);
        stopWatch.Restart();
        CFStream s = cf.RootStorage.GetStream("Test1");
        Console.WriteLine($"Read took {stopWatch.Elapsed}");
        Console.Read();
    }

    private static void CreateFile(String fn)
    {
        CompoundFile cf = new CompoundFile();
        // Helpers.GetBuffer comes from the OpenMcdf.PerfTest project
        for (int j = 0; j < MAX_STREAM_COUNT; j++)
        {
            cf.RootStorage.AddStream("Test" + j.ToString()).SetData(Helpers.GetBuffer(300));
        }
        for (int i = 0; i < MAX_STORAGE_COUNT; i++)
        {
            var storage = cf.RootStorage.AddStorage("Storage" + i.ToString());
            for (int j = 0; j < MAX_STREAM_COUNT; j++)
            {
                storage.AddStream("Test" + j.ToString()).SetData(Helpers.GetBuffer(300));
            }
        }
        cf.Save(fn);
        cf.Close();
    }
}
```
Would you accept a PR for an additional Configuration flag in the CFSConfiguration enum that disables the unique sector index check?
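To make the proposal concrete, here is the flags-enum shape I mean, sketched in Python's `enum.Flag`. The `DisableSectorChainCheck` name is made up for illustration, and the member set here is only an approximation of OpenMcdf's real CFSConfiguration enum:

```python
from enum import Flag, auto

class CFSConfiguration(Flag):
    # Approximate mirror of OpenMcdf's C# flags enum; values are illustrative
    Default = auto()
    SectorRecycle = auto()
    EraseFreeSectors = auto()
    DisableSectorChainCheck = auto()  # proposed: skip EnsureUniqueSectorIndex

# Callers who accept the corruption risk opt out explicitly
config = CFSConfiguration.Default | CFSConfiguration.DisableSectorChainCheck
print(CFSConfiguration.DisableSectorChainCheck in config)  # True
```

Keeping it as an opt-in flag means existing callers keep the safe behaviour by default.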
I have done some improvements that help a tiny bit at least...
My test on my machine looks like this, with validation (properly) enabled versus disabled:
Create with validation took 00:00:39.3890968
Create without validation took 00:00:30.0826849
That's a nice ~24% improvement
I wanted to have a look at the earlier-suggested algorithm for detecting loops at some point, but then fell over a number of bugs that needed to be fixed first, and never got to it :-(
I am trying to take a look at the performance degradation in the last few commits. Regardless of the results, I would like to share the initial benchmark that I created.
**Introduction**

There are several test projects in the openmcdf solution whose names/intent suggest performance testing. I might be missing something, but they seem to be more of an internal tool used by the core maintainers than an attempt at a benchmark test suite of any kind. In addition, the lack of a standard format for benchmark results makes it difficult to reproduce results on different machines and to communicate changes/findings/improvements.
Assumptions:
Goals:
In addition, my initial scenario is very simple.
Here are the starting numbers, right off master. Notice that we read the same amount of information and the results are sorted from fastest to slowest. The last lines are the problematic ones.
A quick investigation led to the hypothesis that

```cs
private static void EnsureUniqueSectorIndex(int nextSecID, HashSet<int> processedSectors)
```

in the file CompoundFile.cs is one of the causes of the performance slowdown. A quick review of the area shows that a HashSet created without a proper capacity will constantly grow and reallocate its internal arrays, which in turn causes GC pressure and can slow things down. To validate my hypothesis I made a single change
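To illustrate why the missing capacity matters: a hash container that grows when full must rehash everything it already holds on every grow. Here is a rough Python model of that cost, assuming simple capacity doubling (the actual .NET HashSet uses prime-based resizing, so the exact numbers differ, but the shape of the cost is the same):

```python
def reallocations(n_items, initial_capacity=8):
    """Count grows and copied entries when inserting n_items one by one
    into a container that doubles its capacity whenever it fills up."""
    capacity, size = initial_capacity, 0
    reallocs, copied = 0, 0
    for _ in range(n_items):
        if size == capacity:
            capacity *= 2
            reallocs += 1
            copied += size  # every existing entry is rehashed/copied on a grow
        size += 1
    return reallocs, copied

# Growing to a million entries piecemeal vs. presizing once
print(reallocations(1_000_000))             # many grows, ~1M entries copied
print(reallocations(1_000_000, 1_048_576))  # presized: (0, 0)
```

Presizing (`new HashSet<int>(expectedSectorCount)` on runtimes that support it) eliminates all of that intermediate copying in one line.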
And the results are way better
Conclusion: