Recursive Extractor is a Cross-Platform .NET Standard 2.0 Library and Command Line Program for parsing archive files and disk images, including nested archives and disk images.
Supported formats:

| | | |
|---|---|---|
| 7zip+ | ar | bzip2 |
| deb | dmg** | gzip |
| iso | rar^ | tar |
| vhd | vhdx | vmdk |
| wim* | xz | zip+ |
To install the command-line interface as a global dotnet tool:
dotnet tool install -g Microsoft.CST.RecursiveExtractor.Cli
This adds RecursiveExtractor to your PATH so you can run it directly from your shell.
Basic usage is: RecursiveExtractor --input archive.ext --output outputDirectory
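For example, to extract a hypothetical archive.zip (including any nested archives or disk images it contains) into a directory named output:

```bash
RecursiveExtractor --input archive.zip --output output
```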
Recursive Extractor is available on NuGet as Microsoft.CST.RecursiveExtractor. It targets netstandard2.0+ and the latest .NET releases, currently .NET 6.0, .NET 7.0, and .NET 8.0.
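To add the library to a project you can use your IDE's NuGet package manager, or the dotnet CLI:

```bash
dotnet add package Microsoft.CST.RecursiveExtractor
```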
The most basic usage is to enumerate through all the files in the archive provided and do something with their contents as a Stream.
using Microsoft.CST.RecursiveExtractor;
var path = "path/to/file";
var extractor = new Extractor();
foreach(var file in extractor.Extract(path))
{
doSomething(file.Content); //Do Something with the file contents (a Stream)
}
RecursiveExtractor protects against ZipSlip, quines, and zip bombs. Calls to Extract will throw an OverflowException when a quine or zip bomb is detected, and a TimeoutException if EnableTiming is set and the specified time period has elapsed before extraction completes. Otherwise, invalid files found while crawling will emit a logger message and be skipped. You can also enable ExtractSelfOnFail to return the original archive file when extraction fails.
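These behaviors are configured through the options object accepted by Extract. Below is a minimal sketch; it assumes the options type is named ExtractorOptions and that it exposes a Timeout value alongside the EnableTiming and ExtractSelfOnFail switches described above (check the library's documentation for the exact property names in your version):

```csharp
using System;
using Microsoft.CST.RecursiveExtractor;

var extractor = new Extractor();
var options = new ExtractorOptions()
{
    EnableTiming = true,               // enforce a time limit on extraction
    Timeout = TimeSpan.FromMinutes(5), // assumed property: the limit used when EnableTiming is set
    ExtractSelfOnFail = true           // return the original archive if extraction of it fails
};

try
{
    foreach (var file in extractor.Extract("path/to/archive.zip", options))
    {
        // Do something with file.Content
    }
}
catch (OverflowException)
{
    // A quine or zip bomb was detected
}
catch (TimeoutException)
{
    // EnableTiming was set and the time limit elapsed before extraction finished
}
```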
You should not iterate the enumeration returned from the Extract and ExtractAsync interfaces more than once; if you need to do so, convert the enumeration to an in-memory collection first.
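For example, materializing the results with LINQ makes repeated iteration safe, at the cost of holding every FileEntry (and its Content stream) in memory at once:

```csharp
using System.Linq;
using Microsoft.CST.RecursiveExtractor;

var extractor = new Extractor();
// ToList() enumerates the extraction exactly once; the resulting list can be iterated any number of times.
var files = extractor.Extract("path/to/archive.zip").ToList();
```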
If you want to process the extracted entries in parallel, you should use a batching mechanism, for example:
// Requires: using System.Collections.Concurrent; using System.Threading; and using System.Threading.Tasks;
// Assumes fileEntry, opts, BatchSize, and cts (a CancellationTokenSource) are already defined.
var extractedEnumeration = Extract(fileEntry, opts);
using var enumerator = extractedEnumeration.GetEnumerator();
ConcurrentBag<FileEntry> entryBatch = new();
bool moreAvailable = enumerator.MoveNext();
while (moreAvailable)
{
    entryBatch = new();
    // Fill a batch of up to BatchSize entries from the lazy enumeration.
    for (int i = 0; i < BatchSize; i++)
    {
        entryBatch.Add(enumerator.Current);
        moreAvailable = enumerator.MoveNext();
        if (!moreAvailable)
        {
            break;
        }
    }
    if (entryBatch.Count == 0)
    {
        break;
    }
    // Run your parallel processing on the batch
    Parallel.ForEach(entryBatch, new ParallelOptions() { CancellationToken = cts.Token }, entry =>
    {
        // Do something with each FileEntry
    });
}
If you are working with a very large archive or in a particularly constrained environment, you can reduce memory and file handle usage by disposing of each FileEntry's Content stream as you iterate.
var results = extractor.Extract(path);
foreach(var file in results)
{
using var theStream = file.Content;
// Do something with the stream.
_ = theStream.ReadByte();
// The stream is disposed here by the using statement
}
If you have any issues or feature requests (for example, supporting other formats) you can open a new Issue.
If you are having trouble parsing a specific archive in one of the supported formats, it is helpful to include a sample archive with your report that demonstrates the issue.
Recursive Extractor aims to provide a unified interface to extract arbitrary archives and relies on a number of libraries to parse the archives.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.