Open yuzhengjun1204 opened 1 week ago
Hi!
Is it possible to divide a large folder into n smaller files and process the compress function individually?
Bit7z cannot automatically separate the files into subsets for you.
If you manage to do it in your code, one possible way to achieve what you need is to create an initial archive from a known subset of the files (e.g., via BitArchiveWriter
or BitFileCompressor
), and then incrementally update the newly created archive with the remaining known subsets of files (e.g., via BitArchiveEditor
, or using BitFileCompressor
with setUpdateMode(UpdateMode::Append)
).
Thank you for your response, I will give it a try.
Hi!
Is it possible to divide a large folder into n smaller files and process the compress function individually?
Bit7z cannot automatically separate the files into subsets for you. If you manage to do it in your code, one possible way to achieve what you need is to create an initial archive from a known subset of the files (e.g., via
BitArchiveWriter
orBitFileCompressor
), and then incrementally update the newly created archive with the remaining known subsets of files (e.g., viaBitArchiveEditor
, or usingBitFileCompressor
withsetUpdateMode(UpdateMode::Append)
).
When I split a large folder into individual files for processing, I find the speed to be very slow. Is this normal, and what can I do to improve the compression speed? my code is like this`
Bit7zLibrary lib{"7z.dll"};
BitFileCompressor compressor{lib, BitFormat::Zip};
// .....
for(auto& file_info: file_list){
compressor.setCompressionLevel(BitCompressionLevel::Normal);
std::map<std::string, std::string> temp_file_map;
temp_file_map[file_info.absolute_file_path_] = file_info.relative_file_path_;
compressor.compress(temp_file_map, dst_file_path_.toStdString());
compressor.setOverwriteMode(OverwriteMode::None);
compressor.setUpdateMode(UpdateMode::Append);
}
A compressed package of about 15GB takes about 5 minutes using this method, while it might take 50 minutes using the above method. file_list only include a directory(about 15GB).
BitFileCompressor compressor{lib, BitFormat::Zip};
std::vector<std::string> file_list;
// .....
compressor.setOverwriteMode(OverwriteMode::Overwrite);
compressor.setCompressionLevel(BitCompressionLevel::Normal);
compressor.compress(file_list, dst_file_path);
A few notes first:
setCompressionLevel
, setOverwriteMode
and setUpdateMode
once outside the loop, as the values they set will be used for all compression operations.
BitFileCompressor
is already None
, so you don't need to set it.As for the performance of the code:
file_info
refers to single files, not directories. If this is the case, you can use compressor.compressFile(file_info.absolute_file_path_, dst_file_path_.toStdString(), file_info.relative_file_path_)
.
std::map
on each iteration. So yes, the slowness is normal, especially for such a large archive and because you're doing a lot of single file append compressions. The only way to improve performance is to reduce the number of compressions, which is exactly what you have done here:
A compressed package of about 15GB takes about 5 minutes using this method, while it might take 50 minutes using the above method. file_list only include a directory(about 15GB).
BitFileCompressor compressor{lib, BitFormat::Zip}; std::vector<std::string> file_list; // ..... compressor.setOverwriteMode(OverwriteMode::Overwrite); compressor.setCompressionLevel(BitCompressionLevel::Normal); compressor.compress(file_list, dst_file_path);
In my first reply, I suggested the "update" approach because it was the only way to achieve what you originally described in this issue. But I don't think it's the best approach, performance-wise, as you also noted.
Ideally, only one compression should be performed. If this is not possible, the fewer compressions the better. I would avoid doing single file updates, but rather compress "clusters" of files together.
In the issue, you mentioned that your program terminated because of an error during the compression process. This could be due to an underlying problem with the files being compressed, so I would check that before taking the "update" approach.
A few notes first:
You can call
setCompressionLevel
,setOverwriteMode
andsetUpdateMode
once outside the loop, as the values they set will be used for all compression operations.
- Unless, of course, you need to change the settings for each compression operation, which doesn't seem to be the case from your code.
- The default overwrite mode of
BitFileCompressor
is alreadyNone
, so you don't need to set it.As for the performance of the code:
I guess that
file_info
refers to single files, not directories. If this is the case, you can usecompressor.compressFile(file_info.absolute_file_path_, dst_file_path_.toStdString(), file_info.relative_file_path_)
.
- This has the advantage of not having to allocate a
std::map
on each iteration.- The allocation of the map is probably not the bottleneck here, though, so I don't expect any significant improvement.
Each compression operation that needs to append files to an existing archive is increasingly expensive in terms of performance:
- Updating an archive basically works by creating a new temporary archive, copying the data from the original archive, appending the newly compressed data, and then overwriting the original archive with the temporary one.
- This is unavoidable due to how 7-Zip's interface works (and 7-Zip GUI/CLI does the same, as far as I know).
So yes, the slowness is normal, especially for such a large archive and because you're doing a lot of single file append compressions. The only way to improve performance is to reduce the number of compressions, which is exactly what you have done here:
A compressed package of about 15GB takes about 5 minutes using this method, while it might take 50 minutes using the above method. file_list only include a directory(about 15GB).
BitFileCompressor compressor{lib, BitFormat::Zip}; std::vector<std::string> file_list; // ..... compressor.setOverwriteMode(OverwriteMode::Overwrite); compressor.setCompressionLevel(BitCompressionLevel::Normal); compressor.compress(file_list, dst_file_path);
In my first reply, I suggested the "update" approach because it was the only way to achieve what you originally described in this issue. But I don't think it's the best approach, performance-wise, as you also noted.
Ideally, only one compression should be performed. If this is not possible, the fewer compressions the better. I would avoid doing single file updates, but rather compress "clusters" of files together.
In the issue, you mentioned that your program terminated because of an error during the compression process. This could be due to an underlying problem with the files being compressed, so I would check that before taking the "update" approach.
When compressing, if there is an exception, it will throw an error message and stop compressing. Is there a way to throw an error message but not stop compressing? I hope the compression can continue and skip those erroneous files.
When compressing a particularly large file, about 200GB, a stack overflow may occur, causing an abnormal exit. Are there any good ways to handle this?`
BitFileCompressor compressor{lib, BitFormat::Zip};
std::vector<std::string> file_list;
// ...
compressor.setCompressionLevel(BitCompressionLevel::Normal);
compressor.compress(file_list, dst_file_path_.toStdString());
`
When compressing, if there is an exception, it will throw an error message and stop compressing. Is there a way to throw an error message but not stop compressing? I hope the compression can continue and skip those erroneous files.
Unfortunately, no. But I think it could be a useful feature, so I'll definitely add it to the library.
Here is what you can do for now:
BitException
provides a failedFiles()
method which can help) and retry the whole compression without those files; orWhen compressing a particularly large file, about 200GB, a stack overflow may occur, causing an abnormal exit. Are there any good ways to handle this?
Um, that's weird, I'll try to replicate the problem. What OS, compiler and configuration flags are you using? Is the 7-Zip GUI/CLI successfully compressing the same file or is it failing?
7-zip GUI is ok. I can use try, catch(...), to catch the stack overflow exception and getlasterror=5, which is basically reproducible during my tests. I hope to avoid stack overflow exceptions. 7-zip: 24.05 OS: Win10, MSVC2019, x86. flags: BIT7Z_AUTO_FORMAT、BIT7Z_REGEX_MATCHING、BIT7Z_AUTO_PREFIX_LONG_PATHS、BIT7Z_PATH_SANITIZATION
Thanks for the further details.
I've already tried a similar configuration: MSVC 2022, x86, same flags, 7-Zip 23.01, single ~110GB file to be compressed as a zip file. In my tests, I could not reproduce the issue, and the compression finished flawlessly. I'll try to use a configuration closer to yours and see if I can reproduce the issue.
Thanks for the further details.
I've already tried a similar configuration: MSVC 2022, x86, same flags, 7-Zip 23.01, single ~110GB file to be compressed as a zip file. In my tests, I could not reproduce the issue, and the compression finished flawlessly. I'll try to use a configuration closer to yours and see if I can reproduce the issue.
Test code. After running compressor.compress, process crashed.
#include <iostream>
#include "include/bit7z/bitextractor.hpp"
#include "include/bit7z/bitcompressor.hpp"
#include "include/bit7z/bitfilecompressor.hpp"
#include "include/bit7z/bitexception.hpp"
#include "include/bit7z/bitfileextractor.hpp"
using namespace bit7z;
void compressfile() {
try { // bit7z classes can throw BitException objects
using namespace bit7z;
Bit7zLibrary lib{ "7z.dll" };
BitFileCompressor compressor{ lib, BitFormat::Zip };
compressor.setProgressCallback([&](uint64_t in_size) -> bool {
std::cout << "setProgressCallback: " << in_size << std::endl;
return true;
});
compressor.setFileCallback([&](std::string file_path) {
std::cout << "filecallback" << file_path << std::endl;
});
std::vector<std::string> file_list;
file_list.push_back("d:\\");
compressor.compress(file_list, "d://d.zip");
}
catch (const bit7z::BitException& ex) {
std::cout << "Hello World!" << ex.code() << ex.what();
}
}
int main()
{
std::cout << "Hello World!\n";
compressfile();
}
This code can catch error:3. Debug output bit7z_test.exe (Win32): Loaded "C:\Windows\System32\DriverStore\FileRepository\iigd_dch.inf_amd64_4be767c332df1d04\igdumdim32.dll". Exception at 0x773C590F (ntdll.dll) in bit7z_test.exe: 0xC00000FD: Stack overflow (parameters: 0x00000001, 0x01042FD0).
void compressfile() {
try { // bit7z classes can throw BitException objects
using namespace bit7z;
Bit7zLibrary lib{ "7z.dll" };
BitFileCompressor compressor{ lib, BitFormat::Zip };
compressor.setProgressCallback([&](uint64_t in_size) -> bool {
std::cout << "setProgressCallback: " << in_size << std::endl;
return true;
});
compressor.setFileCallback([&](std::string file_path) {
std::cout << "filecallback" << file_path << std::endl;
});
std::vector<std::string> file_list;
file_list.push_back("d:\\");
compressor.compress(file_list, "d://d.zip");
}
catch (const bit7z::BitException& ex) {
std::cout << "error:1" << ex.code() << ex.what();
}
catch (const std::exception& e) {
std::cout << "error:2" << e.what();
}
catch (...) {
std::cout << "error:3:";
}
}
It seems that the issue is caused by the recursive call in this code.
void FilesystemIndexer::listDirectoryItems( vector< unique_ptr< GenericInputItem > >& result,
bool recursive,
const fs::path& prefix ) {
fs::path path = mDirItem.filesystemPath();
if ( !prefix.empty() ) {
path = path / prefix;
}
const bool includeRootPath = mFilter.empty() ||
!mDirItem.filesystemPath().has_parent_path() ||
mDirItem.inArchivePath().filename() != mDirItem.filesystemName();
const bool shouldIncludeMatchedItems = mPolicy == FilterPolicy::Include;
std::error_code error;
for ( const auto& currentEntry : fs::directory_iterator( path, error ) ) {
auto searchPath = includeRootPath ? mDirItem.inArchivePath() : fs::path{};
if ( !prefix.empty() ) {
searchPath = searchPath.empty() ? prefix : searchPath / prefix;
}
const FilesystemItem currentItem{ currentEntry, searchPath, mSymlinkPolicy };
/* An item matches if:
* - Its name matches the wildcard pattern, and
* - Either is a file, or we are interested also to include folders in the index.
*
* Note: The boolean expression uses short-circuiting to optimize the evaluation. */
const bool itemMatches = ( !mOnlyFiles || !currentItem.isDir() ) &&
fsutil::wildcard_match( mFilter, currentItem.name() );
if ( itemMatches == shouldIncludeMatchedItems ) {
result.emplace_back( std::make_unique< FilesystemItem >( currentItem ) );
}
if ( currentItem.isDir() && ( recursive || ( itemMatches == shouldIncludeMatchedItems ) ) ) {
//currentItem is a directory, and we must list it only if:
// > indexing is done recursively
// > indexing is not recursive, but the directory name matched the filter.
const fs::path nextDir = prefix.empty() ?
currentItem.filesystemName() : prefix / currentItem.filesystemName();
listDirectoryItems( result, true, nextDir );
}
}
}
Ah, I see! I was investigating a file size issue, but the problem is actually caused by trying to recursively index a directory structure that is too deeply nested. I'll try rewriting the indexing code to not use recursion. Thanks for the additional details, they really helped!
Feature description
When compressing a very large folder, calling the compress function can take a long time and it will exit when an error is output. Is it possible to divide a large folder into n smaller files and process the compress function individually?
Additional context
No response
Code of Conduct