[Feature Request]: Compressing large folders,For example, compressing the entire D: drive

rikyoz / bit7z

A C++ static library offering a clean and simple interface to the 7-zip shared libraries.

https://rikyoz.github.io/bit7z

Mozilla Public License 2.0

602 stars 110 forks source link

[Feature Request]: Compressing large folders,For example, compressing the entire D: drive #223

Open yuzhengjun1204 opened 1 week ago

yuzhengjun1204 commented 1 week ago

Feature description

When compressing a very large folder, calling the compress function can take a long time and it will exit when an error is output. Is it possible to divide a large folder into n smaller files and process the compress function individually?

Additional context

No response

Code of Conduct

[X] By submitting this issue, I agree to follow bit7z's Code of Conduct

rikyoz commented 1 week ago

Hi!

Is it possible to divide a large folder into n smaller files and process the compress function individually?

Bit7z cannot automatically separate the files into subsets for you. If you manage to do it in your code, one possible way to achieve what you need is to create an initial archive from a known subset of the files (e.g., via BitArchiveWriter or BitFileCompressor), and then incrementally update the newly created archive with the remaining known subsets of files (e.g., via BitArchiveEditor, or using BitFileCompressor with setUpdateMode(UpdateMode::Append)).

yuzhengjun1204 commented 1 week ago

Thank you for your response, I will give it a try.

yuzhengjun1204 commented 5 days ago

Hi!

Is it possible to divide a large folder into n smaller files and process the compress function individually?

Bit7z cannot automatically separate the files into subsets for you. If you manage to do it in your code, one possible way to achieve what you need is to create an initial archive from a known subset of the files (e.g., via BitArchiveWriter or BitFileCompressor), and then incrementally update the newly created archive with the remaining known subsets of files (e.g., via BitArchiveEditor, or using BitFileCompressor with setUpdateMode(UpdateMode::Append)).

When I split a large folder into individual files for processing, I find the speed to be very slow. Is this normal, and what can I do to improve the compression speed? my code is like this`

Bit7zLibrary lib{"7z.dll"};
BitFileCompressor compressor{lib, BitFormat::Zip};
// .....
for(auto& file_info: file_list){
    compressor.setCompressionLevel(BitCompressionLevel::Normal);
    std::map<std::string, std::string> temp_file_map;
    temp_file_map[file_info.absolute_file_path_] = file_info.relative_file_path_;
    compressor.compress(temp_file_map, dst_file_path_.toStdString());
    compressor.setOverwriteMode(OverwriteMode::None);
    compressor.setUpdateMode(UpdateMode::Append);
}

yuzhengjun1204 commented 5 days ago

A compressed package of about 15GB takes about 5 minutes using this method, while it might take 50 minutes using the above method. file_list only include a directory(about 15GB).

BitFileCompressor compressor{lib, BitFormat::Zip};
std::vector<std::string> file_list;
// .....
compressor.setOverwriteMode(OverwriteMode::Overwrite);
compressor.setCompressionLevel(BitCompressionLevel::Normal);
compressor.compress(file_list, dst_file_path);

rikyoz commented 5 days ago

A few notes first:

You can call setCompressionLevel, setOverwriteMode and setUpdateMode once outside the loop, as the values they set will be used for all compression operations.
- Unless, of course, you need to change the settings for each compression operation, which doesn't seem to be the case from your code.
The default overwrite mode of BitFileCompressor is already None, so you don't need to set it.

As for the performance of the code:

I guess that file_info refers to single files, not directories. If this is the case, you can use compressor.compressFile(file_info.absolute_file_path_, dst_file_path_.toStdString(), file_info.relative_file_path_).
- This has the advantage of not having to allocate a std::map on each iteration.
- The allocation of the map is probably not the bottleneck here, though, so I don't expect any significant improvement.
Each compression operation that needs to append files to an existing archive is increasingly expensive in terms of performance:
- Updating an archive basically works by creating a new temporary archive, copying the data from the original archive, appending the newly compressed data, and then overwriting the original archive with the temporary one.
- This is unavoidable due to how 7-Zip's interface works (and 7-Zip GUI/CLI does the same, as far as I know).

So yes, the slowness is normal, especially for such a large archive and because you're doing a lot of single file append compressions. The only way to improve performance is to reduce the number of compressions, which is exactly what you have done here:

A compressed package of about 15GB takes about 5 minutes using this method, while it might take 50 minutes using the above method. file_list only include a directory(about 15GB).
BitFileCompressor compressor{lib, BitFormat::Zip};
std::vector<std::string> file_list;
// .....
compressor.setOverwriteMode(OverwriteMode::Overwrite);
compressor.setCompressionLevel(BitCompressionLevel::Normal);
compressor.compress(file_list, dst_file_path);

In my first reply, I suggested the "update" approach because it was the only way to achieve what you originally described in this issue. But I don't think it's the best approach, performance-wise, as you also noted.

Ideally, only one compression should be performed. If this is not possible, the fewer compressions the better. I would avoid doing single file updates, but rather compress "clusters" of files together.

In the issue, you mentioned that your program terminated because of an error during the compression process. This could be due to an underlying problem with the files being compressed, so I would check that before taking the "update" approach.

yuzhengjun1204 commented 5 days ago

A few notes first:

You can call setCompressionLevel, setOverwriteMode and setUpdateMode once outside the loop, as the values they set will be used for all compression operations.

Unless, of course, you need to change the settings for each compression operation, which doesn't seem to be the case from your code.

The default overwrite mode of BitFileCompressor is already None, so you don't need to set it.

As for the performance of the code:

I guess that file_info refers to single files, not directories. If this is the case, you can use compressor.compressFile(file_info.absolute_file_path_, dst_file_path_.toStdString(), file_info.relative_file_path_).

This has the advantage of not having to allocate a std::map on each iteration.

The allocation of the map is probably not the bottleneck here, though, so I don't expect any significant improvement.

Each compression operation that needs to append files to an existing archive is increasingly expensive in terms of performance:

Updating an archive basically works by creating a new temporary archive, copying the data from the original archive, appending the newly compressed data, and then overwriting the original archive with the temporary one.

This is unavoidable due to how 7-Zip's interface works (and 7-Zip GUI/CLI does the same, as far as I know).

So yes, the slowness is normal, especially for such a large archive and because you're doing a lot of single file append compressions. The only way to improve performance is to reduce the number of compressions, which is exactly what you have done here:
A compressed package of about 15GB takes about 5 minutes using this method, while it might take 50 minutes using the above method. file_list only include a directory(about 15GB).
BitFileCompressor compressor{lib, BitFormat::Zip};
std::vector<std::string> file_list;
// .....
compressor.setOverwriteMode(OverwriteMode::Overwrite);
compressor.setCompressionLevel(BitCompressionLevel::Normal);
compressor.compress(file_list, dst_file_path);
In my first reply, I suggested the "update" approach because it was the only way to achieve what you originally described in this issue. But I don't think it's the best approach, performance-wise, as you also noted.

Ideally, only one compression should be performed. If this is not possible, the fewer compressions the better. I would avoid doing single file updates, but rather compress "clusters" of files together.

In the issue, you mentioned that your program terminated because of an error during the compression process. This could be due to an underlying problem with the files being compressed, so I would check that before taking the "update" approach.

When compressing, if there is an exception, it will throw an error message and stop compressing. Is there a way to throw an error message but not stop compressing? I hope the compression can continue and skip those erroneous files.

yuzhengjun1204 commented 4 days ago

When compressing a particularly large file, about 200GB, a stack overflow may occur, causing an abnormal exit. Are there any good ways to handle this?`

BitFileCompressor compressor{lib, BitFormat::Zip};
std::vector<std::string> file_list;
// ...
compressor.setCompressionLevel(BitCompressionLevel::Normal);
compressor.compress(file_list, dst_file_path_.toStdString());

rikyoz commented 4 days ago

When compressing, if there is an exception, it will throw an error message and stop compressing. Is there a way to throw an error message but not stop compressing? I hope the compression can continue and skip those erroneous files.

Unfortunately, no. But I think it could be a useful feature, so I'll definitely add it to the library.

Here is what you can do for now:

Try the whole compression once, check which files are failing (BitException provides a failedFiles() method which can help) and retry the whole compression without those files; or
Use incremental compression of clusters of files, as explained above; or
If the files you need to compress are always the same, you could check the reason why they cannot be compressed and avoid compressing them preemptively.

When compressing a particularly large file, about 200GB, a stack overflow may occur, causing an abnormal exit. Are there any good ways to handle this?

Um, that's weird, I'll try to replicate the problem. What OS, compiler and configuration flags are you using? Is the 7-Zip GUI/CLI successfully compressing the same file or is it failing?

yuzhengjun1204 commented 3 days ago

7-zip GUI is ok. I can use try, catch(...), to catch the stack overflow exception and getlasterror=5, which is basically reproducible during my tests. I hope to avoid stack overflow exceptions. 7-zip: 24.05 OS: Win10, MSVC2019, x86. flags: BIT7Z_AUTO_FORMAT、BIT7Z_REGEX_MATCHING、BIT7Z_AUTO_PREFIX_LONG_PATHS、BIT7Z_PATH_SANITIZATION

rikyoz commented 2 days ago

Thanks for the further details.

I've already tried a similar configuration: MSVC 2022, x86, same flags, 7-Zip 23.01, single ~110GB file to be compressed as a zip file. In my tests, I could not reproduce the issue, and the compression finished flawlessly. I'll try to use a configuration closer to yours and see if I can reproduce the issue.

yuzhengjun1204 commented 2 days ago

Thanks for the further details.

I've already tried a similar configuration: MSVC 2022, x86, same flags, 7-Zip 23.01, single ~110GB file to be compressed as a zip file. In my tests, I could not reproduce the issue, and the compression finished flawlessly. I'll try to use a configuration closer to yours and see if I can reproduce the issue.

Test code. After running compressor.compress, process crashed.

#include <iostream>
#include "include/bit7z/bitextractor.hpp"
#include "include/bit7z/bitcompressor.hpp"
#include "include/bit7z/bitfilecompressor.hpp"
#include "include/bit7z/bitexception.hpp"
#include "include/bit7z/bitfileextractor.hpp"
using namespace  bit7z;
void compressfile() {
    try { // bit7z classes can throw BitException objects
        using namespace bit7z;
        Bit7zLibrary lib{ "7z.dll" };
        BitFileCompressor compressor{ lib, BitFormat::Zip };
        compressor.setProgressCallback([&](uint64_t in_size) -> bool {
            std::cout << "setProgressCallback: " << in_size << std::endl;
            return true;
            });
        compressor.setFileCallback([&](std::string file_path) {
            std::cout << "filecallback" << file_path << std::endl;
            });
        std::vector<std::string> file_list;
        file_list.push_back("d:\\");
        compressor.compress(file_list, "d://d.zip");
    }
    catch (const bit7z::BitException& ex) {
        std::cout << "Hello World!" << ex.code() << ex.what();
    }
}
int main()
{
    std::cout << "Hello World!\n";
    compressfile();
}

yuzhengjun1204 commented 2 days ago

This code can catch error:3. Debug output bit7z_test.exe (Win32): Loaded "C:\Windows\System32\DriverStore\FileRepository\iigd_dch.inf_amd64_4be767c332df1d04\igdumdim32.dll". Exception at 0x773C590F (ntdll.dll) in bit7z_test.exe: 0xC00000FD: Stack overflow (parameters: 0x00000001, 0x01042FD0).

void compressfile() {
    try { // bit7z classes can throw BitException objects
        using namespace bit7z;
        Bit7zLibrary lib{ "7z.dll" };
        BitFileCompressor compressor{ lib, BitFormat::Zip };
        compressor.setProgressCallback([&](uint64_t in_size) -> bool {
            std::cout << "setProgressCallback: " << in_size << std::endl;
            return true;
            });
        compressor.setFileCallback([&](std::string file_path) {
            std::cout << "filecallback" << file_path << std::endl;
            });
        std::vector<std::string> file_list;
        file_list.push_back("d:\\");
        compressor.compress(file_list, "d://d.zip");
    }
    catch (const bit7z::BitException& ex) {
        std::cout << "error:1" << ex.code() << ex.what();
    }
    catch (const std::exception& e) {
        std::cout << "error:2" << e.what();
    }
    catch (...) {
        std::cout << "error:3:";
    }
}

yuzhengjun1204 commented 2 days ago

It seems that the issue is caused by the recursive call in this code.

void FilesystemIndexer::listDirectoryItems( vector< unique_ptr< GenericInputItem > >& result,
                                            bool recursive,
                                            const fs::path& prefix ) {
    fs::path path = mDirItem.filesystemPath();
    if ( !prefix.empty() ) {
        path = path / prefix;
    }
    const bool includeRootPath = mFilter.empty() ||
                                 !mDirItem.filesystemPath().has_parent_path() ||
                                 mDirItem.inArchivePath().filename() != mDirItem.filesystemName();
    const bool shouldIncludeMatchedItems = mPolicy == FilterPolicy::Include;
    std::error_code error;
    for ( const auto& currentEntry : fs::directory_iterator( path, error ) ) {
        auto searchPath = includeRootPath ? mDirItem.inArchivePath() : fs::path{};
        if ( !prefix.empty() ) {
            searchPath = searchPath.empty() ? prefix : searchPath / prefix;
        }

        const FilesystemItem currentItem{ currentEntry, searchPath, mSymlinkPolicy };
        /* An item matches if:
         *  - Its name matches the wildcard pattern, and
         *  - Either is a file, or we are interested also to include folders in the index.
         *
         * Note: The boolean expression uses short-circuiting to optimize the evaluation. */
        const bool itemMatches = ( !mOnlyFiles || !currentItem.isDir() ) &&
                                 fsutil::wildcard_match( mFilter, currentItem.name() );
        if ( itemMatches == shouldIncludeMatchedItems ) {
            result.emplace_back( std::make_unique< FilesystemItem >( currentItem ) );
        }

        if ( currentItem.isDir() && ( recursive || ( itemMatches == shouldIncludeMatchedItems ) ) ) {
            //currentItem is a directory, and we must list it only if:
            // > indexing is done recursively
            // > indexing is not recursive, but the directory name matched the filter.
            const fs::path nextDir = prefix.empty() ?
                                     currentItem.filesystemName() : prefix / currentItem.filesystemName();
            listDirectoryItems( result, true, nextDir );
        }
    }
}

rikyoz commented 2 days ago

Ah, I see! I was investigating a file size issue, but the problem is actually caused by trying to recursively index a directory structure that is too deeply nested. I'll try rewriting the indexing code to not use recursion. Thanks for the additional details, they really helped!