rikyoz / bit7z

A C++ static library offering a clean and simple interface to the 7-zip shared libraries.
https://rikyoz.github.io/bit7z
Mozilla Public License 2.0

custom directory structure compress() creating duplicate files #109

Closed fairybow closed 1 year ago

fairybow commented 1 year ago

Whenever I try to create an archive using the custom directory structure compress(), I end up with two copies of each .txt file in the archive lol.

In the following, dataList is a QVector of struct objects which each contain QString archivePath (path-in-archive / alias path) and std::optional<QString> readPath, where if the latter has no value, the entry is a directory (not needed specifically here, but needed in my project-at-large).

Any and all help would be greatly appreciated!

// Fragment from inside a function; needs <filesystem>, <map>, <string>, <QFile>, <QTemporaryDir>, and the bit7z headers.
bit7z::Bit7zLibrary lib{ dllPath.toStdWString() };
bit7z::BitCompressor compressor{ lib, bit7z::BitFormat::SevenZip };
compressor.setCompressionLevel(bit7z::BitCompressionLevel::NONE);

QTemporaryDir temp_dir;
// QString has no operator/, so build the paths with std::filesystem::path instead.
const std::filesystem::path temp_path{ temp_dir.path().toStdWString() };
std::map<std::wstring, std::wstring> files_map; // on-disk path -> path in archive

for (auto& entry : dataList) {

    auto archive_path = entry.archivePath.toStdWString();
    auto temp_sub_path = (temp_path / archive_path).wstring();

    if (!entry.readPath.has_value())
        // No read path: the entry is a directory.
        std::filesystem::create_directories(temp_sub_path);
    else {
        // Make sure the parent folder exists, then copy the file into the temp tree.
        std::filesystem::create_directories(std::filesystem::path(temp_sub_path).parent_path());
        auto q_temp_sub_path = QString::fromStdWString(temp_sub_path);
        QFile::copy(entry.readPath.value(), q_temp_sub_path);
        QFile(q_temp_sub_path).setPermissions(QFile::WriteUser);
    }

    files_map[temp_sub_path] = archive_path;

}

compressor.compress(files_map, writePath.toStdWString());

I can, however, successfully create the archive I want by using compressDirectory() instead, so there's no real problem at the moment. I'm just very curious as to why the former method would be causing duplicate files to be archived.

fairybow commented 1 year ago

As a quick update, I did manage to figure out what was going on (I think).

My files_map here contained an entry for every item to be added, and in this instance it included an entry for the parent directory of the items, like:

temp_dir/base/;
temp_dir/base/subfolder1/textfile1.txt;
temp_dir/base/subfolder1/textfile2.txt;
temp_dir/base/subfolder2/textfile3.txt;
temp_dir/base/subfolder2/textfile4.txt;
temp_dir/base/subfolder3/textfile5.txt;
temp_dir/base/subfolder3/textfile6.txt;

Removing the entry for "base" fixes the issue and leaves one copy of each text file. With the entry included, every file under "base" gets copied into the archive when that first "base" folder is added (because the folder isn't empty), and then each file gets copied again when its own entry is reached.

The way my project will handle updating an archive will probably be via a stored vector of changes-in-progress, like dirs and files renamed or moved or added/deleted, and so consequently, if a user adds an empty dir, then an item for that would be added to the dataList to be archived on project save. So, while simply using compressDirectory() here would work, I may run into this again, when a user adds an empty folder, and then a subsequent file underneath that folder (resulting in 2 entries in the dataList).

So, I think my issue in this instance would be best fixed (and this could possibly be something that I should just implement in my own project's handling of archiving actions using bit7z?) by doing some sort of check on the dataList before archiving to see if any formerly empty dirs have further entries that suggest they are now the proud parents of text files?

Hopefully that helps! I think it's genuinely possible I'm just dumb also and that this is normal behavior for archiving lol.

But, if you happen to have any ideas off-hand for how to run that above-mentioned check on the files_map I would be very grateful.

Thanks for all your hard work on this project! It's been really great working with it and being able to so easily use 7zip functionality in my project.

rikyoz commented 1 year ago

Hi! Sorry for the late reply!

> My files_map here contained an entry for every item to be added, and in this instance it included an entry for the parent directory of the items

> Removing the entry for "base" fixes the issue and leaves one copy of each text file. With the entry included, every file under "base" gets copied into the archive when that first "base" folder is added (because the folder isn't empty), and then each file gets copied again when its own entry is reached.

> I think it's genuinely possible I'm just dumb also and that this is normal behavior for archiving lol.

Yeah, actually this is the expected behavior of the compress method that takes a map as the first argument. It just treats all the entries in the map as different elements; it doesn't try to be "smart". Thus, it first adds the directory and the files it contains; then, it adds all the other files in the map (which in your case happen to be the same). Finally, since bit7z v3.x always appends new items rather than replacing existing ones, you end up with duplicate items.
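To make the mechanism concrete, here is a toy model of that behavior in plain C++. It is not bit7z code: `count_archived_copies`, `disk_files`, and `requested` are hypothetical names, and the model assumes only what is described above, namely that each requested entry is expanded on its own and the results are appended without de-duplication.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model, NOT bit7z code. `disk_files` stands in for the files on disk and
// `requested` for the keys of the map passed to compress().
// Returns how many times each file would end up in the archive.
std::map<std::string, int> count_archived_copies(
    const std::vector<std::string>& disk_files,
    const std::vector<std::string>& requested) {
    std::map<std::string, int> copies;
    for (const std::string& entry : requested) {
        if (entry.empty()) continue;
        // A directory entry pulls in everything beneath it.
        const std::string prefix = entry.back() == '/' ? entry : entry + '/';
        for (const std::string& file : disk_files) {
            const bool is_entry_itself = (file == entry);
            const bool is_inside_entry = file.compare(0, prefix.size(), prefix) == 0;
            // Appended unconditionally: nothing checks for an earlier copy.
            if (is_entry_itself || is_inside_entry) ++copies[file];
        }
    }
    return copies;
}
```

With a request list of "base/" plus the individual files beneath it, every file is counted twice in this model; without the "base/" entry, each is counted once.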

> The way my project will handle updating an archive will probably be via a stored vector of changes-in-progress, like dirs and files renamed or moved or added/deleted, and so consequently, if a user adds an empty dir, then an item for that would be added to the dataList to be archived on project save. So, while simply using compressDirectory() here would work, I may run into this again, when a user adds an empty folder, and then a subsequent file underneath that folder (resulting in 2 entries in the dataList).

> So, I think my issue in this instance would be best fixed (and this could possibly be something that I should just implement in my own project's handling of archiving actions using bit7z?) by doing some sort of check on the dataList before archiving to see if any formerly empty dirs have further entries that suggest they are now the proud parents of text files?

> But, if you happen to have any ideas off-hand for how to run that above-mentioned check on the files_map I would be very grateful.

If I understand your use case correctly, one possible solution would be to always exclude/ignore non-empty folders in the files_map and always add either empty folders or files. However, in this case, you may have duplicates of the empty folder (if you add it multiple times). Alternatively, you may also check the content of the already existing archive via BitArchiveInfo, and then decide how to update it.
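A minimal sketch of the first option, assuming the map keys are on-disk paths that already exist when the filter runs (as with the temporary-directory approach in the original post). The helper name `keep_files_and_empty_dirs` and the `PathMap` alias are made up for illustration; only the C++17 standard library is used.

```cpp
#include <cassert>
#include <filesystem>
#include <map>
#include <string>
#include <system_error>

namespace fs = std::filesystem;
using PathMap = std::map<std::wstring, std::wstring>; // on-disk path -> path in archive

// Hypothetical helper: returns a copy of the map without the entries that
// point at non-empty directories. Files and empty directories pass through.
PathMap keep_files_and_empty_dirs(const PathMap& files_map) {
    PathMap result;
    for (const auto& [path, alias] : files_map) {
        const fs::path p{path};
        std::error_code ec; // non-throwing overloads: errors count as "not a non-empty dir"
        const bool non_empty_dir = fs::is_directory(p, ec) && !fs::is_empty(p, ec);
        if (!non_empty_dir) result.emplace(path, alias);
    }
    return result;
}
```

The filtered map would then be passed to compress() in place of the original one. Because the check looks at the disk rather than at the map, a folder is also skipped when it contains files that have no entry of their own, so this only fits if every file to be archived is listed explicitly.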

The next version of bit7z (which is in beta) might simplify your use case a bit, since it lets you specify that existing files in the updated archive should always be overwritten rather than appended, which is what generates the duplicates.

> Thanks for all your hard work on this project! It's been really great working with it and being able to so easily use 7zip functionality in my project.

You're welcome! And thank you for using and appreciating it!

fairybow commented 1 year ago

Thanks for the reply! I think the next version sounds like it will definitely be useful in my case, and in the meantime I can figure out a way to conditionally include dirs in the files_map if they don't have children but leave them out otherwise.
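For anyone landing here later, a rough sketch of that conditional check, done on the entry list before anything is written to disk. `Entry` is a hypothetical stand-in for the project's struct (std types instead of QString), `drop_parent_dirs` and `is_beneath` are made-up names, and archive paths are assumed to use '/' as the separator.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the entries described in the first post.
struct Entry {
    std::wstring archivePath;              // path inside the archive, '/'-separated
    std::optional<std::wstring> readPath;  // no value means the entry is a directory
};

// True if `path` lies somewhere beneath the directory `dir`.
bool is_beneath(const std::wstring& path, std::wstring dir) {
    while (!dir.empty() && dir.back() == L'/') dir.pop_back(); // ignore trailing slashes
    dir += L'/';
    return path.size() > dir.size() && path.compare(0, dir.size(), dir) == 0;
}

// Keeps every file, and keeps a directory only if no other entry lives under it.
std::vector<Entry> drop_parent_dirs(const std::vector<Entry>& entries) {
    std::vector<Entry> kept;
    for (const Entry& entry : entries) {
        const bool is_dir = !entry.readPath.has_value();
        const bool has_child = is_dir && std::any_of(
            entries.begin(), entries.end(), [&](const Entry& other) {
                return is_beneath(other.archivePath, entry.archivePath);
            });
        if (!has_child) kept.push_back(entry);
    }
    return kept;
}
```

This is quadratic in the number of entries, which should be fine for a typical project; sorting the paths first would let each directory be checked against its immediate successor only.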