openzim / libzim

Reference implementation of the ZIM specification
https://download.openzim.org/release/libzim/
GNU General Public License v2.0
163 stars 47 forks source link

Catch Xapian::DatabaseError in libzim code. #873

Closed mgautierfr closed 3 months ago

mgautierfr commented 3 months ago

Either throw a ZimFileFormatError or directly handle the error (mainly returning default value or ignoring corrupted databases)

Fix #870

kelson42 commented 3 months ago

What about xapian::Error? It should be caught as weel!?

mgautierfr commented 3 months ago

What about xapian::Error? It should be caught as weel!?

If we have Xapian error, it is because of a bug in our code. Not because of an external cause (as a corrupted database). So it should be handle the same way than any other bug in our code : Fix it.

mgautierfr commented 3 months ago

@veloman-yunkan This PR is the last blocking for next release. Do you agree with the following proposition[1]:

Libzim must always throw exception inheriting from zim::ZimFileFormatError for corrupted file/database (or from std::runtime_exception for other errors). We also may return default value in case of corrupted file.

When we return a default value or throw an exception is open to discussion (and can involve in time).

If both @veloman-yunkan and @kelson42 agree, I suggest we merge this PR soon (assuming it actually fix Xapian exception handling to comply with [1]) to make the release and discuss about dubious default value in separated issue.

veloman-yunkan commented 3 months ago

I don't have any objections against merging this PR if it eliminates some roadblock. My immediate concern is that these changes - though quite simple - are untested (at least, not by the unit tests).

Regarding the conversion of Xapian::DatabaseError into zim::ZimFileFormatError - I can imagine situations where the result might appear somewhat confusing. Since the Xapian library is a 3rd party dependency for libzim (we don't control its development or the version used in a dynamically linked environment) we may report a ZimFileFormatError because of a buggy (e.g. corrupted) xapian lib. I understand that extreme corner cases like this one shouldn't count (since anything can happen and there is no 100% protection against every possible mishap). But to me intercepting and re-labeling exceptions originating in a 3rd party dependency feels a little like taking responsibility over the internals of the latter.

Having written the above in general terms I tried to check how it applies to other 3rd party dependencies. Consider that a compressed cluster is corrupted. Currently 3rd party decompression code (lzma or zstd) will likely detect an error in C fashion, libzim will check it and throw an std::runtime_error (see https://github.com/openzim/libzim/blob/9.1.0/src/compression.cpp#L53-L57, https://github.com/openzim/libzim/blob/9.1.0/src/compression.cpp#L126-L128) rather than a zim::ZimFileFormatError. This situation is a little different since we have to deal with the conversion of the error handling method (C-style to C++ style), but at least it falls short of the proposed goal that "Libzim must always throw exception inheriting from zim::ZimFileFormatError for corrupted file/database", so we may have to fix that one too.

mgautierfr commented 3 months ago

I don't have any objections against merging this PR if it eliminates some roadblock. My immediate concern is that these changes - though quite simple - are untested (at least, not by the unit tests).

Agree. Even if not easy to test. (We should probably investigate fuzz testing in a general way)

Regarding the conversion of Xapian::DatabaseError into zim::ZimFileFormatError - I can imagine situations where the result might appear somewhat confusing. Since the Xapian library is a 3rd party dependency for libzim (we don't control its development or the version used in a dynamically linked environment) we may report a ZimFileFormatError because of a buggy (e.g. corrupted) xapian lib. I understand that extreme corner cases like this one shouldn't count (since anything can happen and there is no 100% protection against every possible mishap). But to me intercepting and re-labeling exceptions originating in a 3rd party dependency feels a little like taking responsibility over the internals of the latter.

Well, it is the same case for any of our third party dependencies. If zstd is buggy and cannot uncompress valid content, we cannot really recover from that. The only solution to detect if data is corrupted would be to check the zim archive (using checksum), but we will not handle it at the decompression (nor xapian search) level.

I see two ways here:

I tend to follow the first path. We all know that all code can be buggy, but when we use it we first assume it is not. And we already take the responsibility to use it, we can (and should) take the responsibility on the error handling and labeling.

Having written the above in general terms I tried to check how it applies to other 3rd party dependencies. Consider that a compressed cluster is corrupted. Currently 3rd party decompression code (lzma or zstd) will likely detect an error in C fashion, libzim will check it and throw an std::runtime_error (see 9.1.0/src/compression.cpp#L53-L57, 9.1.0/src/compression.cpp#L126-L128) rather than a zim::ZimFileFormatError. This situation is a little different since we have to deal with the conversion of the error handling method (C-style to C++ style), but at least it falls short of the proposed goal that "Libzim must always throw exception inheriting from zim::ZimFileFormatError for corrupted file/database", so we may have to fix that one too.

I agree here. We should rethink this too to be consistent.


Please note that the original issue is that Xapian::Error do not inherit from std::exception. So all code using libzim must currently also catch Xapian::Error exception. Which is tricky has xapian is an internal dependency. When publishing prebuild libzim, we don't provide any xapian headers.

I have my preference about what to throw (explain above) but in itself, it is not such important. What is important is that user don't have to depend/link on xapian to be able to handle errors coming from our code (or called by our code).