llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.47k stars 11.77k forks source link

Thrown exception jumps to wrong catch, resulting in incorrectly unhandled exceptions #41566

Open 4b2931fa-bc3a-444f-8f2c-52f0c193794d opened 5 years ago

4b2931fa-bc3a-444f-8f2c-52f0c193794d commented 5 years ago
Bugzilla Link 42221
Version unspecified
OS Windows NT
CC @compnerd,@rnk,@rui314,@smeenai,@zsrkmyn

Extended Description

We've recently seen a fairly reliably reproducible error cause by what seems to be bad exception handling codegen. Unfortunately, the code in question is part of a large complex project and a minimal repro case was not able to be generated.

From observations in the VS debugger, we see the following call stack / code structure and steps when handling exceptions. The following is my best attempt to describe the issue.

The 'entry point' to the stack is in one translation unit:

MetadataStorage.cpp:

std::string MetadataStorage::getString(const std::string& key) {
  std::string sql = dbqueries::MetadataGetValue::getSql();
  auto stmt = db_->prepareRead(sql);                                          #​1
  stmt->bind(dbqueries::MetadataGetValue::PLACEHOLDER_KEY, key);

  if (!stmt->step()) {
    throw MetadataStorage::KeyNotFoundError(key);
  }

  return stmt->getText(dbqueries::MetadataGetValue::COLUMN_VALUE);
}

Call #​1 is where the problem occurs. #​1 goes to SqliteData::prepareRead in the next file, where we call db_.prepare(sql) (#3):

SqliteDatabase.cpp:

SqliteDatabase::SqliteDatabase(const std::string& filename, int flags) try
    : db_(filename, flags),
      dbFilename_(filename) {
} catch (const sqlite::SQLiteException& e) {
  throw clientdb::DatabaseException(e.what(), e.getExtendedErrorCode());      #​2
}

SqliteDatabase::SqliteDatabase(sqlite3* db) : db_(db), dbFilename_("") {}

std::unique_ptr<clientdb::ReadStatement> SqliteDatabase::prepareRead(
    const std::string& sql) {
  try { 
    return std::make_unique<SqliteReadStatement>(db_.prepare(sql));           #&#8203;3
  } catch (const sqlite::SQLiteException& e) {
    throw clientdb::DatabaseException(e.what(), e.getExtendedErrorCode());    #&#8203;4
  }
}

db_.prepare(sql) goes to a third TU: Database.cpp:

Statement Database::prepare(const std::string& sql) {
  sqlite3_stmt* stmt;
  int ret = sqlite3_prepare_v2(db_, sql.c_str(), -1, &stmt, nullptr);
  if (ret != SQLITE_OK) {
    std::string strerr = sqlite3_errmsg(db_);
    int errCode = sqlite3_extended_errcode(db_);
    sqlite3_finalize(stmt);
    throw SQLiteException(strerr, errCode);                                   #&#8203;5
  }
  return Statement(*this, stmt);
}

What we see is the call stack goes roughly like this:

​1

​3

​5

At #​5, SQLiteException is thrown. We would expect the catch in prepareRead to fire and statement #​4 to execute. Instead, the VS debugger shows control passes to #​2 (an identically formed exception handler in the same TU). From here, the exception handling logic eventually calls terminate() and thus abort().

Repros with: lld 7.0.0, lld 8.0.0 Does not repro with: MS link.exe, lld 9.0.0 latest snapshot (see notes below),

Notes:

rnk commented 3 years ago

I think there are two working theories here, both have to do with ICF.

  1. ICF is merging identical code that has distinct .xdata: This would be a bug, we have tests designed to show that LLD does not do this.

  2. ICF merging equivalent catch funclets somehow breaks EH personality invariants.

I think the second is more likely. I believe you said you are using MSVC to compile, and LLD to link. I believe clang-cl uses a single combined .text section for regular code and catch handlers, so it should not be possible to merge catch funclets from different functions with clang.

Then there are the duplicate pdata entries with distinct xdata entries. I can't explain that. ICF shouldn't merge functions associated with differing .pdata/.xdata. Maybe there is something else special about MSVC's compiled code that can explain this.

This is the reduction I'm trying to work with: https://gcc.godbolt.org/z/Ge54js4TM

The catch handlers are identical and may be merged by ICF. You can see that MSVC puts the catch handlers in separate .text$x sections, which could be subject to ICF.

Anyway, this needs to be debugged on Windows, not just with godbolt to get further.

4b2931fa-bc3a-444f-8f2c-52f0c193794d commented 3 years ago

As an update, I tested post-https://github.com/llvm/llvm-project/commit/18a9b180870fcfc98da10ba9027f8eb5d0e7aba5#diff-997c507a3d001706c940347b6e136ac5350af92b727aa38101d810863b51fa48, and lld-link still emits binaries with redundant pdata entries (identical start/end addresses with different xdata addresses). There are no partially overlapping ranges (at least with my test binary with O(50k) entries); if there is a collision it is always for the exact same address range, just a different xdata value.

I'm using the data retrieved from calling dumpbin as below, with the header, footer, and prepended address removed (so just triplets of being/end/xdata):

dumpbin.exe -section:.pdata -rawdata:4,3

I have attached the cleaned up data from lld-link before and after that commit, with ICF enabled, as well as from link.exe with ICF enabled, and the result from lld-link before that commit without ICF enabled, and the python script I used to analyze them, to this gist (as the text files are rather large): https://gist.github.com/akrieger/90e55590adc66ab092f5f15e2a8e8b9d

4b2931fa-bc3a-444f-8f2c-52f0c193794d commented 5 years ago

From doing more digging, it looks like the final .pdata table has multiple entries with identical/overlapping address ranges but different unwind values. That seems like it should be a bug? This could also explain why the sort assumed begin was sufficient to sort by, because the begin addresses should all be unique, and why changing the sort function affected the final result (by reordering the duplicate entries such that the 'correct' entry ends up being found by the EH instead of the wrong one)?

4b2931fa-bc3a-444f-8f2c-52f0c193794d commented 5 years ago

I did more investigation into the problem and came up with either a viable solution or at least some additional information.

I first reverted to a version of lld that reproduces our issue. I made all changes on that version before attempting to reproduce the problem/changes on latest lld code.

My first hypothesis was that ICF was improperly merging sections that should not have been merged. I played around in COFF/ICF.cpp, changed some hashes from 32 to 64 bit, but it did not seem to make a difference.

The next hypothesis I had was a bit of a stretch. The diff which 'fixed' our problem also reversed the order that sections are looked at in various algorithms. So, I looked for other places where section ordering matters. One is the order of .pdata sections, I remember reading in documentation while looking at this.

In COFF/Writer.cpp (https://github.com/llvm-mirror/lld/blob/master/COFF/Writer.cpp#L1805) there is a function that sorts the exception tables. Notably, however, the sort function is 1) parallel, 2) does not seem to be a stable sort (I saw diffs go by which changed many sorts to stable sorts, but not this one), and 3) only considers the start address of the exception handler. Possibly, if the .pdata section was merged in ICF, it is conceivable that there are merged .pdata sections but with different unwind information sections. I changed the sort to return a.unwind < b.unwind if a.begin == b.begin, and that also made the bug stop reproducing. I added additional logging to verify the fallback case for the sort was being hit.

I am still unable to reproduce our problem on the very latest version of lld. It is possible that by changing the sort function for the exception tables I masked the real issue by coincidentally reordering things in a way which 'happens' to work.

rnk commented 5 years ago

I see, thanks. I seem to recall that the MSVC C++ exception handling personality function cares about which funclet is executing. This is probably enough information to come up with a reduced test case and to be able to go from there.

4b2931fa-bc3a-444f-8f2c-52f0c193794d commented 5 years ago

The "top of stack catch() is not catching the exception" is the actual bug, but I improperly focused on "debugger steps to wrong catch statement, program implodes shortly after". Sorry, it's been a long day with lots of building and bisecting lld.

4b2931fa-bc3a-444f-8f2c-52f0c193794d commented 5 years ago

Just to flesh out the matrix:

/OPT:REF /OPT:ICF repro /OPT:REF /OPT:NOICF: norepro /OPT:NOREF /OPT:ICF: norepro /OPT:NOREF /OPT:NOICF: norepro

Seems to only repro with both optimizations enabled. Either individually doesn't cut it.

The problem is that there is a catch higher up in the stack which should be catching the exception, and isn't. This is causing the program to abort() instead of continuing execution.

rnk commented 5 years ago

Wait, sorry, ignore #​2. I didn't find all the right instances of the linker flags I needed to change.

It does not repro if REF and ICF are disabled.

I guess the question is, then, is there any functional problem? I think this is just normal, expected ICF behavior. Catch blocks are separate functions (funclets), so they can get merged just like any other function. It just results in a confusing debugger experience.

4b2931fa-bc3a-444f-8f2c-52f0c193794d commented 5 years ago

Wait, sorry, ignore #​2. I didn't find all the right instances of the linker flags I needed to change.

It does not repro if REF and ICF are disabled.

4b2931fa-bc3a-444f-8f2c-52f0c193794d commented 5 years ago

1) MSVC only 2) Repros with and without ref and icf enabled.

rnk commented 5 years ago
  1. Are you using MSVC or clang as the compiler? I just want to confirm this is limited to LLD.
  2. Does the problem go away if you disable ICF (-opt:noicf,noref)? We have had problems in the past of merging equivalent code for functions with different EH tables. Maybe this is that or another variant of the same.
4b2931fa-bc3a-444f-8f2c-52f0c193794d commented 5 years ago

Ah, something I forgot to mention.

Above the call #​1, we have a catch-all catch:

DatabaseSchema.cpp:

  try {
    if (!libraryMetadata->needsUpdate()) {
      return;
    }

    currentDbVersion = libraryMetadata->getDbSchemaVersion(); #&#8203;0
  } catch (const std::exception& e) {
    ...
  }

​0 is the true top of stack, and we expect this catchall try/catch to catch the exception re-thrown at #​4 after handling the exception thrown at #​5. Both clientdb::DatabaseException and SQLiteException publicly extend std::runtime_error so they should be caught by a std::exception catch block.