shadow / shadow-plugin-tor

A Shadow plug-in that runs the Tor anonymity software
https://shadow.github.io

OpenSSL 1.1.0i and later segfaults #92

Open florian-vuillemot opened 4 years ago

florian-vuillemot commented 4 years ago

The OpenSSL dependency seems to have moved from https://www.openssl.org/source/openssl-1.0.1e.tar.gz to https://openssl.org/source/old/1.0.1/openssl-1.0.1.tar.gz.

ERROR '/home/user/Documents/shadow/shadow-plugin-tor/build/openssl-1.0.1e.tar.gz' is not a tarfile
2020-04-04 11:53:53,721 INFO returning code '-1'

florian-vuillemot commented 4 years ago

I provided a patch so that users can continue building. But maybe we should think about migrating to a newer version of OpenSSL?

robgjansen commented 4 years ago

Yeah, it would be nice to use the new version of OpenSSL. I pushed a PR here: https://github.com/shadow/shadow-plugin-tor/pull/94/files

@florian-vuillemot could you test this new version of OpenSSL on your local machine to see if it works?

robgjansen commented 4 years ago

Hmm, OK it looks like that new version of OpenSSL failed to pass the CI tests: https://github.com/shadow/shadow-plugin-tor/pull/94

So maybe we should first accept the PR for the old version, and then figure out the issues with the newer version.

robgjansen commented 4 years ago

I merged the URL change in #93, which continues to use openssl v1.0.1e. Let's use this issue to track updating to something newer, e.g., openssl v1.1.1f.

sporksmith commented 4 years ago

Bumping OpenSSL alone as in #94 causes libevent to fail to compile. Bumping libevent as well (to 2.1.11) gets it to compile, but the GitHub CI run then segfaults. Locally I don't get a segfault, but the simulated processes abort.

Continuing to debug - going to try changing Shadow's emulated abort to abort for real so that I can get a core dump.

Alternatively, I suppose it'd be nice if we could get a debuggable core dump from the GitHub run. We'd need to grab the core dump itself, the compiled binaries, and any relevant compiled libraries.

sporksmith commented 4 years ago

gdb hangs trying to load the core.

Running shadow under gdb I was able to get a stack trace. gdb just gives addresses without symbols, but cross-referencing with /proc/x/maps, it looks like the elf loader is involved. Perhaps best to punt on this pending https://github.com/shadow/shadow/issues/738.

Btw I also tried commenting out the crypto overrides - in that case the simulation didn't segfault but seemed to hang for a while and then fail.

I also fixed some type errors in those overrides; that didn't seem to make a difference but I'll send a PR to incorporate them.

jtracey commented 4 years ago

> gdb hangs trying to load the core.

It's likely trying to load each plugin with a scan of the entire linkmap, resulting in quadratic time. It's a known issue with upstream gdb we never got around to fixing.

> Running shadow under gdb I was able to get a stack trace. gdb just gives addresses without symbols, but cross-referencing with /proc/x/maps, it looks like the elf loader is involved.

You can find instructions for working with gdb in the Shadow documentation. Basically, you have to use elf-loader functions to only load the symbols you need, otherwise the quadratic operation I mentioned makes it unusable. Let me know if you have any questions.

> Perhaps best to punt on this pending shadow/shadow#738.

As a heads up, debugging is actually likely to get more difficult after moving to multi-process. That's actually the primary reason why NS3's DCE opted for creating elf-loader instead of going multi-process. You can read more about that in this paper from them.

sporksmith commented 4 years ago

> It's likely trying to load each plugin with a scan of the entire linkmap, resulting in quadratic time. It's a known issue with upstream gdb we never got around to fixing.

Ah, good to know.

> Running shadow under gdb I was able to get a stack trace. gdb just gives addresses without symbols, but cross-referencing with /proc/x/maps, it looks like the elf loader is involved.

> You can find instructions for working with gdb in the Shadow documentation. Basically, you have to use elf-loader functions to only load the symbols you need, otherwise the quadratic operation I mentioned makes it unusable. Let me know if you have any questions.

Oh cool, I'll take another look using the bt_load helper.

> Perhaps best to punt on this pending shadow/shadow#738.

> As a heads up, debugging is actually likely to get more difficult after moving to multi-process. That's actually the primary reason why NS3's DCE opted for creating elf-loader instead of going multi-process. You can read more about that in this paper from them.

Fair enough; the linked workarounds should make debugging in the current mode better than I thought, and yes going to multiprocess certainly adds new complexities :).

Thanks!

sporksmith commented 4 years ago

I need to stop for today, but it looks like RSA_new is returning NULL.

sporksmith commented 4 years ago

@jtracey any idea why setting breakpoints inside plugin code wouldn't work as expected?

I tried first setting a breakpoint at process_emu_read, since the plugin should have loaded by the time we hit that. Once there I turn on locking, run "p vdl_linkmap_abi_update()", set a breakpoint at RSA_new, turn off locking, and continue. The breakpoint doesn't seem to trigger though, and I get stopped again at an abort, with ~RSA_new~ crypto_pk_new in the call stack just after returning from RSA_new.

As a workaround going to try adding a raise(SIGSTOP) in the source...
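A minimal sketch of what such a stop point could look like, assuming it is compiled into the code being debugged. The helper name `debug_stop_if_requested` and the environment variable used to enable it are hypothetical and do not come from Shadow, Tor, or OpenSSL; only `raise(SIGSTOP)` is taken from the comment above. How signals and `getenv` behave for a plugin running inside Shadow is not verified here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper (not part of Shadow, Tor, or OpenSSL).
 * If the environment variable named by env_name is set, print the PID and
 * stop the calling process with SIGSTOP so a debugger can inspect it.
 * Returns 1 if the stop was requested (execution resumes here after the
 * process is continued), or 0 if the variable was not set. */
int debug_stop_if_requested(const char *env_name)
{
    assert(env_name != NULL);

    if (getenv(env_name) == NULL) {
        return 0; /* stop point disabled: behave as a no-op */
    }

    fprintf(stderr, "pid %ld stopping for debugger\n", (long)getpid());
    raise(SIGSTOP); /* a debugger that is already attached regains control here */
    return 1;
}
```

Calling something like `debug_stop_if_requested("PLUGIN_DEBUG_STOP")` just before the suspect call would give a stop at a known point without needing gdb to resolve the breakpoint address first. Gating on an environment variable is only a design choice for this sketch, so that the stop can be left in the source without affecting normal runs.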

jtracey commented 4 years ago

Breakpoints inside plugins get tricky because under the hood of gdb, breakpoints don't actually apply to symbol names or symbols, they apply to memory addresses (specifically, they modify the code stored at the location of that breakpoint). Under normal compiles, each instance of the same plugin shares code pages, to conserve memory. But with debug builds, we give each plugin its own address space, so you can debug multiple instances of a plugin independently. If that's not the behavior you want, and you'd rather not modify the source like you said, you can try removing these #ifndefs:

https://github.com/shadow/shadow/blob/08035fcfe30375ff53d43dd809305c897f127d16/src/external/elf-loader/vdl-map.c#L604 and https://github.com/shadow/shadow/blob/08035fcfe30375ff53d43dd809305c897f127d16/src/external/elf-loader/vdl-map.c#L722

If you do that, then modifying a code page (e.g., adding a breakpoint) in one instance of a plugin applies the change to all instances. However, you still can't add a breakpoint until after the plugin has been loaded during the current execution (the specific plugin with the current behavior, or the first plugin if you make those changes); otherwise gdb doesn't know where the address is.

sporksmith commented 4 years ago

@jtracey thanks, makes sense.

I pivoted a bit and binary-searched OpenSSL to find the change that broke us. The last release that works is 1.1.0h, which I moved us to in #98 (I accidentally left the wrong version # in the squashed commit message).

I then did a git-bisect between 1.1.0h and 1.1.0i and found the exact commit that broke us: https://github.com/openssl/openssl/commit/bf21fe935a979c08292d06553ef8c9a49382208c

Based on that diff it seems the most likely issues are global initialization order or thread local storage.
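To illustrate what an initialization-order problem can look like in general, here is a small sketch. It is not a diagnosis of the OpenSSL commit and does not reproduce its code. All names are hypothetical, and it relies on the GCC/Clang `constructor` attribute with priorities, which is a compiler extension rather than standard C.

```c
/* All names below are hypothetical; none come from OpenSSL, Tor, or Shadow. */

static int table_ready = 0;     /* set to 1 by the "library" initializer */
static int seen_by_early = -1;  /* what the early caller observed */

/* Lower priority numbers run first, so this constructor runs before
 * init_table() and observes the table while it is still uninitialized. */
__attribute__((constructor(101)))
static void early_caller(void)
{
    seen_by_early = table_ready;
}

/* The initializer that the early caller implicitly depended on. */
__attribute__((constructor(102)))
static void init_table(void)
{
    table_ready = 1;
}

/* Accessors so the outcome can be inspected after startup. */
int early_caller_saw_ready(void) { return seen_by_early; }
int table_is_ready_now(void)     { return table_ready; }
```

By the time `main` runs, `table_is_ready_now()` reports 1, yet `early_caller_saw_ready()` reports 0: the state looks fine afterwards even though an earlier user saw it uninitialized. A loader that changes the order in which initializers run could expose this kind of dependency, which is one reason the two suspects named above are plausible starting points.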