Open davidchisnall opened 4 years ago
@wintersteiger , FYI: please keep this in mind when replacing the config data. I hope that we'll get a lot of the fixes in sgxlkl_app_config.c
for free when we move to using JSON and @mikbras's JSON library.
Note that we have SGXLKL_ASSERT
for assertions.
My work is mostly between OE and LKL, i.e. before LKL starts up. Can I use all of OE's musl at those points in the code or just the *alloc functions?
The new json library also depends on libc (e.g. string functions), but I think we can work around them within reasonable effort.
Is anybody emotionally attached to sgxlkl_app_config_t
? It would simplify all of this quite a bit if we just removed it and kept only one config structure (e.g. half-initialized *disk_config_t
's being passed around to mount filesystems etc).
@wintersteiger conceptually it makes sense to me to have a struct that contains the full app_config, the way it has passed attestation. Or do I misunderstand your proposal?
@prp basically, yes. I started with putting the entire configuration into the json configs, I think there will be very few settings that we don't want attested, so I don't think we'll need a separate config for that purpose. And all of the shared memory should go into a separate struct, not into sgxlkl_enclave
.
@wintersteiger ok, I see. So the question is if the parsed JSON app_config goes into a single app_config struct or into many separate structs/variables for the different SGX-LKL components.
Yes, I think we should have a single struct that corresponds to the json config, otherwise it will be very tricky for users/reviewers to see what is in fact attested and what is not. I don't think we'll need an additional unattested config, but if it turns out we do, we should keep it separate (as opposed to the current merge of host, enclave, and app configs).
I agree, having a separate attested config makes a lot of sense, I don't have strong opinions as to whether we want attested vs nonattested config or a load of foo_config_attested
and bar_config_unattested
things.
We don't want to depend on OE's libc, not least because it also pulls in a load of oesyscalls dependencies, which increase our attack surface, but also because (assuming we can make it work at all) we'll end up with two copies of a load of libc things.
I'm happy for us to depend on things that a freestanding C environment would provide.
OE core provides a number of these functions that we can use:
Functions expected in any freestanding C implementation:
memcmp
, memcpy
, memmove
memset
.
oe_
-prefixed versions of some useful things:
oe_strlen
, oe_snprinf
, oe_strchr
, oe_strchrnul
, oe_strcmp
, oe_strcspn
, oe_strdup
, oe_strlcat
, oe_strlcpy
, oe_strlen
, oe_strncmp
, oe_strrchr
, oe_strspn
, oe_strstr
, oe_strtok_r
, oe_strtoul
.
We should add a compatibility header that #define
s their non-oe_
versions to call these and -include
it when building all of the enclave files.
These are used in the new json parser without an oe_*
version: strcpy
, strtoll
, strtod
, strtoull
.
Would it simplify things if we built src/enclave
without sgxlkl-musl and only with oe-musl?
malloc to oe_malloc is going to be a little tricky in a couple places. I'm still tracking it down.
When making some updates from calloc to oe_calloc, strdup to oe_strdup, and malloc to oe_malloc, some of the test suite segfaults. I spent a couple hours today looking at it and it's unclear to me at the moment why. I thought I had missed something important where perhaps memory allocated with oe_malloc was being freed using free either inside or outside of the src/enclave tree. However, that doesn't appear to be the case (although its possible I missed something that is non-obvious).
I'm opening a PR for the changes related to malloc/free/calloc/strdup that are definitely contained to src/enclave and don't cause any segfaults. Baby steps to closing the ticket. Baby steps.
Status update...
This didn't end up where I expected it to. There's a 151_2
branch for this repo, plus a corresponding one for sgx-lkl-musl. It contains the additional calloc, malloc, strdup changes for this issue. However, a couple applications in the test suite (for example dotnet) segfault.
They only segfault in HW mode. Never in SW mode.
The initial thought was "we don't have enough memory and are getting back NULL from a malloc/calloc and that is the issue". The code on the branch contains NULL checks after the new oe_malloc and oe_calloc calls (many of the replaced calls made no such checks). That didn't turn anything up.
The next thought was "we are running out of memory somewhere else because of this change". Given that could happen in a number of locations, I increased various memory limits. You can see these in the branches as well. Same problem.
What exists now on the branches also includes a lot of printf style debugging so we can trace through where things blow up.
Here is what that shows:
I tested this by changing https://github.com/lsds/sgx-lkl/blob/oe_port/src/enclave/enclave_init.c#L209
Setting
libc.user_tls_enabled = sgxlkl_enclave->mode == SW_DEBUG_MODE ? 1 : sgxlkl_enclave->fsgsbase;
so that it is either:
ON:
libc.user_tls_enabled = 1;
OFF:
libc.user_tls_enabled = 0;
results in no segfault when off, and segfault when on.
The printf debugging will show the different code paths that are taken. I've gotten to the point where I know that we get to https://github.com/lsds/sgx-lkl-musl/blob/oe_port/src/env/__init_tls.c#L50 and then segfault.
My working assumption is that assumptions about memory layout related to the app_config that is used for configuration are being made. However, that is just an assumption. There's a lot of nuance in the thread local storage code that is spread out over a number of files that would need to be investigated.
I'm setting aside work on this for now. @davidchisnall believes that when https://github.com/lsds/sgx-lkl/issues/155 is completed, we can do away with the problematic code entirely and replace with a user space implementation of pthreads. If that turns out to be correct then there's not a lot of point in figuring out the specific issue. For that reason, I'm pausing work on the additional malloc/calloc/strdup => oe_malloc/oe_calloc/oe_strdup changes for the time being.
See https://github.com/lsds/sgx-lkl/pull/234/files for additional context on this without having to check out the branches.
(Just in case you haven't come across it before: thread locals may be reinitialized upon ecalls. For some reason this was the default behavior when I last looked at it, but it may not be anymore. It's called the "TCS policy"; see e.g. https://github.com/intel/linux-sgx/issues/283)
@wintersteiger, I believe this is not an issue with the recent OE, which ensures that FS base is preserved across ecalls.
malloc changes are now passing CI. there was a fix in OE that addressed the problem.
Yay, ticks!
__assert_fail is done but I won't tick until its merged.
fprintf/__stderr_FILE is done but I won't tick until its merged
I'm as assignee so I don't lose track of checking what of this has been done once #604 is merged.
Updated to reflect changes from #604 and associated work.
As part of the relayering work, we should ensure that nothing in
src/enclave
depends on libc init. Some of these changes are simple and mechanical:malloc
tooe_malloc
so that we don't need musl's version of malloc to be working before we start.sgxlkl_fail
.Others are more complex.
We can't remove a small set of things that we expect to exist in a freestanding C environment (e.g.
memset
/memcpy
), because the compiler will emit these even if we don't.Checklist:
enclave_event_channel.c
calloc
/free
__assert_fail
enclave_init.c
__init_libc
/__init_tls
/__libc
/__libc_start_init
. This can't be fixed until we can start LKL without any other libc dependencies.malloc
.strdup
wg_get_device
/wg_key_to_base64
/wgu_add_peers
/wgu_list_devices
enclave_mem.c
___errno_location
. We should not be usingerrno
anywhere at this layer.mprotect
should be an explicit call to the host mprotect.lseek
/read
are both used in the syscall replacements. These functions should be moved intosrc/lkl
and should call the relevant LKL functions directly, not go via libc.enclave_oe.c
__init_tls
/_dlstart_c
should be removed after we no longer depend on libc for this layer.enclave_util.c
snprintf
This is used only for tracing. The tracing is LKL-specific and so should be insrc/lkl
. This particular usage can be replaced bystrcpy
/memcpy
because it's just appending to a buffer.sgxlkl_app_config.c
___errno_location
. We should not be usingerrno
anywhere at this layer.fprintf
/__stderr_FILE
should besgxlkl_verbose
/sgxlkl_fail
.malloc
/free
strdup
snprintf
strerror
(We shouldn't be calling anything that setserrno
, so this should go away once the other fixes happen)mpmc_queue.c
assert
(Replaced withSGXLKL_ASSERT
)The full list of dependencies can be seen by running the following command: