melrom commented 6 years ago

In common_dataspaces.c, I would like to propose fail & exit for the following errors:

#ifdef DS_HAVE_DIMES
static int is_dimes_lib_init()
{
    if (!dimes_c) {
        uloga("ERROR: dimes library was not properly initialized!\n");
        return 0;
    }
    return 1;
}
#endif

static int is_dspaces_lib_init() {
    if (!dcg) {
        uloga("ERROR: dspaces library was not properly initialized!\n");
        return 0;
    }
    return 1;
}

static int is_ndim_within_bound(int ndim) {
    if (ndim > BBOX_MAX_NDIM) {
        uloga("ERROR: maximum number of array dimension supported is %d "
            "but ndim is %d\n", BBOX_MAX_NDIM, ndim);
        return 0;
    }
    return 1;
}

They are all catastrophic errors that DataSpaces cannot recover from. My proposition is to do the following;

static int is_dspaces_lib_init() {
    if (!dcg) {
        uloga("ERROR: dspaces library was not properly initialized!\n");
        lib_exit(); 
        return 0;
    }
    return 1;
}

The only reason that I can imagine that we do not cause DataSpaces to fail in the event of these types of error is so that a simulation code that is coupled with DataSpaces would continue to run anyways even though the DataSpaces piece would not be working. Thus, it might generate some relevant file information even though whatever the sim was planning on doing with intermediate data would not work.

Either way, I think users would really rather have DataSpaces fail gracefully here and fail out their job rather than have everything run and not work. Making this a ticket because I was wondering if anyone has a different opinion for why we should continue running in the event of these catastrophic errors.

philip-davis commented 6 years ago

I agree that error handling is something we need to address. Currently many errors leave the system in an undefined state. This is not good, and we should handle errors more gracefully.

I do not think we should exit, for the reasons you mentioned and also because the application code may have fall back options it can use even in the presence of a catastrophic dataspaces error.

A function to dump the warp core makes sense to me, and I think we need to decide what our claims are going to be about the state of the system once we exit. Are we going to shift to a claim that the state of the system is defined on certain exit conditions? All errors? Some errors? What level of guarantee we are giving about the state of the system after a failure has occurred will have implications for what we do in the exit routine. I think this is actually a significant piece of the semantics, as it affects what our users are confident doing in their applications own error handling.

I suggest the name dspaces_panic(). I think it conveys what is happening.

parasharmanish commented 6 years ago

I agree Philip. It is just as important to clearly document what this is.

Manish

--

Manish Parashar Office: CoRE Bldg, Rm 628 RDI2/Dept. of Computer Science Phone: (848) 445-5388 Rutgers University Fax: (732) 445-0537 110 Frelinghuysen Road Email: parashar at rutgers dot edu Piscataway, NJ 08854-8019 WWW: http://parashar.rutgers.edu

On Jan 26, 2018, at 10:14 PM, Philip Davis notifications@github.com<mailto:notifications@github.com> wrote:

I agree that error handling is something we need to address. Currently many errors leave the system in an undefined state. This is not good, and we should handle errors more gracefully.

I do not think we should exit, for the reasons you mentioned and also because the application code may have fall back options it can use even in the presence of a catastrophic dataspaces error.

A function to dump the warp core makes sense to me, and I think we need to decide what our claims are going to be about the state of the system once we exit. Are we going to shift to a claim that the state of the system is defined on certain exit conditions? All errors? Some errors? What level of guarantee we are giving about the state of the system after a failure has occurred will have implications for what we do in the exit routine. I think this is actually a significant piece of the semantics, as it affects what our users are confident doing in their applications own error handling.

I suggest the name dspaces_panic(). I think it conveys what is happening.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fphilip-davis%2Fdataspaces%2Fissues%2F34%23issuecomment-360955731&data=02%7C01%7Cparashar%40ored.rutgers.edu%7Ccbd597fd8ac443adedc108d565341ed5%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C636526196903840029&sdata=ngVN7ukjyGkwN7NeO%2FNz2QnkbasPjzkFKiNQ74FW2zA%3D&reserved=0, or mute the threadhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAIvVmdyG22fmdUl9iolVUZH33ZCYUpQ4ks5tOpSkgaJpZM4Rqqda&data=02%7C01%7Cparashar%40ored.rutgers.edu%7Ccbd597fd8ac443adedc108d565341ed5%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C636526196903840029&sdata=63lIshT1NjQB7e%2Fn2AIBUdBYUTwW%2F%2BhR8McmZg2D2ls%3D&reserved=0.

philip-davis / dataspaces

Failing on Errors #34

--

Manish Parashar Office: CoRE Bldg, Rm 628 RDI2/Dept. of Computer Science Phone: (848) 445-5388 Rutgers University Fax: (732) 445-0537 110 Frelinghuysen Road Email: parashar at rutgers dot edu Piscataway, NJ 08854-8019 WWW: http://parashar.rutgers.edu