hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License
170 stars 68 forks source link

daos: dcp: always cleanup all resources #379

Open daltonbohning opened 4 years ago

daltonbohning commented 4 years ago

Probably a good idea to do this along with #391, since the code paths will likely need to be retested again anyway.

dcp has some early exits for various errors, invalid arguments, etc. Some of these are DAOS-related, and some are DAOS-agnostic. In either case, it would be good practice to always free all memory and resource that have been allocated. Essentially, anywhere dcp.main() has a return, it would be a good idea to try to cleanup any resources that were in use at the time. Some of these may be harmless, such as strings not being deallocated properly, which the OS might cleanup automatically upon exiting main(), but for the DAOS dfs mounts in particular, there could be other resources in use at the time.

For DAOS-agnostic code This means paths, file list, copy options, etc. The tricky part is to avoid necessarily "copy-pasting" all of the cleanup, since there are currently ~8 mfu_free and *_delete lines at the end of dcp.c, plus the mfu_finalize and MPI_Finalize, and not all of the variables are in scope at every point of return. A possible solution could be to declare all of these variables at the top of main so they are always in scope. Then, we could either have a function that deletes all of these, or use a goto.

For DAOS-specific code This means anything in daos_cleanup. The tricky part here is that we might not know which parts need to be cleaned up at any given time. For example, if we fail to connect to the source pool, we will print an error and exit. But if we simply call daos_cleanup here, it might also try to unmount the destination dfs, disconnect from the destination pool, and close the destination container. But if we couldn't connect to the source pool, then all three of these things might fail, which would result in (1) extra time spent trying to free something that was never allocated (network calls) and (2) extraneous error messages for each failure, which might clutter the error output. Unfortunately, not all of these things can simply use a NULL check. For example, daos_handle_t dst_coh is never NULL, as the corresponding daos_* function expects it to not be NULL. Possible solutions for this sort of thing could be: (1) If, for example, we fail to connect to the source pool, then set the source pool handle to NULL. Then daos_cleanup can perform a NULL check. The drawback here is that we would have to be careful with when/how we set these things to NULL, and it should be done in a clean way. (2) We could store a bool for each variable/structure that has been successfully allocated, assuming a failed daos_* call means it is NOT allocated, and a successful daos_* call means it WAS allocated. Then, we could check these bools in daos_cleanup.

I am more partial to (1), since it would potentially be a cleaner, more standard way to keep track of what is and is not allocated.

daltonbohning commented 4 years ago

@adammoody @dsikich This really isn't extremely important, but rather a sort of "wish list" item to make dcp even more robust. I would appreciate any and all feedback on this.

adammoody commented 4 years ago

Thanks, @daltonbohning . This all sounds good to me.

I generally like to declare variables close to where they are used, since it often makes the code easier to read. However, I'm fine with declaring variables at the top if there is a good reason to, and this would be a good reason.

I'm fine with either approach on the DAOS cleanup, too, so which ever way you think is best is good.

daltonbohning commented 3 years ago

At this point, the DAOS code has been abstracted enough that we can probably put most, if not all, of the cleanup in daos_cleanup()