Closed leoisl closed 3 years ago
This issue and #163 seem to cover the same issue. Closing #163 in favour of this issue.
If I understand that snippet of code in GATB correctly, provided we name the GATB graph something like a UUID then we shouldn't have any race conditions?
We can name the GATB graph using a GATB option (not sure if I remember which one exactly, sth that gives GATB the prefix to the graph). No need to fork/change GATB, but just works if we apply multiprocessing. I think multithreading is better and more efficient. Then all threads when executing has the same process ID, and thus we have the race condition on the temporary files, which are based on the process ID. In this case, we need to fork GATB and give these temporary files a new name. It can be a UUID, or a name based on process + thread ID. It just needs a name that is unique to each thread of the process (or unique everytime you create a new graph).
PS: actually, I just checked my local GATB git repo, and there were some unsaved changes. I tried to fix this issue, but IDK why I stopped. It might have been that it needed more work, but we decided to not continue on it. Or it might have been that I left to finish another day, but then we decided we don't need this at that time, so I just switched to sth else. Anyway, I just committed these local changes and pushed to a branch (https://github.com/leoisl/gatb-core/commit/6ed808e48dbf5144fd2b2682dbbab1658c7cb67c) . Very basic changes, just replaces the name of the temp files from using the process ID to a random 64-char alphanumeric string. It seems I was changing also some tests to see if this worked. I absolutely do not remember if it worked, or not, if it needs further development, etc... this can be seen only as a head start for implementing this
Does this example not name the graph already? http://gatb-core.gforge.inria.fr/doc/api/snippets_graph.html#snippets_kmer_dbg_4
I think yes... that is what we have in the examples testing multithreading in https://github.com/leoisl/gatb-core/commit/6ed808e48dbf5144fd2b2682dbbab1658c7cb67c#diff-f91f479f4338c9539549dbf7ee6206dfR37-R47
Ok, so I'm confused about why we need to fork GATB then? Can't we just generate a UUID ourselves and specify this -out
param when we create the dBG?
-out
param is going to name the graph. e.g., in this example of opened/closed files by GATB:
13:41:53 close(3</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug>) = 0
13:41:53 close(3</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/dummy.h5>) = 0
13:41:53 close(4</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_Fv9jAVgtASLA29lyjoEShyn9vIUqsqt8r9zUXdjdjqtbffTWmPxa1TN8ggS77WW6_cfp>) = 0
13:41:53 close(4</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_Fv9jAVgtASLA29lyjoEShyn9vIUqsqt8r9zUXdjdjqtbffTWmPxa1TN8ggS77WW6_cfp (deleted)>) = 0
13:41:53 close(4</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_NYjZPaZUukflMueYDwxxpyxjdJMs4EBCoGxeABS2WStZfOZHw862NmHGRzllW9vf_superK_partitions/superKparts.0>) = 0
13:41:53 close(5</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_cAgnxUKr9cNx4uVacvQMNgcNRBaTOKR8k0blFxpGzKNGXBqSl2ppE2edfZEuXKkd_debloom_partitions.h5 (deleted)>) = 0
13:41:53 close(6</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_Fv9jAVgtASLA29lyjoEShyn9vIUqsqt8r9zUXdjdjqtbffTWmPxa1TN8ggS77WW6_cfp>) = 0
13:41:53 close(6</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_kFRiQAgajQfJqpDlqd3NF340t8NwW9ALOqyEUJbySqAvrwFU6CDIfgwFFFH7uKgw_t2_kmers>) = 0
13:41:53 close(6</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_kFRiQAgajQfJqpDlqd3NF340t8NwW9ALOqyEUJbySqAvrwFU6CDIfgwFFFH7uKgw_t2_kmers (deleted)>) = 0
13:41:53 close(7</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_Fv9jAVgtASLA29lyjoEShyn9vIUqsqt8r9zUXdjdjqtbffTWmPxa1TN8ggS77WW6_cfp>) = 0
13:41:53 close(7</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_kFRiQAgajQfJqpDlqd3NF340t8NwW9ALOqyEUJbySqAvrwFU6CDIfgwFFFH7uKgw_t2_kmers (deleted)>) = 0
the only thing that -out
will be able to rename is the dummy.h5
file to graph_name.h5
. All the other temporary files (trashme_*
files) will have the same naming. I think it is confusing because the previous log is already using the fix (after trashme_
, we have a random 64-char string). Without this fix, all these files would look like (if process id is 8080, for example):
trashme_8080_...
If there are multiple threads trying to create each one a graph simultaneously, all of them will try to write and read to/from the same temporary files, causing race condition in the filesystem, as all threads have the same process ID
Ahhh I see. Makes sense now!
I asked GATB team about how hard would be to do some changes in GATB for this multithreading to work: https://github.com/GATB/gatb-core/issues/37
I did some digging today and it looks like only one thread is allowed to use the HDF5 library at any one time. The library can be built to be thread-safe though https://support.hdfgroup.org/HDF5/faq/threadsafe.html
I am trying to see if GATB is indeed built with this feature on or not...
What do you think about this link I posted on the GATB issue https://stackoverflow.com/questions/36308176/using-hdf5-thread-safe-library/47529439#47529439 ? i.e. thread-safe HDF5 lib can be accessed from C, but not from C++
Edit: I looked a bit at GATB code, and they do access HDF5 from C++ but they use the C API it seems... I wonder if this means that we can use thread-safe HDF5 lib... we still have some other issues, like singleton objects, but I guess this a big one that might be not very hard to solve... just testing I guess..
Well the problem I am running into now is that it seems the thread-safe feature can't be used for the static library - which is what GATB are using I think.
Yeah, I think they indeed use the static library, but look at the pandora code slack channel, hdf5 is not a problem anymore, we can tell GATB to use flat files instead of HDF5... but we have other issues
Closing, we use threads everywhere, except for GATB
, where we multiprocess: https://github.com/rmcolq/pandora/blob/9735340d551d34377d3a3a5dae967841e27580ee/src/denovo_discovery/discover_main.cpp#L250
We first discuss multiprocessing issues, and then multithreading.
We have issues running several
pandora map
jobs concurrently with the--discover
flag set (a sort of multiprocessing). It is not straightforward to reproduce, but what might happen is when finding paths through a candidate region, a GATB graph is created:https://github.com/rmcolq/pandora/blob/c16c1fac48ccdf98e0719ea003492c3fd2ec9034/src/denovo_discovery/denovo_discovery.cpp#L35-L38
which triggers the creation of a
dummy.h5
file in the working directory. Later on, this file is deleted. This works fine if only onepandora map
job is running on the working directory, because this part is not yet multithreaded, so only one GATB graph is created, processed and deleted at a time. However, when running severalpandora map
jobs with--discover
, all these jobs will try to create and remove thedummy.h5
file possibly simultaneously, and this could create race condition on the file system (it is hard to reproduce because it depends on the order of execution of these different jobs). This is the log that we would get in failingpandora map
jobs:I know a little bit about this because I have run into this issue in another tool already. This can be easily solved by telling GATB where to store the h5 file (and this will be the solution), but this thread is also to log how to trace the files created by GATB when creating a graph (so we know what we have to do when we decide do multithread this part).
These are the files accessed by GATB when creating a graph using the same parameters we use in pandora (obtained with
strace -y -t -e trace=open,close
):For multiprocessing, we just need to give GATB a prefix to write the
.h5
file, instead of it using the defaultdummy.h5
.For multithreading, it is a bit more complicated, because we need to ensure each thread create the temporary files (the
trashme_*
files) with a different prefix. However, GATB's way of getting a temporary filename is based on the process ID:https://github.com/GATB/gatb-core/blob/2781b0302c9feb60b48d7a142a4ee1873db2fccc/gatb-core/src/gatb/system/impl/FileSystemCommon.cpp#L182-L188
which is the same for each thread. We would need to actually fork GATB, and change this to something that is more robust to multithreading (I already have this implemented ~4 months ago, but I guess it is not finished, so we can continue this development whenever you want).