iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
110 stars 14 forks source link

Multiprocess/multithread `pandora map --discover` issue with GATB graph creation #195

Closed leoisl closed 3 years ago

leoisl commented 4 years ago

We first discuss multiprocessing issues, and then multithreading.

We have issues running several pandora map jobs concurrently with the --discover flag set (a sort of multiprocessing). It is not straightforward to reproduce, but what might happen is when finding paths through a candidate region, a GATB graph is created:

https://github.com/rmcolq/pandora/blob/c16c1fac48ccdf98e0719ea003492c3fd2ec9034/src/denovo_discovery/denovo_discovery.cpp#L35-L38

which triggers the creation of a dummy.h5 file in the working directory. Later on, this file is deleted. This works fine if only one pandora map job is running on the working directory, because this part is not yet multithreaded, so only one GATB graph is created, processed and deleted at a time. However, when running several pandora map jobs with --discover, all these jobs will try to create and remove the dummy.h5 file possibly simultaneously, and this could create race condition on the file system (it is hard to reproduce because it depends on the order of execution of these different jobs). This is the log that we would get in failing pandora map jobs:

[2019-11-15 01:37:20.166592] [0x0000HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
  #000: /hps/nobackup/research/zi/projects/pandora_paper/pandora/pandora/build/ext/gatb-core-1.4.1/gatb-core/thirdparty/hdf5/src/H5F.c line 522 in H5Fcreate
(): unable to create file
    major: File accessibilty
    minor: Unable to open file
  #001: /hps/nobackup/research/zi/projects/pandora_paper/pandora/pandora/build/ext/gatb-core-1.4.1/gatb-core/thirdparty/hdf5/src/H5Fint.c line 1125 in H5F_o
pen(): unable to build actual name
    major: File accessibilty
    minor: Unable to initialize object
  #002: /hps/nobackup/research/zi/projects/pandora_paper/pandora/pandora/build/ext/gatb-core-1.4.1/gatb-core/thirdparty/hdf5/src/H5Fint.c line 1575 in H5F_b
uild_actual_name(): can't retrieve stat info for file
    major: File accessibilty
    minor: Can't get value
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
  #000: /hps/nobackup/research/zi/projects/pandora_paper/pandora/pandora/build/ext/gatb-core-1.4.1/gatb-core/thirdparty/hdf5/src/H5G.c line 454 in H5Gopen2(
): not a location
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: /hps/nobackup/research/zi/projects/pandora_paper/pandora/pandora/build/ext/gatb-core-1.4.1/gatb-core/thirdparty/hdf5/src/H5Gloc.c line 253 in H5G_lo

...
...
...

HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
  #000: /hps/nobackup/research/zi/projects/pandora_paper/pandora/pandora/build/ext/gatb-core-1.4.1/gatb-core/thirdparty/hdf5/src/H5S.c line 883 in H5Sget_simple_extent_dims(): not a dataspace
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
  #000: /hps/nobackup/research/zi/projects/pandora_paper/pandora/pandora/build/ext/gatb-core-1.4.1/gatb-core/thirdparty/hdf5/src/H5S.c line 392 in H5Sclose(): not a dataspace
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
  #000: /hps/nobackup/research/zi/projects/pandora_paper/pandora/pandora/build/ext/gatb-core-1.4.1/gatb-core/thirdparty/hdf5/src/H5D.c line 993 in H5Dset_extent(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
  #000: /hps/nobackup/research/zi/projects/pandora_paper/pandora/pandora/build/ext/gatb-core-1.4.1/gatb-core/thirdparty/hdf5/src/H5D.c line 993 in H5Dset_extent(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.8.18) thread 0:
  #000: /hps/nobackup/research/zi/projects/pandora_paper/pandora/pandora/build/ext/gatb-core-1.4.1/gatb-core/thirdparty/hdf5/src/H5D.c line 993 in H5Dset_extent(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
terminate called after throwing an instance of 'gatb::core::system::Exception'

I know a little bit about this because I have run into this issue in another tool already. This can be easily solved by telling GATB where to store the h5 file (and this will be the solution), but this thread is also to log how to trace the files created by GATB when creating a graph (so we know what we have to do when we decide do multithread this part).

These are the files accessed by GATB when creating a graph using the same parameters we use in pandora (obtained with strace -y -t -e trace=open,close):

13:41:53 close(3</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug>) = 0
13:41:53 close(3</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/dummy.h5>) = 0
13:41:53 close(4</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_Fv9jAVgtASLA29lyjoEShyn9vIUqsqt8r9zUXdjdjqtbffTWmPxa1TN8ggS77WW6_cfp>) = 0
13:41:53 close(4</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_Fv9jAVgtASLA29lyjoEShyn9vIUqsqt8r9zUXdjdjqtbffTWmPxa1TN8ggS77WW6_cfp (deleted)>) = 0
13:41:53 close(4</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_NYjZPaZUukflMueYDwxxpyxjdJMs4EBCoGxeABS2WStZfOZHw862NmHGRzllW9vf_superK_partitions/superKparts.0>) = 0
13:41:53 close(5</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_cAgnxUKr9cNx4uVacvQMNgcNRBaTOKR8k0blFxpGzKNGXBqSl2ppE2edfZEuXKkd_debloom_partitions.h5 (deleted)>) = 0
13:41:53 close(6</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_Fv9jAVgtASLA29lyjoEShyn9vIUqsqt8r9zUXdjdjqtbffTWmPxa1TN8ggS77WW6_cfp>) = 0
13:41:53 close(6</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_kFRiQAgajQfJqpDlqd3NF340t8NwW9ALOqyEUJbySqAvrwFU6CDIfgwFFFH7uKgw_t2_kmers>) = 0
13:41:53 close(6</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_kFRiQAgajQfJqpDlqd3NF340t8NwW9ALOqyEUJbySqAvrwFU6CDIfgwFFFH7uKgw_t2_kmers (deleted)>) = 0
13:41:53 close(7</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_Fv9jAVgtASLA29lyjoEShyn9vIUqsqt8r9zUXdjdjqtbffTWmPxa1TN8ggS77WW6_cfp>) = 0
13:41:53 close(7</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_kFRiQAgajQfJqpDlqd3NF340t8NwW9ALOqyEUJbySqAvrwFU6CDIfgwFFFH7uKgw_t2_kmers (deleted)>) = 0

For multiprocessing, we just need to give GATB a prefix to write the .h5 file, instead of it using the default dummy.h5.

For multithreading, it is a bit more complicated, because we need to ensure each thread create the temporary files (the trashme_* files) with a different prefix. However, GATB's way of getting a temporary filename is based on the process ID:

https://github.com/GATB/gatb-core/blob/2781b0302c9feb60b48d7a142a4ee1873db2fccc/gatb-core/src/gatb/system/impl/FileSystemCommon.cpp#L182-L188

which is the same for each thread. We would need to actually fork GATB, and change this to something that is more robust to multithreading (I already have this implemented ~4 months ago, but I guess it is not finished, so we can continue this development whenever you want).

mbhall88 commented 4 years ago

This issue and #163 seem to cover the same issue. Closing #163 in favour of this issue.

mbhall88 commented 4 years ago

If I understand that snippet of code in GATB correctly, provided we name the GATB graph something like a UUID then we shouldn't have any race conditions?

leoisl commented 4 years ago

We can name the GATB graph using a GATB option (not sure if I remember which one exactly, sth that gives GATB the prefix to the graph). No need to fork/change GATB, but just works if we apply multiprocessing. I think multithreading is better and more efficient. Then all threads when executing has the same process ID, and thus we have the race condition on the temporary files, which are based on the process ID. In this case, we need to fork GATB and give these temporary files a new name. It can be a UUID, or a name based on process + thread ID. It just needs a name that is unique to each thread of the process (or unique everytime you create a new graph).

PS: actually, I just checked my local GATB git repo, and there were some unsaved changes. I tried to fix this issue, but IDK why I stopped. It might have been that it needed more work, but we decided to not continue on it. Or it might have been that I left to finish another day, but then we decided we don't need this at that time, so I just switched to sth else. Anyway, I just committed these local changes and pushed to a branch (https://github.com/leoisl/gatb-core/commit/6ed808e48dbf5144fd2b2682dbbab1658c7cb67c) . Very basic changes, just replaces the name of the temp files from using the process ID to a random 64-char alphanumeric string. It seems I was changing also some tests to see if this worked. I absolutely do not remember if it worked, or not, if it needs further development, etc... this can be seen only as a head start for implementing this

mbhall88 commented 4 years ago

Does this example not name the graph already? http://gatb-core.gforge.inria.fr/doc/api/snippets_graph.html#snippets_kmer_dbg_4

leoisl commented 4 years ago

I think yes... that is what we have in the examples testing multithreading in https://github.com/leoisl/gatb-core/commit/6ed808e48dbf5144fd2b2682dbbab1658c7cb67c#diff-f91f479f4338c9539549dbf7ee6206dfR37-R47

mbhall88 commented 4 years ago

Ok, so I'm confused about why we need to fork GATB then? Can't we just generate a UUID ourselves and specify this -out param when we create the dBG?

leoisl commented 4 years ago

-out param is going to name the graph. e.g., in this example of opened/closed files by GATB:

13:41:53 close(3</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug>) = 0
13:41:53 close(3</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/dummy.h5>) = 0
13:41:53 close(4</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_Fv9jAVgtASLA29lyjoEShyn9vIUqsqt8r9zUXdjdjqtbffTWmPxa1TN8ggS77WW6_cfp>) = 0
13:41:53 close(4</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_Fv9jAVgtASLA29lyjoEShyn9vIUqsqt8r9zUXdjdjqtbffTWmPxa1TN8ggS77WW6_cfp (deleted)>) = 0
13:41:53 close(4</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_NYjZPaZUukflMueYDwxxpyxjdJMs4EBCoGxeABS2WStZfOZHw862NmHGRzllW9vf_superK_partitions/superKparts.0>) = 0
13:41:53 close(5</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_cAgnxUKr9cNx4uVacvQMNgcNRBaTOKR8k0blFxpGzKNGXBqSl2ppE2edfZEuXKkd_debloom_partitions.h5 (deleted)>) = 0
13:41:53 close(6</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_Fv9jAVgtASLA29lyjoEShyn9vIUqsqt8r9zUXdjdjqtbffTWmPxa1TN8ggS77WW6_cfp>) = 0
13:41:53 close(6</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_kFRiQAgajQfJqpDlqd3NF340t8NwW9ALOqyEUJbySqAvrwFU6CDIfgwFFFH7uKgw_t2_kmers>) = 0
13:41:53 close(6</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_kFRiQAgajQfJqpDlqd3NF340t8NwW9ALOqyEUJbySqAvrwFU6CDIfgwFFFH7uKgw_t2_kmers (deleted)>) = 0
13:41:53 close(7</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_Fv9jAVgtASLA29lyjoEShyn9vIUqsqt8r9zUXdjdjqtbffTWmPxa1TN8ggS77WW6_cfp>) = 0
13:41:53 close(7</home/leandro/git/gatb-core-v1.4.1/gatb-core/cmake-build-debug/bin/Debug/trashme_kFRiQAgajQfJqpDlqd3NF340t8NwW9ALOqyEUJbySqAvrwFU6CDIfgwFFFH7uKgw_t2_kmers (deleted)>) = 0

the only thing that -out will be able to rename is the dummy.h5 file to graph_name.h5. All the other temporary files (trashme_* files) will have the same naming. I think it is confusing because the previous log is already using the fix (after trashme_, we have a random 64-char string). Without this fix, all these files would look like (if process id is 8080, for example):

trashme_8080_...

If there are multiple threads trying to create each one a graph simultaneously, all of them will try to write and read to/from the same temporary files, causing race condition in the filesystem, as all threads have the same process ID

mbhall88 commented 4 years ago

Ahhh I see. Makes sense now!

leoisl commented 4 years ago

I asked GATB team about how hard would be to do some changes in GATB for this multithreading to work: https://github.com/GATB/gatb-core/issues/37

mbhall88 commented 4 years ago

I did some digging today and it looks like only one thread is allowed to use the HDF5 library at any one time. The library can be built to be thread-safe though https://support.hdfgroup.org/HDF5/faq/threadsafe.html

I am trying to see if GATB is indeed built with this feature on or not...

leoisl commented 4 years ago

What do you think about this link I posted on the GATB issue https://stackoverflow.com/questions/36308176/using-hdf5-thread-safe-library/47529439#47529439 ? i.e. thread-safe HDF5 lib can be accessed from C, but not from C++

Edit: I looked a bit at GATB code, and they do access HDF5 from C++ but they use the C API it seems... I wonder if this means that we can use thread-safe HDF5 lib... we still have some other issues, like singleton objects, but I guess this a big one that might be not very hard to solve... just testing I guess..

mbhall88 commented 4 years ago

Well the problem I am running into now is that it seems the thread-safe feature can't be used for the static library - which is what GATB are using I think.

leoisl commented 4 years ago

Yeah, I think they indeed use the static library, but look at the pandora code slack channel, hdf5 is not a problem anymore, we can tell GATB to use flat files instead of HDF5... but we have other issues

leoisl commented 3 years ago

Closing, we use threads everywhere, except for GATB, where we multiprocess: https://github.com/rmcolq/pandora/blob/9735340d551d34377d3a3a5dae967841e27580ee/src/denovo_discovery/discover_main.cpp#L250