seqan / seqan3

The modern C++ library for sequence analysis. Contains version 3 of the library and API docs.
https://www.seqan.de
Other
396 stars 81 forks source link

Sporadic failures in FM-index creation in a multi-threading context #3258

Closed behrj closed 1 week ago

behrj commented 1 month ago

Does this problem persist on the current main?

Is there an existing issue for this?

Current Behavior

I am using the FM index to search for kmer hits in a large number of regions of a few kb up to 3Mb size. This is done in a multi-threading context. (Please advise if you think this is not supported / if you have a better idea for this purpose). Running the c++ binary 100 times on a given input file gave me 1 failure. Increasing the number of threads I can increase the failure rate up to 7%. I believe this is related to the construction of the int_vector here: https://github.com/xxsds/sdsl-lite/blob/d54d38908a14745eb93fd5304fc9b2b9c2542ee9/include/sdsl/construct.hpp#L71 Which creates the name of an in-memory file using the current pid and an id which is created here: https://github.com/xxsds/sdsl-lite/blob/d54d38908a14745eb93fd5304fc9b2b9c2542ee9/include/sdsl/util.hpp#L355

The id is not thread save, so the name of the in-memory file may clash between threads.

Expected Behavior

I'd hope to be able to use this in a multi-threading context.

Steps To Reproduce

I did not break this down to a minimal executable, but I believe that constructing FM indices in a multi-treaded context should result in such clashes.

Environment

- Operating system: Both MacOS ARM as well as Ubuntu Linux 
- SeqAn version: Currently using 3.2.0
- Compiler: g++-12

Anything else?

I am currently testing this patch:

345,349d344
< struct _id_helper_struct
< {
<     uint64_t id = 0;
< };
<
352,353c347,348
<     static _id_helper_struct data;
<     return data.id++;
---
>     static std::atomic<uint64_t> id{0};
>     return id++;

on this file seqan3/submodules/sdsl-lite/include/sdsl/util.hpp And it seems to resolve the issue.

rrahn commented 1 month ago

Hi @behrj, thanks for reaching out to us. I would need to investigate what is actually happening inside of sdsl. But just to make sure, you are creating multiple FM-indices in parallel?

behrj commented 1 month ago

Hi @behrj, thanks for reaching out to us. I would need to investigate what is actually happening inside of sdsl. But just to make sure, you are creating multiple FM-indices in parallel?

Yes, that is right. Each thread creates its own FM index and searches for kmers in a specific genomic region. I might be misusing the tool in a way, but it does serve me well so far.

SGSSGene commented 1 month ago

@behrj very nice find. Yes that is definite a bug in the SDSL and your patch seems to be a valid way around it.

behrj commented 1 month ago

I can confirm that the patch works fine for me. I've seen no more failures with this running the very same setup that had a 7% failure rate.

eseiler commented 1 month ago

Yes, definitely a bug.

For the fix, we should probably make it global or something? If it's a static variable, different TUs might still clash.

Edit: It's the other way around, isn't it? Static (global) variables are TU specific, static class members are shared across TU? Anyway, we should use whatever version is shared across TUs :)

I'll have to update the sdsl for gcc-14 and clang 18, so I'll do the fix at the latest then. Will only have time next week, though.

SGSSGene commented 1 month ago

Yes, your edit is correct. The variable is already shared across TUs. @behrj analysis is spot on. The issue is that multiple threads increment and read the same variable at the same time. This is a race condition with the possibility to produce the same value in different threads.

Using an atomic will avoid and fix this race condition.

eseiler commented 1 month ago

Good, then it's settled. I'll also check whether the thread sanitizer catches that - I don't think there are multithreaded tests. I'll also check clang tidy. I have both set up for other repos, so it shouldn't be much work.

rrahn commented 1 month ago

@SGSSGene thanks for replying so quickly. I would also go for the suggested fix. Maybe we can even remove the extern keyword? I don't see why this matters as all inline and non-inline functions have external linkage by defaut (at least in c++). Or may there be side effects I am not seeing?

rrahn commented 1 month ago

https://github.com/xxsds/sdsl-lite/blob/d54d38908a14745eb93fd5304fc9b2b9c2542ee9/include/sdsl/util.hpp#L353C24-L353C34

That of _id_helper?

SGSSGene commented 1 month ago

https://github.com/xxsds/sdsl-lite/blob/d54d38908a14745eb93fd5304fc9b2b9c2542ee9/include/sdsl/util.hpp#L353C24-L353C34

That of _id_helper?

Ah, yes. That extern is not required.