pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License
196 stars 40 forks source link

Race condition in ".pathnames.iv" file in position command #493

Open ASLeonard opened 1 year ago

ASLeonard commented 1 year ago

Hi, Probably not a common case, but querying positions in a graph in an outer parallel loop is buggy, because of this line writing a local pathname file with a fixed name.

https://github.com/pangenome/odgi/blob/2c9a17f0ce1f72382c92b0c84cc334f0d05b7c48/src/algorithms/xp.cpp#L113

Running parallel -j 1 ... fixed the issue for me, so pretty sure it is exactly due to a race condition on this line. It seems there is another similar case here.

Best, Alex

subwaystation commented 1 year ago

Hi @ASLeonard,

the code link you sent is not run in parallel, as far as I can see. The construction of the XP index is single threaded, with the exception being the https://github.com/ekg/mmmulti we use. Furthermore, which exact command did you use? Which exact error did you get? Without further information I don't know how to help you. Thanks!

ASLeonard commented 1 year ago

Sorry for not being clear. This is the command I was running something like

parallel -j 4 odgi position -i pggb.og -p HER:{1}-{2}  -E -d 1000 -o /dev/stdout ::: <paths of interest> 

So the parallelism is running multiple calls of odgi position simultaneously, not running odgi parallel with multiple threads.

And this is one of the errors when running in the directory /cluster/work/alex/KIT

terminate called after throwing an instance of 'std::logic_error'
  what():  Error: File "/cluster/work/alex/KIT/.pathnames.iv" contains zero symbol.
warning [libhandlegraph]: Serialized object does not appear to match deserialzation type.
warning [libhandlegraph]: It is either an old version or in the wrong format.
warning [libhandlegraph]: Attempting to load it anyway. Future releases will reject it!
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error rewinding to load non-magic-prefixed SerializableHandleGraph

Since there are multiple instances of odgi position running at the same time in the same directory, all calls are seemingly reading/writing to the same "hardcoded" filename of "/cluster/work/alex/KIT/.pathnames.iv"

subwaystation commented 1 year ago

I see. From what I understand the problem is not originating from the XP index, because here we generate the file paths randomly. Also non of the code of the XP index is invoked within position_main.cpp. However, warning [libhandlegraph]: hints that something goes wrong when accessing the input ODGI file? Note sure. Hopefully @ekg knows more.

subwaystation commented 1 year ago

Which version of ODGI are you using @ASLeonard ?

ASLeonard commented 1 year ago

I built odgi from tip (a054641).

From what I understand the problem is not originating from the XP index, because here we generate the file paths randomly. Also non of the code of the XP index is invoked within position_main.cpp.

I'm less sure this is the reason then based on your experience, but the problem only appears when calling multiple odgi positions in parallel and never occurs when running multiple odgi position calls sequentially, so I still think the issue is with different calls racing to the "/cluster/work/alex/KIT/.pathnames.iv" file.

AndreaGuarracino commented 1 year ago

We could fix the problem by generating files with random names. Strangely, your files have "no name", that is they have names starting with a dot.


From: Alex Leonard @.> Sent: Wednesday, April 12, 2023 3:44:56 PM To: pangenome/odgi @.> Cc: Subscribed @.***> Subject: Re: [pangenome/odgi] Race condition in ".pathnames.iv" file in position command (Issue #493)

I built odgi from tip (a054641https://github.com/pangenome/odgi/commit/a054641b220db1b5754d8cb1e298e96f98be439f).

From what I understand the problem is not originating from the XP index, because here we generate the file paths randomly. Also non of the code of the XP index is invoked within position_main.cpp.

I'm less sure this is the reason then based on your experience, but the problem only appears when calling multiple odgi positions in parallel and never occurs when running multiple odgi position calls sequentially, so I still think the issue is with different calls racing to the "/cluster/work/alex/KIT/.pathnames.iv" file.

— Reply to this email directly, view it on GitHubhttps://github.com/pangenome/odgi/issues/493#issuecomment-1505306068, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AO26XHRA3NSGYG7HIHZIYNLXA2WVRANCNFSM6AAAAAAW3F3BTQ. You are receiving this because you are subscribed to this thread.Message ID: @.***>

subwaystation commented 1 year ago

I understand the problem, but I don't understand how it can happen :popcorn: Because I can't follow which part of odgi position should invoke code that would read or write .pathnames.iv.

subwaystation commented 1 year ago

https://github.com/vgteam/libhandlegraph/blob/900f3a8f6bf8615e8446bf63f70077d5cff5f7fa/src/serializable.cpp#L32 This is the guilty code.

subwaystation commented 1 year ago

Ah, the problem could be that you are writing to dev/stdout. Obviously all 4 runs in parallel are writing to the same device. Is this intentional?

ASLeonard commented 1 year ago

I was writing to stdout to further postprocess the output. I can try writing to separate files (based on the unique paths of interest), so that would at least test if the -o /dev/stdout is causing any issues.

I'm not sure why the files have "no name" and are just dot names in the working directory. $TMPDIR is unset (but pretty sure I saw this issue on the compute nodes where $TMPDIR is set).

subwaystation commented 1 year ago

So my current assumption is that when odgi is loading the same graph in parallel, some temporary files are created (I would not know why?) or are assumed to be exisiting, but actually do not contain any information either because:

I am not familiar enough with the libhandlegraph API, maybe @ekg or @adamnovak have some intuition here? Thanks!

AndreaGuarracino commented 1 year ago

I am confused by this command:

parallel -j 4 odgi position -i pggb.og -p HER:{1}-{2}  -E -d 1000 -o /dev/stdout ::: <paths of interest> 

-E requires a file in input and odgi position does not have the -o option. Options of parallel?