Open ASLeonard opened 1 year ago
Hi @ASLeonard,
the code link you sent is not run in parallel, as far as I can see. The construction of the XP index is single threaded, with the exception being the https://github.com/ekg/mmmulti we use. Furthermore, which exact command did you use? Which exact error did you get? Without further information I don't know how to help you. Thanks!
Sorry for not being clear. This is the command I was running something like
parallel -j 4 odgi position -i pggb.og -p HER:{1}-{2} -E -d 1000 -o /dev/stdout ::: <paths of interest>
So the parallelism is running multiple calls of odgi position
simultaneously, not running odgi parallel with multiple threads.
And this is one of the errors when running in the directory /cluster/work/alex/KIT
terminate called after throwing an instance of 'std::logic_error'
what(): Error: File "/cluster/work/alex/KIT/.pathnames.iv" contains zero symbol.
warning [libhandlegraph]: Serialized object does not appear to match deserialzation type.
warning [libhandlegraph]: It is either an old version or in the wrong format.
warning [libhandlegraph]: Attempting to load it anyway. Future releases will reject it!
terminate called after throwing an instance of 'std::runtime_error'
what(): Error rewinding to load non-magic-prefixed SerializableHandleGraph
Since there are multiple instances of odgi position
running at the same time in the same directory, all calls are seemingly reading/writing to the same "hardcoded" filename of "/cluster/work/alex/KIT/.pathnames.iv"
I see. From what I understand the problem is not originating from the XP index, because here we generate the file paths randomly. Also non of the code of the XP index is invoked within position_main.cpp
.
However, warning [libhandlegraph]:
hints that something goes wrong when accessing the input ODGI file? Note sure.
Hopefully @ekg knows more.
Which version of ODGI are you using @ASLeonard ?
I built odgi from tip (a054641).
From what I understand the problem is not originating from the XP index, because here we generate the file paths randomly. Also non of the code of the XP index is invoked within position_main.cpp.
I'm less sure this is the reason then based on your experience, but the problem only appears when calling multiple odgi positions
in parallel and never occurs when running multiple odgi position
calls sequentially, so I still think the issue is with different calls racing to the "/cluster/work/alex/KIT/.pathnames.iv" file.
We could fix the problem by generating files with random names. Strangely, your files have "no name", that is they have names starting with a dot.
From: Alex Leonard @.> Sent: Wednesday, April 12, 2023 3:44:56 PM To: pangenome/odgi @.> Cc: Subscribed @.***> Subject: Re: [pangenome/odgi] Race condition in ".pathnames.iv" file in position command (Issue #493)
I built odgi from tip (a054641https://github.com/pangenome/odgi/commit/a054641b220db1b5754d8cb1e298e96f98be439f).
From what I understand the problem is not originating from the XP index, because here we generate the file paths randomly. Also non of the code of the XP index is invoked within position_main.cpp.
I'm less sure this is the reason then based on your experience, but the problem only appears when calling multiple odgi positions in parallel and never occurs when running multiple odgi position calls sequentially, so I still think the issue is with different calls racing to the "/cluster/work/alex/KIT/.pathnames.iv" file.
— Reply to this email directly, view it on GitHubhttps://github.com/pangenome/odgi/issues/493#issuecomment-1505306068, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AO26XHRA3NSGYG7HIHZIYNLXA2WVRANCNFSM6AAAAAAW3F3BTQ. You are receiving this because you are subscribed to this thread.Message ID: @.***>
I understand the problem, but I don't understand how it can happen :popcorn:
Because I can't follow which part of odgi position
should invoke code that would read or write .pathnames.iv
.
Ah, the problem could be that you are writing to dev/stdout
. Obviously all 4 runs in parallel are writing to the same device. Is this intentional?
I was writing to stdout to further postprocess the output. I can try writing to separate files (based on the unique paths of interest), so that would at least test if the -o /dev/stdout
is causing any issues.
I'm not sure why the files have "no name" and are just dot names in the working directory. $TMPDIR is unset (but pretty sure I saw this issue on the compute nodes where $TMPDIR is set).
So my current assumption is that when odgi
is loading the same graph in parallel, some temporary files are created (I would not know why?) or are assumed to be exisiting, but actually do not contain any information either because:
I am not familiar enough with the libhandlegraph
API, maybe @ekg or @adamnovak have some intuition here? Thanks!
I am confused by this command:
parallel -j 4 odgi position -i pggb.og -p HER:{1}-{2} -E -d 1000 -o /dev/stdout ::: <paths of interest>
-E
requires a file in input and odgi position
does not have the -o
option. Options of parallel
?
Hi, Probably not a common case, but querying positions in a graph in an outer parallel loop is buggy, because of this line writing a local pathname file with a fixed name.
https://github.com/pangenome/odgi/blob/2c9a17f0ce1f72382c92b0c84cc334f0d05b7c48/src/algorithms/xp.cpp#L113
Running
parallel -j 1 ...
fixed the issue for me, so pretty sure it is exactly due to a race condition on this line. It seems there is another similar case here.Best, Alex