medvedevgroup / SibeliaZ

A fast whole-genome aligner based on de Bruijn graphs
http://medvedevgroup.com/

sibeliaz :: spoa call code duplication #7

Closed EricDeveaud closed 5 years ago

EricDeveaud commented 5 years ago

Hello,

It seems to me that the sibeliaz code contains duplicated code in the align function:

find $outdir/blocks -name "*.fa" -printf "$PWD/%p\n" | xargs -I @ -P "$threads" bash -c "align @ '$outfile' '$DIR'"
find $outdir/blocks -name "*.fa" -printf "$PWD/%p\n" | xargs -I @ -P 1 bash -c "align @ '$outfile' '$DIR'"

It looks like it performs the spoa alignment twice: first with xargs running $threads processes, then with just one process.

I don't see the point of removing the input files of successful spoa jobs and then performing a new run on the remaining input files.

Regards,

Eric

iminkin commented 5 years ago

Hi @EricDeveaud,

Thanks for paying such close attention to the code :) The reason for this duplication is that the first line tries to align as many blocks as possible in parallel. Unfortunately, the global aligner spoa is quite memory-hungry, so running a bunch of those jobs in parallel sometimes uses up too much memory and they crash. The second line then realigns the crashed jobs using just one process at a time, so that each one has as much memory available as possible. Because a successful job removes its input file, only the blocks that failed in the first pass are still there for the second.
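
As a rough illustration of that two-pass pattern (a sketch only, not the actual SibeliaZ code): the worker below is a hypothetical stand-in for the align function, and the names ALIGN_WORKER, run_pass and the simulate_oom flag are made up for this example. The "crash" is simulated by making any block whose name starts with "big" fail during the parallel pass.

```shell
#!/bin/sh
# Sketch of "parallel first, then serial retry of whatever is left".

# Hypothetical stand-in for the real align function.
# On success it appends a line to the output file and deletes its input,
# so any *.fa file still present after a pass marks a job that failed.
ALIGN_WORKER='
input="$1"; outfile="$2"; simulate_oom="$3"
case "$(basename "$input")" in
    big*) [ "$simulate_oom" = 1 ] && exit 1 ;;   # pretend we ran out of memory
esac
echo "aligned $(basename "$input")" >> "$outfile" && rm -- "$input"
'

# Run the worker over every remaining block with the given number of processes.
# xargs returns non-zero if any job failed; that is expected here, hence "|| true".
run_pass() {
    blocks_dir="$1"; outfile="$2"; jobs="$3"; simulate_oom="$4"
    find "$blocks_dir" -name '*.fa' |
        xargs -I @ -P "$jobs" sh -c "$ALIGN_WORKER" _ @ "$outfile" "$simulate_oom" || true
}

# Small demo on throwaway files.
work="$(mktemp -d)"
mkdir "$work/blocks"
touch "$work/blocks/small1.fa" "$work/blocks/small2.fa" "$work/blocks/big1.fa"

run_pass "$work/blocks" "$work/out.txt" 4 1   # pass 1: parallel, the big block "crashes"
echo "blocks left after parallel pass: $(find "$work/blocks" -name '*.fa' | wc -l)"

run_pass "$work/blocks" "$work/out.txt" 1 0   # pass 2: one process, retries the leftovers
echo "blocks left after serial pass: $(find "$work/blocks" -name '*.fa' | wc -l)"

rm -rf "$work"
```

The second pass needs no bookkeeping of its own: the list of failed jobs is simply whatever input files the first pass did not delete.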

It is quite an awkward solution, but I am not aware of a better one. I will probably replace xargs with GNU parallel in the next release, as it allows for better memory control:

https://unix.stackexchange.com/questions/500959/prevent-the-machine-being-slowed-down-by-running-out-of-memory?noredirect=1#comment923337_500959