paulsengroup / hictk

Blazing fast toolkit to work with .hic and .cool files
MIT License
23 stars 1 forks source link

"attempt of out-of-bound read" when using -t option #271

Closed bskubi closed 6 days ago

bskubi commented 2 weeks ago

I've been running into errors trying to convert .mcool to .hic format using hictk. I've been trying to pin down the source of the error. As far as I can tell, it's using the -t option, as opposed to accepting the default.

Using this: hictk convert -t 2 [prefix].mcool [prefix].hic

Results in the error:

[2024-09-30 09:33:54.606] [info]: ingesting pixels at 3150599 pixels/s...
[2024-09-30 09:33:57.458] [info]: writing header at offset 0
[2024-09-30 09:33:57.458] [info]: begin writing interaction blocks to file "test.hic"...
[2024-09-30 09:33:57.458] [info]: [1000 bp] writing pixels for chr1:chr1 matrix at offset 4992...
FAILURE! hictk convert encountered the following error: an error occurred while writing file "test.hic": an error occurred while writing the chr1:chr1 matrix at 1000 resolution to file "test.hic": an error occurred while writing pixels for chr1:chr1 to file "test.hic": an error occurred while writing interaction blocks using a single thread: caught an attempt of out-of-bound read

(I've also seen it associated with other errors that unfortunately I seem to have deleted and that are hard to reproduce. The one above is the most common.)

This:

hictk convert [prefix].mcool [prefix].hic

Leads to successful file conversion on my local PC. However, on SLURM using a workflow parallelized by Nextflow, the option with or without the -t argument leads to the above failure.

robomics commented 2 weeks ago

Thanks for opening this issue!

Very interesting... It looks like some kind of data race inside the HiCFileWriter. I can reproduce the issue only when using exactly 2 threads (which is odd, because in that case there should only be one thread in charge of writing data to the HiCFileWriter, and nothing should be reading from it).

I will investigate a bit and let you know once I know more.

bskubi commented 2 weeks ago

Awesome, thank you very much for your quick response. FYI, I get the same error when hictk convert is run on four files in parallel, on SLURM, using singularity with hictk installed, for mcool -> hic conversion, using all default arguments (hictk convert [prefix].mcool [prefix].hic), or when -t is specified with 2, 4, 10, 15, or 20 threads. This is in the context of a Nextflow workflow.

The error is eliminated if I enforce that Nextflow only launch one conversion at a time. But if I permit it to launch two file conversions simultaneously (i.e. two independent instances of hictk convert using two different .mcool files as input), then the first one to finish ingesting pixels and start writing interaction blocks crashes with this error, leading to the entire workflow to crash.

Oddly, however, if I run more than one instance of hictk with no -t option specified on my local machine but using & after the command in order to have the command be non-blocking, then I am able to successfully launch multiple conversion processes simultaneously.

I'm really hoping to bundle hictk with my workflow, so I'm hoping this can be resolved! Thanks for your work, this is a needed tool.

robomics commented 1 week ago

I am glad to hear you've found hictk to be useful!

The error you are describing when running multiple instances of hictk convert in parallel on the same system could be a different issue.

Can you try to run hictk convert with --tmpdir=./tmp (so that each instance of hictk generates temporary files in its own Nextflow working dir)?

This is just to rule out the possibility that the issue is due to path collisions when generating random paths for the temporary files. This used to be an issue before hictk v1.0.0. However, https://github.com/paulsengroup/hictk/pull/152 should be enough to make collisions of temporary file paths extremely rare (basically impossible without malicious intent).

bskubi commented 1 week ago

Yeah I'm giving it a try now on -t 20 using a local PC with a .sh script that launches both conversions simultaneously into different tmp directories. I tried this with four conversions recently using default threads and no explicit tmp directory. Two conversions were launched at the same time by ending the command with & to make the command non-blocking, and two were launched sequentially by leaving the & out. The sequentially launched ones ran successfully, the simultaneous ones failed. So that does seem to fit your hypothesis about path collisions. I'll let you know how this goes.

bskubi commented 1 week ago

Can confirm that running with -t 8 and two simultaneous processes with different explicitly specified tmpdir values results in a successful conversion.

robomics commented 1 week ago

Thanks for running the tests!

I was able to reproduce the bug with something like:

for i in $(seq 10); do
    hictk convert test/data/integration_tests/4DNFIZ1ZVXC8.mcool::/resolutions/1000 /tmp/test.$i.hic --threads 3 --force |& tee /tmp/log.$i.txt > /dev/null &;
done
wait

I've opened https://github.com/paulsengroup/hictk/pull/274 to fix the (very silly) mistake that lead to this bug.

I've tested this change on a linux machine, and this seems to address the bug independently of the number of CPU cores used (which makes sense given the nature of the bug).

Here are the x86 containers built from the branch with the bugfix:

If you have time, could you try to see if the proposed fix solves the issue on your end?

robomics commented 6 days ago

I just merged the fix (the updated container images should be ready in 1-2h). Feel free to re-open this ticket if the issue still persists on your end :)