tariks / peakachu

Genome-wide contact analysis using sklearn
MIT License
57 stars 9 forks source link

Peakachu does not work with .hic files generated with juicertools_2.20.00 #23

Closed varunkas closed 10 months ago

varunkas commented 10 months ago

I get the following output when running peakachu with .hic files generated using juicertools_2.20.00

$ peakachu depth -p DNMT3Amut55.allValidPairs.hic

num of intra reads in your data: 0
Traceback (most recent call last):
  File "/work/sreenivasan/hic_conda_envs_nextflow/peakachu-df5fa75fff3dd22f06b1b830163dfb8c/bin/peakachu", line 91, in <module>
    run()
  File "/work/sreenivasan/hic_conda_envs_nextflow/peakachu-df5fa75fff3dd22f06b1b830163dfb8c/bin/peakachu", line 87, in run
    args.func(args)
  File "/work/sreenivasan/hic_conda_envs_nextflow/peakachu-df5fa75fff3dd22f06b1b830163dfb8c/lib/python3.10/site-packages/peakachu/calculate_depth.py", line 46, in main
    matched_read_num = 3031042417 / genome_size * totals
ZeroDivisionError: division by zero

Is there a workaround? Peakachu works fine for .hic files generated using juicertools_1.22.01 for example. Thanks!

tariks commented 10 months ago

It sounds like juicer changed its hic format in some update between 1.22.01 and 2.20.00; I would investigate further, but haven't the time unfortunately. It's possible our current parser needs updating.

The workaround would be to identify the format differences then write a short script in your preferred language; it could take STDIN as input and adjust the format. Awk / sed / cut are great for this, otherwise string manipulations in Python would do the trick. Downside is converting .hic to text files as intermediate steps which means mucho disk space and an extra failure point.

XiaoTaoWang commented 10 months ago

@varunkas Hi, a simple solution is to convert your .hic files to .cool/.mcool files, and use the converted .cool/.mcool files as input to Peakachu. You may try hic2cool (https://github.com/4dn-dcic/hic2cool) first. If hic2cool does not work, you can use my another software HiCLift for this conversion (https://github.com/XiaoTaoWang/HiCLift#usage).

varunkas commented 10 months ago

Thank you so much for your suggestions. I will give it a try and let you know.

varunkas commented 10 months ago

OK, it seems like hic2cool does not work:

$ hic2cool convert -r 10000 -w "DNMT3Amut55.allValidPairs.hic" "DNMT3Amut55.allValidPairs.cool"
Traceback (most recent call last):
  File "/Users/sreenivasan/miniforge3/envs/hic_py_straw/bin/hic2cool", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/sreenivasan/miniforge3/envs/hic_py_straw/lib/python3.12/site-packages/hic2cool/__main__.py", line 86, in main
    hic2cool_convert(args.infile, args.outfile, args.resolution, args.nproc, args.warnings, args.silent)
  File "/Users/sreenivasan/miniforge3/envs/hic_py_straw/lib/python3.12/site-packages/hic2cool/hic2cool_utils.py", line 860, in hic2cool_convert
    pair_footer_info, expected, factors, norm_info = read_footer(req, mmap_buf, masteridx)
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sreenivasan/miniforge3/envs/hic_py_straw/lib/python3.12/site-packages/hic2cool/hic2cool_utils.py", line 131, in read_footer
    unit = readcstr(f)
           ^^^^^^^^^^^
  File "/Users/sreenivasan/miniforge3/envs/hic_py_straw/lib/python3.12/site-packages/hic2cool/hic2cool_utils.py", line 59, in readcstr
    return buf.decode("utf-8")
           ^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
XiaoTaoWang commented 10 months ago

then please give HiCLift a try :)

varunkas commented 10 months ago

Dear @XiaoTaoWang Thank you so much for pointing me to your HiCLift toolkit. Very easy to use indeed and it worked! I succesfully converted juicebox *.hic file to .mcool file, which was accepted by peakachu to call peaks. :)

I have a naive question (I am new to HiC analysis) - what is the purpose of requiring --out-chromsizes when I am only converting formats? Is it for binning/normalization purpose? And it is it not possible to infer it from the input Juicer HiC file? Thanks!

P.S. I really like the verbosity during the conversion - good to see what it is doing :)

XiaoTaoWang commented 10 months ago

Sometimes the order of chromosomes stored in your input .hic/.mcool files are not as expected (e.g., chr1, chr10, chr11, ...). By using --out-chromsizes, you can reorder the chromosomes as what you want (e.g., chr1, chr2, chr3, ...), and as a result, the contact matrices in your output .hic/mcool files will also be reorganized accordingly.