parklab / HiNT

HiC for copy Number variation and Translocation detection
MIT License

Error running CNV from .hic files with hg38 #21

Open ksmetz opened 2 years ago


Hello,

Thank you so much for all of your hard work developing this wonderful tool! I have run HiNT successfully on the test data provided (hg19), but encountered the following error when trying to run it on my own data (hg38).

The command

hint cnv -m /data/test.hic \
   -f juicer \
   --refdir /data/HiNT_ref/refData/hg38 \
   -r 50 \
   -g hg38 \
   -n TEST \
   --bicseq /usr/local/apps/bicseq2/0.7.3/ \
   -e HindIII \
   -o /data/TEST_CNV

The error

From log.out:

[12:57:50] Argument List: 
[12:57:50] Hi-C contact matrix = /data/test.hic
[12:57:50] Hi-C contact matrix format = juicer
[12:57:50] resolution = 50 kb
[12:57:50] Genome = hg38
[12:57:50] BICseq directory = /usr/local/apps/bicseq2/0.7.3/
[12:57:50] Name = TEST
[12:57:50] Output directory = /data/TEST_CNV
HiC version:  8
One of the chromosomes wasn't found in the file. Check that the chromosome name matches the genome.

From log.err:

Traceback (most recent call last):
  File "/usr/local/apps/hint/2.2.7/bin/hint", line 201, in <module>
    main()
  File "/usr/local/apps/hint/2.2.7/bin/hint", line 194, in main
    cnvrun(argparser)
  File "/usr/local/Anaconda/envs_app/hint/2.2.7/lib/python3.6/site-packages/HiNT/runhint.py", line 79, in cnvrun
    rowSumFilesInfo = getGenomeRowSums(opts.resolution, opts.matrixfile, chromlf, opts.outdir,opts.name)
  File "/usr/local/Anaconda/envs_app/hint/2.2.7/lib/python3.6/site-packages/HiNT/getGenomeRowSumsFromHiC.py", line 69, in getGenomeRowSums
    sumInfo = getSumPerChrom(i, j, hicfile, binsize, chroms, chromInfo, sumInfo)
  File "/usr/local/Anaconda/envs_app/hint/2.2.7/lib/python3.6/site-packages/HiNT/getGenomeRowSumsFromHiC.py", line 20, in getSumPerChrom
    result = straw('NONE', hicfile, str(chr1), str(chr2), 'BP', binsize)
  File "/usr/local/Anaconda/envs_app/hint/2.2.7/lib/python3.6/site-packages/HiNT/straw.py", line 471, in straw
    master=list1[0]
TypeError: 'int' object is not subscriptable

Potential issue

The issue seems to come from lines 18-19 of the getGenomeRowSumsFromHiC.py script. These lines trim the "chr" prefix from the chromosome names before passing them to the straw function.

However, for hg38 (at least for the .hic file I am working with), straw only works when the "chr" prefix is included. For example, straw("NONE", "/data/test.hic", "chr1", "chr1", "BP", 50000) returns data, while straw("NONE", "/data/test.hic", "1", "1", "BP", 50000) fails with the same TypeError seen when launching HiNT CNV.
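To illustrate the mismatch, here is a minimal sketch that does not call straw at all. The helper names and the two sets of chromosome names are hypothetical stand-ins: whether a given .hic file stores "chr1" or "1" depends on how it was built, so the sets below are assumptions rather than facts about any particular file.

```python
def strip_chr_prefix(name):
    """Drop a leading "chr", mimicking the trimming described above."""
    return name[3:] if name.startswith("chr") else name


def missing_after_strip(requested, names_in_file):
    """Return the requested names that are not found in the file once trimmed."""
    return [n for n in requested if strip_chr_prefix(n) not in names_in_file]


# Hypothetical name sets; a real .hic file may use either convention.
prefixed_names = {"chr1", "chr2", "chrX"}    # like my hg38 file
unprefixed_names = {"1", "2", "X"}           # assumed style of the hg19 test data
requested = ["chr1", "chr2", "chrX"]

print(missing_after_strip(requested, prefixed_names))
print(missing_after_strip(requested, unprefixed_names))
```

With the prefixed set every trimmed name is missing, which matches the "One of the chromosomes wasn't found in the file" message in log.out; with the unprefixed set nothing is missing.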

Possible solution?

One solution would be to remove these lstrip calls from the script. However, this might cause issues for other genome builds (e.g., hg19). If the chromosome names are being taken from the hg19.len and hg38.len files, then this solution could still work with hg19 by removing the "chr" prefix there, although I am not sure whether that would affect other steps.
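Another option that would leave the .len files alone is to pick whichever spelling the file actually contains. The sketch below is only an illustration: resolve_chrom_name is a hypothetical helper, not part of HiNT or straw, and it assumes the set of chromosome names stored in the .hic file can be obtained separately.

```python
def resolve_chrom_name(name, names_in_file):
    """Return the spelling of `name` present in the file, trying both with
    and without the "chr" prefix. Hypothetical helper, not HiNT code."""
    if name in names_in_file:
        return name
    # Try the other convention: drop the prefix if present, add it if absent.
    alternative = name[3:] if name.startswith("chr") else "chr" + name
    if alternative in names_in_file:
        return alternative
    raise KeyError(f"chromosome {name!r} not found in file")


# Hypothetical name sets standing in for two differently built .hic files.
print(resolve_chrom_name("chr1", {"chr1", "chr2"}))
print(resolve_chrom_name("chr1", {"1", "2"}))
```

The resolved name could then be passed to straw in place of the trimmed one, so the same code path would handle both naming conventions. Raising an explicit error for an unknown chromosome would also make the failure easier to read than the current TypeError.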

I completely understand if that is too disruptive of a change to make. I wanted to still post this regardless in case any other users are experiencing similar difficulties.