Closed tarak77 closed 5 years ago
Hey Tarak,
Sorry this repo doesn't have this functionality for now; but below is a possible, step-by-step work-around:
Choose the chromosomes you want from color/hg19.chr.len
to form a new file subset.chr.len
like this:
1 249250621
2 243199373
Choose contacts from the chromosomes you want from your .con
file (for example, dedup.con.gz
). This can be achieved by making a file subset.reg
first:
1 . . .
2 . . .
Then you can form a new file subset.con.gz
with dip-c reg
:
dip-c reg -i subset.reg dedup.con.gz | gzip > subset.con.gz
Finally, you can use dip-c bincon
with these modified files to get a sub-matrix:
dip-c bincon -b 5000000 -H -l subset.chr.len subset.con.gz > subset.bincon.txt
dip-c bincon -i -b 5000000 -H -l subset.chr.len . > subset.bincon.info
I haven't tested the above; but it should in principle work. Alternatively, you can convert to .hic
and use the "dump" functionality of Juicebox.
Okay. I am trying for an intrachromosomal matrix for chr1
I have a file chr1.len
1 249250621
Then a chr1.con file
1,3004053,. 1,7413719,0
1,3008604,0 1,8842171,.
1,3008933,. 1,3828523,.
1,3008983,. 1,3355303,.
1,3009311,0 1,8808171,0
1,3010718,. 1,3647611,1
1,3011265,1 1,8148555,1
1,3017050,0 1,6084006,.
1,3017383,0 1,5215240,.
1,3020906,1 1,12695923,.
1,3033864,0 1,6780424,.
1,3034715,0 1,3037475,.
1,3037170,1 1,8142295,1
1,3038175,0 1,3097299,.
1,3042903,. 1,3798926,0
1,3043055,. 1,3455626,.
.
.
.
While using dip-c reg, it seems to give error though
Traceback (most recent call last):
File "./dip-c", line 122, in
Any help?
Can I see your command and your .reg
file? Both the .len
and .reg
files need to be tab-delimited.
Btw if you already have all chr1 contacts in a file chr1.con
, you don't need to run dip-c reg
and can proceed directly to dip-c bincon
.
I didn't understand how to make a.reg
file.
So since I want a matrix for chr1, I manually select the chr1 legs from contacts.con
file to produce chr1.con
. Note that I remove the chr
string as you mentioned.
(py27) wc-dhcp178d105:dip-c tarakshisode$ ./dip-c bincon -b 100000 -H -l ./test_cells/deepvariant_testing/f_4cse-8/chr1.len ./test_cells/deepvariant_testing/f_4cse-8/chr1.con > ./test_cells/deepvariant_testing/f_4cse-8/chr1.bincon.txt
But even after using chr1.con.gz file with bincon, I am getting error
(py27) wc-dhcp178d105:dip-c tarakshisode$ ./dip-c bincon -b 100000 -H -l ./test_cells/deepvariant_testing/f_4cse-8/chr1.len ./test_cells/deepvariant_testing/f_4cse-8/chr1.contacts.con.gz > ./test_cells/deepvariant_testing/f_4cse-8/chr1.bincon.txt
Traceback (most recent call last):
File "./dip-c", line 122, in <module>
main()
File "./dip-c", line 55, in main
return_value = bincon.bincon(sys.argv[1:])
File "/Users/tarakshisode/dip-c/bincon.py", line 92, in bincon
ref_name, ref_len = chr_len_file_line.strip().split("\t")
ValueError: need more than 1 value to unpack
It seems your chr1.len
file is space-delimited rather than tab-delimited.
Oh ok. I am trying to use the .reg
file format but get the error
(py27) wc-dhcp178d105:dip-c tarakshisode$ ./dip-c reg -i ./test_cells/deepvariant_testing/f_4cse-8/chr1.reg ./test_cells/deepvariant_testing/f_4cse-8/contacts.con.gz | gzip > ./test_cells/deepvariant_testing/f_4cse-8/chr1.con.gz
[M::reg] only include the following regions:
chr hap start end
1 . . .
[M::reg] only exclude the following regions:
chr hap start end
[M::reg] assume haploid in the following regions:
chr hap start end
[M::reg] read 444328 putative contacts (41.58% legs phased)
Traceback (most recent call last):
File "./dip-c", line 122, in <module>
main()
File "./dip-c", line 49, in main
return_value = reg.reg(sys.argv[1:])
File "/Users/tarakshisode/dip-c/reg.py", line 108, in reg
sys.stderr.write("[M::" + __name__ + "] writing output for " + str(con_data.num_cons()) + " putative contacts (" + str(round(100.0 * con_data.num_intra_chr() / con_data.num_cons(), 2)) + "% intra-chromosomal, " + str(round(100.0 * con_data.num_phased_legs() / con_data.num_cons() / 2, 2)) + "% legs phased)\n")
ZeroDivisionError: float division by zero
My guess is that your contacts.con.gz
file has chromosome naming like chr1
rather than 1
? If the problem persists, can I see your contacts.con.gz
file?
I see, it fixed the error.
But when I run dip-c bincon
, I get an error
(py27) wc-dhcp178d105:dip-c tarakshisode$ ./dip-c bincon -b 100000 -H -l ./test_cells/deepvariant_testing/f_4cse-8/chr1.len ./test_cells/deepvariant_testing/f_4cse-8/chr1.con.gz > ./test_cells/deepvariant_testing/f_4cse-8/chr1.bincon.100kb.txt
[M::bincon] read 33649 putative contacts (100.0% intra-chromosomal, 43.04% legs phased)
[M::bincon] generating a 2494 x 2494 matrix
Traceback (most recent call last):
File "./dip-c", line 122, in <module>
main()
File "./dip-c", line 55, in main
return_value = bincon.bincon(sys.argv[1:])
File "/Users/tarakshisode/dip-c/bincon.py", line 118, in bincon
matrix_data = con_data_to_matrix(con_data, hom_offsets, matrix_bin_size, matrix_size, merge_haplotypes, display_num_cons)
File "/Users/tarakshisode/dip-c/bincon.py", line 32, in con_data_to_matrix
matrix_index_1, matrix_index_2 = con_to_matrix_index(con, hom_offsets, matrix_bin_size, merge_haplotypes)
File "/Users/tarakshisode/dip-c/bincon.py", line 19, in con_to_matrix_index
matrix_index_1 = leg_to_matrix_index(con.leg_1(), hom_offsets, matrix_bin_size, merge_haplotypes)
File "/Users/tarakshisode/dip-c/bincon.py", line 16, in leg_to_matrix_index
return hom_offsets[hom_name] + int(round(float(ref_locus) / matrix_bin_size))
KeyError: '1(pat)'
Will it be possible to define chr1 instead of 1 throughout? I think it could fix the bincon error?
Can I have your chr1.len
and chr1.con.gz
?
Yes, you can certainly use naming like chr1
throughout.
Okay ill use chr1 for all files then.
Here are the plots produced using dip-c and juicer tools at 500kb
Will it be possible to produce matrices for individual homologs using the steps you mentioned by replacing chr1
bychr1(mat)
or chr1(pat)
?
I ask that because I want to compare the input matrices with simulated contact matrix generated from 3d coordinates using power law method. It will make sense to compare matrices of homologs individually. Does it make sense to do so?
Looks great! To separate homologs, you can simply use dip-c bincon
without the -H
option; however, you must make sure that all contact legs have known haplotypes (either 0
or 1
). It doesn't seem to be the case for your file, where many legs have .
as the haplotype.
I see, so if I want a contact map for just chr1(mat):
I define a .len
file as
chr1(mat) 195471971
A .reg
file as
chr1(mat) . . .
While getting the chr1.mat.con
file I get an error
(py27) wc-dhcp178d105:dip-c tarakshisode$ ./dip-c reg -i ./test_cells/deepvariant_testing/f_4cse-8/chr1.mat.reg ./test_cells/deepvariant_testing/f_4cse-8/impute3.round2.con.gz | gzip > ./test_cells/deepvariant_testing/f_4cse-8/chr1.mat.con.gz
[M::reg] only include the following regions:
chr hap start end
chr1(mat) . . .
[M::reg] only exclude the following regions:
chr hap start end
[M::reg] assume haploid in the following regions:
chr hap start end
[M::reg] read 286213 putative contacts (100.0% legs phased)
Traceback (most recent call last):
File "./dip-c", line 122, in <module>
main()
File "./dip-c", line 49, in main
return_value = reg.reg(sys.argv[1:])
File "/Users/tarakshisode/dip-c/reg.py", line 108, in reg
sys.stderr.write("[M::" + __name__ + "] writing output for " + str(con_data.num_cons()) + " putative contacts (" + str(round(100.0 * con_data.num_intra_chr() / con_data.num_cons(), 2)) + "% intra-chromosomal, " + str(round(100.0 * con_data.num_phased_legs() / con_data.num_cons() / 2, 2)) + "% legs phased)\n")
ZeroDivisionError: float division by zero
Sorry about the confusion. You need to run dip-c bincon
with the same files as before (when -H
is on), but without -H
. The output matrix will now be a larger matrix with 4 sub-matrices. You can separate the sub-matrices manually afterwards, based on the column information from dip-c bincon -i
.
Hey Tan,
Is there a way to generate contact matrices based on chromosomes? Say if I want a matrix showing just chr1 or if I want to show chr1,chr2