tanlongzhi / dip-c

Tools to analyze Dip-C (or other 3C/Hi-C) data
61 stars 18 forks source link

Similar to issue #12 #23

Closed tarak77 closed 5 years ago

tarak77 commented 5 years ago

Hey Tan,

Is there a way to generate contact matrices based on chromosomes? Say if I want a matrix showing just chr1 or if I want to show chr1,chr2

tanlongzhi commented 5 years ago

Hey Tarak,

Sorry this repo doesn't have this functionality for now; but below is a possible, step-by-step work-around:

  1. Choose the chromosomes you want from color/hg19.chr.len to form a new file subset.chr.len like this:

    1   249250621
    2   243199373
  2. Choose contacts from the chromosomes you want from your .con file (for example, dedup.con.gz). This can be achieved by making a file subset.reg first:

    1   .   .   .
    2   .   .   .

    Then you can form a new file subset.con.gz with dip-c reg:

    dip-c reg -i subset.reg dedup.con.gz | gzip > subset.con.gz
  3. Finally, you can use dip-c bincon with these modified files to get a sub-matrix:

    dip-c bincon -b 5000000 -H -l subset.chr.len subset.con.gz > subset.bincon.txt
    dip-c bincon -i -b 5000000 -H -l subset.chr.len . > subset.bincon.info

I haven't tested the above; but it should in principle work. Alternatively, you can convert to .hic and use the "dump" functionality of Juicebox.

tarak77 commented 5 years ago

Okay. I am trying for an intrachromosomal matrix for chr1

  1. I have a file chr1.len 1 249250621

  2. Then a chr1.con file

    
    1,3004053,. 1,7413719,0
    1,3008604,0 1,8842171,.
    1,3008933,. 1,3828523,.
    1,3008983,. 1,3355303,.
    1,3009311,0 1,8808171,0
    1,3010718,. 1,3647611,1
    1,3011265,1 1,8148555,1
    1,3017050,0 1,6084006,.
    1,3017383,0 1,5215240,.
    1,3020906,1 1,12695923,.
    1,3033864,0 1,6780424,.
    1,3034715,0 1,3037475,.
    1,3037170,1 1,8142295,1
    1,3038175,0 1,3097299,.
    1,3042903,. 1,3798926,0
    1,3043055,. 1,3455626,.
    .
    .
    .

While using dip-c reg, it seems to give error though

Traceback (most recent call last): File "./dip-c", line 122, in main() File "./dip-c", line 49, in main return_value = reg.reg(sys.argv[1:]) File "/Users/tarakshisode/dip-c/reg.py", line 72, in reg inc_regs.extend(file_to_reg_list(inc_file)) File "/Users/tarakshisode/dip-c/classes.py", line 1072, in file_to_reg_list reg_list.append(string_to_reg(reg_file_line.strip())) File "/Users/tarakshisode/dip-c/classes.py", line 1059, in string_to_reg ref_name, haplotype, start, end = reg_string.split("\t") ValueError: need more than 2 values to unpack



Any help?
tanlongzhi commented 5 years ago

Can I see your command and your .reg file? Both the .len and .reg files need to be tab-delimited.

Btw if you already have all chr1 contacts in a file chr1.con, you don't need to run dip-c reg and can proceed directly to dip-c bincon.

tarak77 commented 5 years ago

I didn't understand how to make a.reg file. So since I want a matrix for chr1, I manually select the chr1 legs from contacts.con file to produce chr1.con. Note that I remove the chr string as you mentioned.

(py27) wc-dhcp178d105:dip-c tarakshisode$ ./dip-c bincon -b 100000 -H -l ./test_cells/deepvariant_testing/f_4cse-8/chr1.len ./test_cells/deepvariant_testing/f_4cse-8/chr1.con > ./test_cells/deepvariant_testing/f_4cse-8/chr1.bincon.txt

But even after using chr1.con.gz file with bincon, I am getting error

(py27) wc-dhcp178d105:dip-c tarakshisode$ ./dip-c bincon -b 100000 -H -l ./test_cells/deepvariant_testing/f_4cse-8/chr1.len ./test_cells/deepvariant_testing/f_4cse-8/chr1.contacts.con.gz > ./test_cells/deepvariant_testing/f_4cse-8/chr1.bincon.txt
Traceback (most recent call last):
  File "./dip-c", line 122, in <module>
    main()
  File "./dip-c", line 55, in main
    return_value = bincon.bincon(sys.argv[1:])
  File "/Users/tarakshisode/dip-c/bincon.py", line 92, in bincon
    ref_name, ref_len = chr_len_file_line.strip().split("\t")
ValueError: need more than 1 value to unpack
tanlongzhi commented 5 years ago

It seems your chr1.len file is space-delimited rather than tab-delimited.

tarak77 commented 5 years ago

Oh ok. I am trying to use the .reg file format but get the error

(py27) wc-dhcp178d105:dip-c tarakshisode$ ./dip-c reg -i ./test_cells/deepvariant_testing/f_4cse-8/chr1.reg ./test_cells/deepvariant_testing/f_4cse-8/contacts.con.gz | gzip > ./test_cells/deepvariant_testing/f_4cse-8/chr1.con.gz
[M::reg] only include the following regions:
chr hap start   end
1   .   .   .
[M::reg] only exclude the following regions:
chr hap start   end
[M::reg] assume haploid in the following regions:
chr hap start   end
[M::reg] read 444328 putative contacts (41.58% legs phased)
Traceback (most recent call last):
  File "./dip-c", line 122, in <module>
    main()
  File "./dip-c", line 49, in main
    return_value = reg.reg(sys.argv[1:])
  File "/Users/tarakshisode/dip-c/reg.py", line 108, in reg
    sys.stderr.write("[M::" + __name__ + "] writing output for " + str(con_data.num_cons()) + " putative contacts (" + str(round(100.0 * con_data.num_intra_chr() / con_data.num_cons(), 2)) + "% intra-chromosomal, " + str(round(100.0 * con_data.num_phased_legs() / con_data.num_cons() / 2, 2)) + "% legs phased)\n")
ZeroDivisionError: float division by zero
tanlongzhi commented 5 years ago

My guess is that your contacts.con.gz file has chromosome naming like chr1 rather than 1? If the problem persists, can I see your contacts.con.gz file?

tarak77 commented 5 years ago

I see, it fixed the error. But when I run dip-c bincon, I get an error

(py27) wc-dhcp178d105:dip-c tarakshisode$ ./dip-c bincon -b 100000 -H -l ./test_cells/deepvariant_testing/f_4cse-8/chr1.len ./test_cells/deepvariant_testing/f_4cse-8/chr1.con.gz > ./test_cells/deepvariant_testing/f_4cse-8/chr1.bincon.100kb.txt
[M::bincon] read 33649 putative contacts (100.0% intra-chromosomal, 43.04% legs phased)
[M::bincon] generating a 2494 x 2494 matrix
Traceback (most recent call last):
  File "./dip-c", line 122, in <module>
    main()
  File "./dip-c", line 55, in main
    return_value = bincon.bincon(sys.argv[1:])
  File "/Users/tarakshisode/dip-c/bincon.py", line 118, in bincon
    matrix_data = con_data_to_matrix(con_data, hom_offsets, matrix_bin_size, matrix_size, merge_haplotypes, display_num_cons)
  File "/Users/tarakshisode/dip-c/bincon.py", line 32, in con_data_to_matrix
    matrix_index_1, matrix_index_2 = con_to_matrix_index(con, hom_offsets, matrix_bin_size, merge_haplotypes)
  File "/Users/tarakshisode/dip-c/bincon.py", line 19, in con_to_matrix_index
    matrix_index_1 = leg_to_matrix_index(con.leg_1(), hom_offsets, matrix_bin_size, merge_haplotypes)
  File "/Users/tarakshisode/dip-c/bincon.py", line 16, in leg_to_matrix_index
    return hom_offsets[hom_name] + int(round(float(ref_locus) / matrix_bin_size))
KeyError: '1(pat)'

Will it be possible to define chr1 instead of 1 throughout? I think it could fix the bincon error?

tanlongzhi commented 5 years ago

Can I have your chr1.len and chr1.con.gz?

Yes, you can certainly use naming like chr1 throughout.

tarak77 commented 5 years ago

Archive.zip

Okay ill use chr1 for all files then.

tarak77 commented 5 years ago

Here are the plots produced using dip-c and juicer tools at 500kb

screenshot 2019-01-24 at 8 52 01 pm screenshot 2019-01-24 at 4 41 16 pm

Will it be possible to produce matrices for individual homologs using the steps you mentioned by replacing chr1 bychr1(mat) or chr1(pat)?

I ask that because I want to compare the input matrices with simulated contact matrix generated from 3d coordinates using power law method. It will make sense to compare matrices of homologs individually. Does it make sense to do so?

tanlongzhi commented 5 years ago

Looks great! To separate homologs, you can simply use dip-c bincon without the -H option; however, you must make sure that all contact legs have known haplotypes (either 0 or 1). It doesn't seem to be the case for your file, where many legs have . as the haplotype.

tarak77 commented 5 years ago

I see, so if I want a contact map for just chr1(mat):

  1. I define a .len file as chr1(mat) 195471971

  2. A .reg file as chr1(mat) . . .

  3. While getting the chr1.mat.con file I get an error

    (py27) wc-dhcp178d105:dip-c tarakshisode$ ./dip-c reg -i ./test_cells/deepvariant_testing/f_4cse-8/chr1.mat.reg ./test_cells/deepvariant_testing/f_4cse-8/impute3.round2.con.gz | gzip > ./test_cells/deepvariant_testing/f_4cse-8/chr1.mat.con.gz
    [M::reg] only include the following regions:
    chr hap start   end
    chr1(mat)   .   .   .
    [M::reg] only exclude the following regions:
    chr hap start   end
    [M::reg] assume haploid in the following regions:
    chr hap start   end
    [M::reg] read 286213 putative contacts (100.0% legs phased)
    Traceback (most recent call last):
    File "./dip-c", line 122, in <module>
    main()
    File "./dip-c", line 49, in main
    return_value = reg.reg(sys.argv[1:])
    File "/Users/tarakshisode/dip-c/reg.py", line 108, in reg
    sys.stderr.write("[M::" + __name__ + "] writing output for " + str(con_data.num_cons()) + " putative contacts (" + str(round(100.0 * con_data.num_intra_chr() / con_data.num_cons(), 2)) + "% intra-chromosomal, " + str(round(100.0 * con_data.num_phased_legs() / con_data.num_cons() / 2, 2)) + "% legs phased)\n")
    ZeroDivisionError: float division by zero
tanlongzhi commented 5 years ago

Sorry about the confusion. You need to run dip-c bincon with the same files as before (when -H is on), but without -H. The output matrix will now be a larger matrix with 4 sub-matrices. You can separate the sub-matrices manually afterwards, based on the column information from dip-c bincon -i.