lh3 / hickit

TAD calling, phase imputation, 3D modeling and more for diploid single-cell Hi-C (Dip-C) and general Hi-C
100 stars 11 forks source link

Obtaining genome wide contact matrix file in .npy/related format? #9

Open tarak77 opened 5 years ago

tarak77 commented 5 years ago

Hello @lh3 @tanlongzhi ,

I know that we can convert .con to .hic file in the dip-c repository, but Is there a way to get the genome wide contact matrix in .npy or related matrix format at different resolutions? say 250kb, using the impute.pairs file as input(be it for my haploid cell or diploid cell). Though juicer tools have dump command to extract chromosome wise contact matrix as .txt file, we cannot extract the entire genowide contact matrix.

Any help will be awesome!

tanlongzhi commented 5 years ago

Hi Tarak,

In dip-c, you can generate a genome-wide matrix with dip-c bincon, and then generate an accompanying bin description with dip-c bincon -i. However, this function is not thoroughly tested, and can be slow and memory-consuming if the bin size -b INT is set too small.

tarak77 commented 5 years ago

Cool, is there a similar command in dip-c repo that I could refer, just to know the formatting and what input file to give?

tanlongzhi commented 5 years ago

I was talking about the dip-c repo. Because you're not the only one asking, I will probably update the dip-c README for more instructions.

tanlongzhi commented 5 years ago

The dip-c README has been updated with instructions (tanlongzhi/dip-c@8dae2a82816305e50714a7de20ddb8d3cd053e30).

wuheng9 commented 3 years ago

Hi Dr. Tan,

I used the hickit to process my dip-c data. but I got 6709 contact pairs. Is it enough.

Later, I use the "hickit -i in.pairs -u -o imput.pairs -Sr1m -c1 -r10m -c5 -b4m -b1m -b200k -D5 -b50k -D5 -b20k -O out.3dg" to analyze data and got almost no results in out.3dg file.

I attach the contacts.pairs file that I used and the out.3dg file. Please help me find out what's the issue is.

tanlongzhi commented 3 years ago

Hi @wuheng9,

7 k contacts are too few for successful 3D reconstruction. In our original study, we had ~1 m per cell. I imagine the algorithm might work with fewer (e.g. 200 k) for low-resolution reconstruction.

I looked at the files that you sent via email. It seems that your contacts are mostly very short-range -- 90% of your intra-chromosomal contacts are within 100 kb.

Assuming your sequencing data had sufficient depth, the lack of contacts is probably caused by insufficient restriction digestion. We have seen something similar in challenging cell types, in particular mammalian sperm -- do you think that's what happened?

Btw, to know whether your digestion worked at all, you can visually examine a few contact locations in your BAM file, and see if they coincide with the expected restriction site (e.g. GATC for MboI and DpnII).

Best, Tan

wuheng9 commented 3 years ago

Hi Dr. Tan,

Thank you for your suggestion. Yes, I now realize that the few contact pairs might be caused by insufficient restriction digestion. I will digestion cells and construct sequencing library again. I will check the digestion enzyme activities.

Thank you again!

Best ,

Qian Wang

At 2020-10-19 02:19:55, "Longzhi Tan" notifications@github.com wrote:

Hi @wuheng9,

7 k contacts are too few for successful 3D reconstruction. In our original study, we had ~1 m per cell. I imagine the algorithm might work with fewer (e.g. 200 k) for low-resolution reconstruction.

I looked at the files that you sent via email. It seems that your contacts are mostly very short-range -- 90% of your intra-chromosomal contacts are within 100 kb.

Assuming your sequencing data had sufficient depth, the lack of contacts is probably caused by insufficient restriction digestion. We have seen something similar in challenging cell types, in particular mammalian sperm -- do you think that's what happened?

Best, Tan

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

tanlongzhi commented 3 years ago

Hi @wuheng9,

To check digestion, I typically take 5-10 uL from the digestion reaction, add PBS to a total of 100 uL, add 5 uL 0.8 U/uL ProtK (20 mg/mL; e.g. NEB P8107S), mix and incubate at 65 C for 1 h. I'll then purify with any DNA purification kit (e.g. Zymo D4013), and run a gel/Bioanalyzer. You can also do the same to check ligation.

Best, Tan

wuheng9 commented 3 years ago

Hi Dr. Tan,

Thank you very much! I noticed that the data deposited in NCBI are both short and long reads for a cell. Did you construct sequencing libraries separately or make one library and split the reads into two pools depending on read length. How did you conduct data process with the long and short reads for a cell?

I am looking forward to your reply!

Best,

Qian Wang

At 2020-10-19 09:29:26, "Longzhi Tan" notifications@github.com wrote:

Hi @wuheng9,

To check digestion, I typically take 5-10 uL from the digestion reaction, add PBS to a total of 100 uL, add 5 uL 0.8 U/uL ProtK (20 mg/mL; e.g. NEB P8107S), mix and incubate at 65 C for 1 h. I'll then purify with any DNA purification kit (e.g. Zymo D4013), and run a gel/Bioanalyzer. You can also do the same to check ligation.

Best, Tan

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

tanlongzhi commented 3 years ago

Hi @wuheng9, there's only a single library for each sample -- only size-selected (by beads) differently.

You can play with different combinations; I found just a single size (0.65X for META, 0.7X for Nextera) to suffice. Basically, the lower the bead ratio, the higher the percentage of reads containing contacts (which should be proportional to the average insert size) but potentially the fewer contacts in total.

For analysis, I analyze everything together.

Best, Tan

wuheng9 commented 3 years ago

Hi Dr. Tan,

I got that. Thank you!

Best,

Qian Wang

At 2020-10-19 23:08:37, "Longzhi Tan" notifications@github.com wrote:

Hi @wuheng9, there's only a single library for each sample -- only size-selected (by beads) differently.

You can play with different combinations; I found just a single size (0.65X for META, 0.7X for Nextera) to suffice. Basically, the lower the bead ratio, the higher the percentage of reads containing contacts (which should be proportional to the average insert size) but potentially the fewer contacts in total.

For analysis, I analyze everything together.

Best, Tan

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

wuheng9 commented 3 years ago

Hi Tan,

I have used the command "hickit -i impute.pairs.gz --out-png impute.png" to generate a 2D contact map in PNG. But I got a black background and no names for each chromosome. How should I change parameters to chart a figures like one that you published in Science.

I also met another issues when I tried to view 3D structure by using a command "hickit -gl -I /THL8/home/liubin/data/t.dip/SRR7226668.cpg.3dg.gz --view".

The issue is listed as following: hickit: invalid option -- 'g' hickit: invalid option -- 'l' [M::main] Version: r291 [M::main] CMD: hickit -gl -I /THL8/home/liubin/data/t.dip/SRR7226668.cpg.3dg.gz --view

So how should I view the 3D structure. And how can I see some specific region of a genome and save a png file on somewhere.

Thank you very much! I am looking forward to your reply.

Best,

Qian Wang

tanlongzhi commented 3 years ago

Hi Qian,

(1) I visualize contact maps with Juicebox:

java -Xmx2g -jar juicer_tools.jar pre -n contacts.pairs.gz contacts.hic mm10

The resulting file contacts.hic can then be visualized through Juicebox's web browser. Note that you may need to lower the color limit (by clicking the "-" button) to get a desirable color scale.

(2) To use hickit --view, you'll need to compile hickit with make gl=1 (see this section of the README). A pre-compiled version is provided in the v0.1 release (the file hickit-gl; note that -gl is part of the filename, not an option); however, note that it has not been provided for v0.1.1.

(3) For detailed visualization with PyMol, please see this section of the dip-c README. It covers basic 3D viewing. Before that, you'll need to first convert hickit .3dg files to dip-c .3dg files following dip-c's typical workflow.

To highlight a specific genomic region, you can use dip-c reg3 -i interest.reg 20k.1.clean.3dg > interest.3dg first, with a region file interest.reg like this one (for the paternal copy of Chr2:1-2Mb):

2   0   1000000 2000000

You can then generate an additional mmCIF for interest.3dg, to add to the existing whole-genome 20k.1.clean.3dg file in your PyMol.

Best, Tan

wuheng9 commented 3 years ago

Hi Tan,

Thank you very much! I will try to generate the 2D and 3D structure accroding to you suggestions.

Best,

Qian

At 2020-10-28 00:38:36, "Longzhi Tan" notifications@github.com wrote:

Hi Qian,

(1) I visualize contact maps with Juicebox:

java -Xmx2g -jar juicer_tools.jar pre -n contacts.pairs.gz contacts.hic mm10

The resulting file contacts.hic can then be visualized through Juicebox's web browser. Note that you may need to lower the color limit (by clicking the "-" button) to get a desirable color scale.

(2) To use hickit --view, you'll need to compile hickit with make gl=1 (see this section of the README). A pre-compiled version is provided in the v0.1 release (the file hickit-gl; note that -gl is part of the filename, not an option); however, note that it has not been provided for v0.1.1.

(3) For detailed visualization with PyMol, please see this section of the dip-c README. It covers basic 3D viewing. Before that, you'll need to first convert hickit .3dg files to dip-c .3dg files following dip-c's typical workflow.

To highlight a specific genomic region, you can use dip-c reg3 -i interest.reg 20k.1.clean.3dg > interest.3dg first, with a region file interest.reg like this one (for the paternal copy of Chr2:1-2Mb):

2 0 1000000 2000000

You can then generate an additional mmCIF for interest.3dg, to add to the existing whole-genome 20k.1.clean.3dg file in your PyMol.

Best, Tan

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

wuheng9 commented 3 years ago

Hi Tan,

When I converted the .3dg file into a .cif file, there was error.

dip-c color -n home/data/t.dip/hg19.chr.txt /home/data/t.dip/SRR7226668.out.3dg | dip-c vis -c /dev/stdin home/data/t.dip/SRR7226668.out.3dg > /home/data/t.dip/SRR7226668.cif Traceback (most recent call last): File "home/software/dip-c-master/dip-c", line 130, in File "home/software/dip-c-master/dip-c", line 130, in main() File "home/software/dip-c-master/dip-c", line 84, in main main() return_value = vis.vis(sys.argv[1:]) File "home/software/dip-c-master/dip-c", line 75, in main File "home/software/dip-c-master/vis.py", line 58, in vis return_value = color.color(sys.argv[1:]) File "home/software/dip-c-master/color.py", line 149, in color g3d_data = file_to_g3d_data(open(args[0], "rb")) File "home/software/dip-c-master/classes.py", line 1427, in file_to_g3d_data g3d_data = file_to_g3d_data(open(args[0], "rb")) File "home/software/dip-c-master/classes.py", line 1427, in file_to_g3d_data g3d_data.add_g3d_particle(string_to_g3d_particle(g3d_file_line.strip())) g3d_data.add_g3d_particle(string_to_g3d_particle(g3d_file_line.strip())) File "home/software/dip-c-master/classes.py", line 1214, in string_to_g3d_particle File "home/software/dip-c-master/classes.py", line 1214, in string_to_g3d_particle hom_name, ref_locus, x, y, z = g3d_particle_string.split("\t") ValueError: need more than 1 value to unpack hom_name, ref_locus, x, y, z = g3d_particle_string.split("\t") ValueError: need more than 1 value to unpack

I am looking forward to your reply!

Best,

Qian

tanlongzhi commented 3 years ago

Hi @wuheng9, can you send me your .3dg file? You can gzip it to reduce size. Best, Tan

wuheng9 commented 3 years ago

Hi Tan,

Here is the .3dg file. Thank you!

Best,

Qian

At 2020-10-30 01:14:56, "Longzhi Tan" notifications@github.com wrote:

Hi @wuheng9, can you send me your .3dg file? You can gzip it to reduce size. Best, Tan

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

tanlongzhi commented 3 years ago

Hi @wuheng9, I can't see any files. Perhaps it's easier to attach through the GitHub website rather than email? Best, Tan

wuheng9 commented 3 years ago

Hi Tan,

I try to attach the file through Github website. But I do not see a link where I can send the file.

Best,

Qian

tanlongzhi commented 3 years ago

Hi @wuheng9, did you use Python 3? The dip-c repo only supports Python 2. With Python 2.7, I have no problem running that line on my computer. Best, Tan

wuheng9 commented 3 years ago

Hi Tan,

Thank you very much. Yes, I used the python version 2.7. It seems that the problems are from the visulization step. Do I need specific display device on my computer.

Best,

Qian

At 2020-10-30 11:24:04, "Longzhi Tan" notifications@github.com wrote:

Hi @wuheng9, did you use Python 3? The dip-c repo only supports Python 2. With Python 2.7, I have no problem running that line on my computer. Best, Tan

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

wuheng9 commented 3 years ago

Hi Tan,

I stated the pymol software. It shows "Detected 4 CPU cores. Enabled multithreaded rendering. Error: The requested stereo 3D visualization mode is not available." Is that the problem that no color for each chromosome.

Best,

Qian

tanlongzhi commented 3 years ago

Hi @wuheng9, your mmCIF file looks great on my computer, with PyMol 2.3.4. I turned on the color by clicking C->Spectrum->Rainbow on the right. See screenshot:

截屏2020-10-30 下午1 09 34
wuheng9 commented 3 years ago

Hi Tan,

Thank you very much!

Best,

Qian

在 2020-10-31 04:11:22,"Longzhi Tan" notifications@github.com 写道:

Hi @wuheng9, your mmCIF file looks great on my computer, with PyMol 2.3.4. I turned on the color by clicking C->Spectrum->Rainbow on the right. See screenshot:

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.