Open tarak77 opened 5 years ago
Hi Tarak,
Thanks to @lh3's awesome implementation, hickit is especially fast in two steps: (1) hickit.js reads the SNP list into the memory and then does a single, very fast pass over the SAM file; and (2) hickit does very fast dedup by efficiently scanning the contacts. Besides preprocessing, hickit is also fast in imputation and 3D modeling.
Thanks @tanlongzhi @lh3. Would it be awesome for hickit to implement the 3D imputation step as well!? When I compare the 3D models between the two repo, dip-c gives more pronounced and contiguous at 20kb resolution.
The seemingly "noncontiguous" hickit models are in fact a clever design of adaptive genomic resolution, where 3D particles (20 kb each, in this case) that harbor no contacts are automatically merged with nearby particles to form larger particles (40 kb each, or larger). This can be a better way to handle repetitive regions such as the centromeres, and has nothing to do with 3D vs 2D imputation.
This mode can be turned off to produce more dip-c-/nuc_dynamics-like models by the hickit -M
option.
Sorry, but why is it a better way to handle repetitive regions like centromeres?
ALso, in https://github.com/tanlongzhi/dip-c/issues/16#issuecomment-424486496 you mentioned about more realistic modeling of the chromatin, did you mean how the coarse grained chromatin polymer is defined? Using Strings and Binders Switch models will give more realistic conformations?
In dip-c/nuc_dynamics, regions that harbor no contacts (and therefore no constraints) will float away from the nucleus, forming large, free-floating loops either outside or on the surface of the nucleus. Although these regions are later deleted (for example, by dip-c clean3
), such floating behavior is certainly incorrect.
Your understanding is correct. Recent super-resolution studies sometimes show a variety of complicated chromatin shapes. Perhaps the models you mentioned will help.
Btw this repo is now updated (a05858fcd58fa959aec49ba98dba22190d40231b) with dip-c vis -a
to connect adjacent beads regardless of their genomic separation (bp) to support hickit 3D models.
I see, I have some observations from the following steps I used for my new data:
hickit
is more faster in preprocessing steps, I implement hickit before dip-c for modellingcontact.pairs
and impute.pairs
from hickit to conatcts.con.gz
and impute.con.gz
respectively readme
of this repo with the default resolutions. Note that while implementing dip-c impute3 -3
i use contacts.con.gz
instead of clean.con.gz
as my second argument..3dg
to .cif
file using the default parameters in readme
, without the recent modification https://github.com/tanlongzhi/dip-c/issues/17#issuecomment-426632918Below is the screenshot shown for chr2 mat and pat
As you can see these extended and free floating loops were also seen in some other chromosomal pairs. Mostly these look like defects and not the inherit structure. Even after all the dip-c clean3 -c
steps from readme
they were present. Can we remove these defects manually?
The structures differed when we compare them with the modeling from .3dg
file of hickit.
For getting the .cif
from .3dg
of hickit, i did:
.3dg
file at 20kb resolution from hickitscripts/hickit_3dg_to_3dg_rescale.sh
to convert to dip-c .3dg
file..cif
file using the default command mentioned in readme.
Again i didnt use the modificationdip-c vis -a
I get the following screenshot for chr2 mat and pat
From hickit .3dg
files, i didnt see these open/floating loopy structure. But i could see these non contiguous particles. And again as you mentioned is that we should find 3D models from hickit as a lower resolution?
But again i have not tried the hickit -M
option nor the dip-c vis -a
. From what i have, dip-c models look much better(except these defects) than hickit. Probably the updates you mentioned will make hickit models even better?
dip-c models look much better
This is interesting. Why do you think the dip-c model looks much better? Because the backbone threads look smoother? Does the 3d model of haploid hi-c also look smooth?
For the published pbmc-11, dip-c imputation gives us 494k contacts, while hickit gives 694k. One possibility is that hickit is too aggressive and imputes non-existing contacts which distort the backbone threads.
@lh3 Just by looking, yes dip-c looks better, but its these free floating loops @tanlongzhi was talking about- are still present even after dip-c clean3
. Is there a way to manually remove them?
Regarding hickit, as @tanlongzhi mentioned in https://github.com/lh3/hickit/issues/5#issuecomment-421670815, finding models at low resolution will help? For haploid cell, dip-c/nuc_dynamics
did give a smooth backbone threads but again some free floating loops at 100kb, compared to hickit where there were no free floating loops.
Again, for the diploid cell, in above comments- were the screenshots for chr2, below are the screnshots for chr17 at 20kb:
dip-c
hickit
It seems to me that in both cases, you're observing aberrantly stretched backbones rather than free-floating loops. (Free-floating loops are soft and bendy; these are very straight.)
This seems to be the result of incorrect imputation, mistakenly assigning intrahomologous contacts to interhomologous contacts.
Btw please make sure to use the latest scripts/hickit_3dg_to_3dg_rescale.sh
because I recently fixed a bug of forgetting to multiply by 2.
Ok, i am also trying to use dip-c vis -a
on my new rescaled .3dg files from hickit, but get the error:
.
.
M::color] analyzed 171000 particles (99.55%)
[M::color] writing 171776 colors (100.0%)
Traceback (most recent call last):
File "./dip-c", line 130, in <module>
main()
File "./dip-c", line 84, in main
return_value = vis.vis(sys.argv[1:])
File "/home/tshisode/dip-c/vis.py", line 58, in vis
g3d_data = file_to_g3d_data(open(args[0], "rb"))
File "/home/tshisode/dip-c/classes.py", line 1427, in file_to_g3d_data
g3d_data.add_g3d_particle(string_to_g3d_particle(g3d_file_line.strip()))
File "/home/tshisode/dip-c/classes.py", line 1214, in string_to_g3d_particle
hom_name, ref_locus, x, y, z = g3d_particle_string.split("\t")
ValueError: need more than 3 values to unpack
Traceback (most recent call last):
File "./dip-c", line 130, in <module>
main()
File "./dip-c", line 75, in main
return_value = color.color(sys.argv[1:])
File "/home/tshisode/dip-c/color.py", line 246, in color
sys.stdout.write("\t".join([hom_name, str(ref_locus), str(color_data[(hom_name, ref_locus)])]) + "\n")
IOError: [Errno 32] Broken pipe
Any help?
Just by looking, yes dip-c looks better
I was not talking about "free floating loops". I know those are probably imputation errors as @tanlongzhi said. I am asking how you know dip-c looks better. Is it only because it looks smoother? Is smoother always more correct?
Yes @lh3, its just the smoothness. Its may not be necessary that smoother is always correct. Need to do more analysis on many cells.
Hi @tanlongzhi ,
How to implement dip-c vis -a
https://github.com/tanlongzhi/dip-c/issues/17#issuecomment-426632918 ? to connect adjacent beads to support hickit models.
You can run something like this:
dip-c color -n color/hg19.chr.txt cell.3dg | dip-c vis -a -c /dev/stdin cell.3dg > cell.n.cif
cool
A quick question: is it possible/valid to run the 3D modeling and imputation steps of dip-c on many threads like bwa mem does?
Unfortunately multithreaded modeling and imputation are not currently supported.
Okay, is it valid to multithread though?
Hi @tanlongzhi ,
I recently noticed that you added hickit_3dg_to_3dg_rescale_backbone.sh
, hickit_3dg_to_3dg_rescale_unit.sh
in scripts. Which should I use to convert hicks.3dg
to dip-c.3dg
?
Hi Tarak,
The script hickit_3dg_to_3dg_rescale_unit.sh
uses the new unit
information that @lh3 kindly added (lh3/hickit@04dd6f04fc0678c06c054715285bdc4914df8332) to the header of hickit .3dg
files. This script will set the optimal backbone distance to 2.0, which should be the same setting as nuc_dynamics.
The script hickit_3dg_to_3dg_rescale_backbone.sh
instead sets the median of the actual backbone distance to 2.0, and can work on old and new hickit .3dg
files.
Neither of the scripts has been tested thoroughly.
Okay I see now.
Just thinking out loud, and sorry for going back and forth, but is it valid to say that multithreaded could be used for modeling and 3D imputations for this repo?
I'm not sure. You can probably ask 3D experts. Currently, multithreading is not my top priority, because for each cell, we typically needs 3-5 replicates. Each replicate is run as a separate thread, which is already a 3-5X speedup.
Okay cool
Hello tanlongzhi,
Can you let me know how you generated fig.2F from the paper (Three-dimensional genome structures of single diploid human cells)
Thanks in advance
Hi @jijibio,
Below are some of my PyMol codes for one of Fig. 2F's panels. You must start with an mmCIF file colored by dip-c color -l
.
First you'll need to select the highlighted domain and name it tad
:
tad_chain = "2(pat)"
tad_start = 206840000
tad_end = 209280000
tad_start_string = str(tad_start).rjust(9,'0')
tad_start_resn = tad_start_string[0:3]
tad_start_name = tad_start_string[3:6]
cmd.select("tad_start", "chain \"" + tad_chain + "\" and resn " + tad_start_resn + " and name " + tad_start_name)
tad_start_id = cmd.id_atom("tad_start")
tad_end_string = str(tad_end).rjust(9,'0')
tad_end_resn = tad_end_string[0:3]
tad_end_name = tad_end_string[3:6]
cmd.select("tad_end", "chain \"" + tad_chain + "\" and resn " + tad_end_resn + " and name " + tad_end_name)
tad_end_id = cmd.id_atom("tad_end")
cmd.select("tad", "id " + str(tad_start_id) + "-" + str(tad_end_id))
In the same way, you can select the gray domain and name it old_tad
:
tad_chain = "2(pat)"
tad_start = 204880000
tad_end = 221360000
tad_start_string = str(tad_start).rjust(9,'0')
tad_start_resn = tad_start_string[0:3]
tad_start_name = tad_start_string[3:6]
cmd.select("tad_start", "chain \"" + tad_chain + "\" and resn " + tad_start_resn + " and name " + tad_start_name)
tad_start_id = cmd.id_atom("tad_start")
tad_end_string = str(tad_end).rjust(9,'0')
tad_end_resn = tad_end_string[0:3]
tad_end_name = tad_end_string[3:6]
cmd.select("tad_end", "chain \"" + tad_chain + "\" and resn " + tad_end_resn + " and name " + tad_end_name)
tad_end_id = cmd.id_atom("tad_end")
cmd.select("old_tad", "id " + str(tad_start_id) + "-" + str(tad_end_id))
After positioning the camera, you can generate a figure for the highlighted domain:
viewport 400, 400
set ray_shadows,0
hide all
select chr, chain "2(pat)"
color white, old_tad
as sticks, tad
set_bond stick_radius, 0.5, all
spectrum b, rainbow, tad
png tad.png, 800, 800, ray=1
Then another figure with the gray domain, too:
show sticks, old_tad
png old_tad.png, 800, 800, ray=1
Finally, you can overlay the two images (for example, in Illustrator), with old_tad.png
on top and half transparent.
Hello tanlongzhi,
Thanks for your quick response. I am working with bulk HIC data, Is it possible to get the 3D model for the same.
Thanks
Hi @jijibio,
The 3D modeling part of this repo is primarily geared towards single cells, not bulk (the same as this thread we had in the hickit repo).
For bulk 3D modeling, there have been multiple papers on this subject in the past few years. Please refer to those for your purpose.
Hi Dr. Tan,
I recently run the dip-c seg and got errors as follows:
Traceback (most recent call last):
File "/THL8/home/liubin/software/dip-c-master/dip-c", line 130, in
Hi @liubinnk1,
Just to confirm, did you use Python 2 rather than 3? This repo is only compatible with Python 2.
In addition, our latest workflow (see README) uses hickit for this pre-processing step, which is much faster.
Best, Tan
Hi Tan,
I solved that problem by using python 2.7. Thank you so much for your reply.
Best,
Bin Liu
发件人:Longzhi Tan notifications@github.com 发送日期:2020-10-10 06:15:27 收件人:tanlongzhi/dip-c dip-c@noreply.github.com 抄送人:liubinnk1 liubinnk@nankai.edu.cn,Mention mention@noreply.github.com 主题:Re: [tanlongzhi/dip-c] Dip-c vs hickit for preprocessing (#17)
Hi Tan,
Happy Monday! I remember reading in the readme of this repo that hickit is much faster than dip-c. I know that hickit does not implement the 3D imputation step like Dip-c does. But during the read alignment, both use BWA MEM aligner- this aligner is based on BWT/FM-index and gives improved performance but a large index build up time.
I ask these questions because instead of working on a single cell, I wonder how much time it will take to generate many models from many cells. Generating m number of models will help in downstream analysis. I know that we can generate m models for a given cell explicitly https://github.com/tanlongzhi/dip-c/issues/5#issuecomment-421641544 , and we could do this in parallel for many cells?
Thanks again!