Closed tarak77 closed 5 years ago
Thank you for the kind words. As the hickit repo improves on the speed and accuracy of pre-processing and 3D modeling, this repo is now primarily geared towards contact/structure analysis.
Below is my PyMOL code to generate the left panels of Fig. 4C, for which you'll need an mmCIF file colored by dip-c color -n. Please bear in mind that I'm not a PyMOL expert, so the code may be inefficient or odd.
viewport 400, 400
set ray_shadows,0
# make lighting plain and flat for cross section
set ambient, 1
set specular, off
set ray_opaque_background, off
# generate a slice for cross section
clip slab, 10
# visualize the cell
as sticks, all
set_bond stick_radius, 0.5, all
spectrum b, rainbow, all, 1, 23
# write output
png out.png, 400, 400, ray=1
For Fig. 4C right, you'll need an mmCIF file colored by dip-c color -d3 (this may take a while to run for high-resolution structures). The only PyMOL line you need to change from the code above is the spectrum line. You should use:
spectrumany b, gray90 red, all, 0.3, 1.9
For Fig. 1A, you'll need an mmCIF file colored by dip-c color -L. Again, the only PyMOL line you need to change is the spectrum line. You should use:
spectrumany b, blue gray90 red, all, 0, 1
That's cool, thanks. I'm actually more interested in quantifying the centromere-telomere organisation and the intermingling index. I could not understand it from the supplementary text. Basically, how do I find the values for the two axes?
I see. For Fig. 1A, to calculate the horizontal axis, you first need to get the 3D coordinates of all centromeres and telomeres with dip-c pos:
dip-c pos -l cen_tel.leg file.3dg
Then you can manually sum up all the telomere coordinates and subtract twice the sum of all the centromere coordinates (because for each chromosome, we want to add (tel1 - cen) + (tel2 - cen) = tel1 + tel2 - 2*cen).
Finally, you can calculate the vector length of the summed coordinates, and divide it by the total number of particles (basically wc -l file.3dg).
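The horizontal-axis calculation described above can be sketched in Python. This is only an illustration under stated assumptions: the function names are made up, the coordinates are toy values, and the positions are assumed to be already grouped per chromosome as (telomere 1, centromere, telomere 2). Reading the actual dip-c pos output is left out because its exact column layout is not shown in this thread.

```python
import math

def summed_cen_to_tel_vector(triples):
    """Sum (tel1 - cen) + (tel2 - cen) = tel1 + tel2 - 2*cen over all chromosomes.

    triples: list of (tel1, cen, tel2), each an (x, y, z) tuple (assumed grouping).
    """
    total = [0.0, 0.0, 0.0]
    for tel1, cen, tel2 in triples:
        for k in range(3):
            total[k] += tel1[k] + tel2[k] - 2.0 * cen[k]
    return tuple(total)

def horizontal_axis_value(triples, n_particles):
    """Length of the summed vector, normalized by the total number of particles."""
    x, y, z = summed_cen_to_tel_vector(triples)
    return math.sqrt(x * x + y * y + z * z) / n_particles

# Toy coordinates, not taken from any real structure.
toy_triples = [
    ((0.0, 0.0, 1.0), (0.0, 0.0, 0.0), (0.0, 0.0, 3.0)),
    ((1.0, 0.0, 2.0), (1.0, 0.0, 0.0), (1.0, 0.0, 2.0)),
]
print(horizontal_axis_value(toy_triples, n_particles=4))
```

Here n_particles stands in for the line count of the .3dg file (what wc -l reports).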
To calculate the vertical axis, you additionally need the 3D coordinate of the nuclear center of mass, which you can get by averaging all the coordinates in a .3dg file:
awk -F'\t' -v OFS='\t' '{x+=$3;y+=$4;z+=$5;n++}END{print x/n,y/n,z/n}' file.3dg
Then you need to calculate the distance from each centromere and telomere to the nuclear center of mass (namely, their radial positioning).
Finally, you can sum up all the telomere distances and subtract twice the sum of all the centromere distances. The result is again divided by the total number of particles.
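A minimal Python sketch of the vertical-axis calculation, again with made-up function names and toy coordinates. It assumes the particle coordinates have already been loaded as (x, y, z) tuples and that centromere and telomere positions are grouped as (telomere 1, centromere, telomere 2) per chromosome. math.dist requires Python 3.8 or later.

```python
import math

def center_of_mass(particles):
    """Average of all particle coordinates (same idea as the awk one-liner)."""
    n = len(particles)
    return tuple(sum(p[k] for p in particles) / n for k in range(3))

def vertical_axis_value(triples, particles):
    """Sum of telomere radial distances minus twice the centromere radial
    distances, normalized by the total number of particles."""
    com = center_of_mass(particles)
    total = 0.0
    for tel1, cen, tel2 in triples:
        total += math.dist(tel1, com) + math.dist(tel2, com) - 2.0 * math.dist(cen, com)
    return total / len(particles)

# Toy structure: four particles centered on the origin.
toy_particles = [(-2.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 1.0, 0.0)]
# One toy chromosome: telomeres at radius 2, centromere at radius 1.
toy_triples = [((2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-2.0, 0.0, 0.0))]
print(vertical_axis_value(toy_triples, toy_particles))
```

A positive value means telomeres sit farther from the nuclear center than centromeres on average; a negative value means the opposite.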
For the horizontal axis, is there a way to get the cen_tel.leg file?
How are the centromeres and telomeres defined?
I just added a script in 15a9a1450e6d2b83a052d2f8294a8eda5747119d. You can now run:
scripts/cen_to_leg.sh color/hg19.chr.cen
The output (color/hg19.chr.cen.leg) will have telomere 1, centromere, telomere 2 for each chromosome.
Centromeres were defined as the midpoint of each listed centromere from the UCSC Table Browser. Telomeres were simply set to the two endpoints of each chromosome.
That's awesome, Tan! I will get that working for mm10, which will help me understand its dynamics.
To be clear, is the file.3dg used for the horizontal and vertical axes the output coordinates .3dg file?
And when calculating the vector length of the summed coordinates, I don't understand the wc -l part. I don't see a wc script.
Lastly, in the paper you mentioned that the "total number of particles differ between human and mouse". Is the total number of particles the total number of 3D coordinates?
Sounds great. Yes, that's the final 3D structure file.
wc is a command and should come with your operating system. Specifically, wc -l counts the total number of lines in a file.
Yes, the total number of particles is the total number of lines in a .3dg file. This normalization is necessary because longer genomes will have bigger chromosomes.
Ah, I thought it was some script.
So after running dip-c pos, I know I can use the coordinates printed in the terminal, but is an explicit file produced?
Wait, how do I differentiate between centromeres and telomeres in the output coordinates?
Like most dip-c commands, you can redirect the result into a file:
dip-c pos -l cen_tel.leg file.3dg > file.pos
The output has the same ordering as the input .leg file.
Cool
A quick question, which may be trivial: what does it mean to calculate the vector length of the summed coordinates? Why does that make sense for differentiating between a more parallel and a more random configuration?
The vector length is just sqrt(x^2 + y^2 + z^2) for a vector (x, y, z).
A more random configuration leads to cancellation when the vectors are summed, giving a near-zero summed vector.
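The cancellation argument can be checked numerically. The sketch below is purely illustrative: it uses synthetic unit vectors rather than real centromere-to-telomere vectors, and all names are made up. Perfectly parallel vectors give a normalized length of 1, while randomly oriented ones give a value close to 0.

```python
import math
import random

def vector_length(v):
    """sqrt(x^2 + y^2 + z^2) for a vector (x, y, z)."""
    x, y, z = v
    return math.sqrt(x * x + y * y + z * z)

def sum_vectors(vectors):
    """Component-wise sum of a list of 3D vectors."""
    return tuple(sum(v[k] for v in vectors) for k in range(3))

def random_unit_vector(rng):
    """Random direction: normalize a 3D Gaussian sample to unit length."""
    while True:
        v = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        length = vector_length(v)
        if length > 0.0:
            return tuple(c / length for c in v)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000
parallel = [(0.0, 0.0, 1.0)] * n                      # all vectors point the same way
scattered = [random_unit_vector(rng) for _ in range(n)]  # random orientations

parallel_score = vector_length(sum_vectors(parallel)) / n
scattered_score = vector_length(sum_vectors(scattered)) / n
print(parallel_score, scattered_score)
```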
gotcha, thanks
Finally, when using the dip-c color -L option for fig4A, there seems to be an error:
(py27) wc-dhcp178d105:dip-c tarakshisode$ ./dip-c color -L ./color/mm10.chr.txt ./test_cells/deepvariant_testing/f_4cse-8/impute3.round4.clean.3dg | ./dip-c vis -c /dev/stdin ./test_cells/deepvariant_testing/f_4cse-8/impute3.round4.clean.3dg > ./test_cells/deepvariant_testing/f_4cse-8/impute3.round4.clean.n.cif
[M::color] read a 3D structure with 241471 particles at 20000 bp resolution
Traceback (most recent call last):
File "./dip-c", line 122, in <module>
main()
File "./dip-c", line 73, in main
return_value = color.color(sys.argv[1:])
File "/Users/tarakshisode/dip-c/color.py", line 163, in color
[M::vis] read a 3D structure with 241471 particles at 20000 bp resolution
ref_name, ref_len, ref_cen = color_file_line.strip().split("\t")
ValueError: need more than 1 value to unpack
[M::__main__] command: ./dip-c vis -c /dev/stdin ./test_cells/deepvariant_testing/f_4cse-8/impute3.round4.clean.3dg
[M::__main__] finished in 26.9 sec
You need to use dip-c color -L color/mm10.chr.cen file.3dg, according to the instructions in dip-c color.
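The traceback shows each line of the -L file being split on tabs into exactly three fields (ref_name, ref_len, ref_cen), so a file with fewer columns triggers the unpacking error. A quick pre-check along these lines may help; the function name and the sample values are hypothetical, and the actual contents of the files in the color/ directory are not shown in this thread.

```python
def find_lines_without_three_columns(lines):
    """Return (line_number, n_fields) for each line that does not split
    into exactly 3 tab-separated fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.strip().split("\t")
        if len(fields) != 3:
            bad.append((number, len(fields)))
    return bad

# Made-up example lines: name, length, centromere position.
three_column_lines = ["chrA\t1000\t400\n", "chrB\t800\t300\n"]
one_column_lines = ["chrA\n", "chrB\n"]

print(find_lines_without_three_columns(three_column_lines))
print(find_lines_without_three_columns(one_column_lines))
```

To check a real file, pass an open file handle instead of a list, for example find_lines_without_three_columns(open(path)).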
Thank you very much for pointing that out.
Although they are mentioned in the color.py code, I don't have a clear understanding of them, so may I ask what each of these color options means: h, i, I, S, d, r, C, D, s?
Which option would serve best to show the intermingling between homologs of each chromosome?
Sure thing. Some additional explanations:
-h: color each locus (e.g. chr1(mat):520000) by its distance to its homologous locus (in this case, the 3D distance between chr1(mat):520000 and chr1(pat):520000).
-i: for each locus, calculate the percentage of its 3D-neighboring particles (within a distance threshold) that are from the same chromosome (must be the same haplotype, too).
-I: similar to -i, but instead of percentage, calculate the number of particles.
-S: must be used with -i or -I, to further restrict the calculation (the numerator in the case of -i) to particles within a certain genomic distance (bp) from each locus.
-d: for each locus, calculate the Shannon diversity of chromosomes (two homologs count as two different chromosomes) within a distance threshold.
-r: similar to -d, but instead of Shannon diversity, calculate the total number of chromosomes.
-C: for each locus, calculate its distance to the whole nucleus' center of mass (namely, its radial positioning).
-D: for each locus, calculate its distance to a given locus.
-s: must be used with other options, to smooth the calculated color values by averaging 3D-neighboring particles (within a distance threshold).
I don't think there's an existing functionality for general intermingling between homologs of each chromosome, because -h only quantifies intermingling between each pair of specific homologous loci.
I would probably use dip-c reg3 to remove everything except the two homologs of a chromosome, for example, leaving only chr1(mat) and chr1(pat) by:
dip-c reg3 -i chr1.reg in.3dg > out.3dg
Then I would use dip-c color -I, which will calculate for each locus if it's near the homologous chromosome.
That is really a great help!
For fig 4C, you used spectrumany b, gray90 red, all, 0.3, 1.9. How does one define the minimum (0.3) and maximum (1.9)? Or can we find these ranges from the mmCIF files?
Similarly, for fig 4A you used spectrumany b, blue gray90 red, all, 0, 1. Why did you select 0 and 1?
I ask because, after getting the mmCIF file from dip-c color -I, how do I get the last two parameters for spectrumany from the mmCIF file? I want to color only the interacting regions between the homologs of a chromosome.
No problem. I just played around with the numbers until the plot looked good.
For Shannon diversity (-d), you can find what the values mean on Wikipedia. For example, if around a certain locus there's equal intermingling between 3 chromosomes, the diversity value will be -3 * (1/3) * ln(1/3) = ln(3) = 1.1. If around a certain locus there's no intermingling at all, the diversity value will be 0.
For centromeres and telomeres (-L), 0 means centromeres and 1 means telomeres. The midpoint between a centromere and a telomere is 0.5.
For your purpose (-I), 0 means no intermingling at all, and a higher value means more intermingling. So you may try 0,1 for a binary coloring.
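To get a feel for the Shannon diversity values mentioned above for -d, here is a small sketch of the standard Shannon index (natural log) over the chromosome labels of neighboring particles. It illustrates the textbook definition only; it is not taken from the dip-c source, and the function name and labels are made up. Two homologs are given different labels, following the description of -d earlier in the thread.

```python
import math
from collections import Counter

def shannon_diversity(labels):
    """Shannon index H = -sum(p * ln(p)) over the proportion p of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())

# Neighborhood with equal contributions from three chromosomes.
mixed = ["chr1(mat)", "chr1(pat)", "chr2(mat)"] * 4
# Neighborhood containing a single chromosome only (no intermingling).
single = ["chr1(mat)"] * 12

print(shannon_diversity(mixed))   # equals ln(3), about 1.10
print(shannon_diversity(single))  # 0.0
```

With this definition, k equally represented chromosomes give ln(k), so an upper color limit of 1.9 would sit a little below ln(7).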
Okay. Also, does -d3 just mean calculating the diversity within a distance of 3 particle radii?
When using -I, should we specify a certain distance?
Moreover, is it possible to determine the genomic location (or coordinates) of the interacting regions from the mmCIF file?
Yes and yes. Parameter requirements (such as -I FLOAT) can be found in the manual of the dip-c color command (namely, the text you see when typing only dip-c color).
Yes. Please refer to the lines starting with HETATM. For example, for the line below:
HETATM . 32 chr1(mat) 003 1 140 22.0263 -27.7424 24.0552 0.0075
The locus chr1(mat):3140000 (the column 003 means Mb, and the column 140 means kb) has a color value of 0.0075.
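The HETATM line above can be decoded with a few lines of Python. This sketch assumes every HETATM line has the same 11 whitespace-separated fields as the example; the meaning of the fields not explained in this thread (the ".", "32" and "1") is not assumed and they are ignored. The function name is made up.

```python
def parse_hetatm_line(line):
    """Return (chromosome, locus_bp, (x, y, z), color) from one HETATM line.

    Field layout is inferred from the single example line only.
    """
    fields = line.split()
    if len(fields) != 11 or fields[0] != "HETATM":
        raise ValueError("unexpected HETATM layout: %r" % line)
    chrom = fields[3]
    # Mb column and kb column combine into a base-pair position.
    locus_bp = int(fields[4]) * 1_000_000 + int(fields[6]) * 1_000
    x, y, z = (float(value) for value in fields[7:10])
    color = float(fields[10])
    return chrom, locus_bp, (x, y, z), color

example = "HETATM . 32 chr1(mat) 003 1 140 22.0263 -27.7424 24.0552 0.0075"
chrom, locus_bp, xyz, color = parse_hetatm_line(example)
print("%s:%d color=%s" % (chrom, locus_bp, color))
```

Filtering the parsed lines by color value (for example, color > 0 after dip-c color -I) would then give the genomic positions of the regions of interest.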
Alternatively, you can also just redirect the output of dip-c color to a file, without using dip-c vis. Such an output file will be much easier to read.
Cool!! I am a bit confused about how to use -S with -I. The FLOAT given to -I specifies the distance. What does it mean to further restrict the calculation by using -S?
Ah, I see now. After using dip-c reg3 to extract a particular chromosome, what I used is color -h. This gives me the color values (the smaller they are, the closer the two homologs are).
Thanks again!
Hey @tanlongzhi, regarding the reply above (https://github.com/tanlongzhi/dip-c/issues/24#issuecomment-457716977): it makes sense to take two telomeres for human chromosomes, but for mouse chromosomes, isn't one end the centromere and the other end the telomere? The calculations might change in this case, right?
No, I think mouse chromosomes still have telomeres on both ends.
Are you sure? I searched online and found this:
You need to find a figure with telomeres labeled.
It's strange that a figure like that is not easily found! Anyhow, I summed the (tel - cen) vectors for all chromosomes and then took the length of the sum, and did the same for the y-axis. The results did not differ from the previous calculation because, in the previous calculation, the telomere 1 coordinates equal the centromere coordinates, so one of the parentheses vanishes.
Sorry for the confusion.
Hi Tan, Dip-c indeed has a lot of relevant analyses that are really useful for me. I was trying to produce results similar to those in Fig. 4A and 4C of your paper. There is a lot to learn from these figures and from how they are generated (such as the cross-section plots for centromere-to-telomere organisation and the intermingling plots). I read up on it in the supplementary file but could not follow it clearly. Would it be possible to provide a sample algorithm for generating these figures?
Thanks again!