Help with generating fig 4A,C from dip-c paper

tarak77 commented 5 years ago

Hi Tan, Dip-c indeed has a lot of relevant analysis which are really useful for me. I was trying out to produce results similar to the ones you have in fig 4A,C in your paper. There is a lot to learn from these figures and also how to generate them(like cross section plots for centromere to telomere organisation, intermingling plots) I was reading up on it in supplementary file but could not follow clearly. It would be great if there is a possibility to provide a sample algorithm for generating these figures?

Thanks again!

tanlongzhi commented 5 years ago

Thank you for the kind words. As the hickit repo improves on the speed and accuracy of pre-processing and 3D modeling, this repo is now primarily geared towards contact/structure analysis.

Below is my PyMol code to generate Fig. 4C left panels, for which you'll need an mmCIF file colored by dip-c color -n. Please bear in mind that I'm not an expert in PyMol and therefore may have written inefficient or weird codes.

viewport 400, 400
set ray_shadows,0

# make lighting plain and flat for cross section
set ambient, 1
set specular, off
set ray_opaque_background, off

# generate a slice for cross section
clip slab, 10

# visualize the cell
as sticks, all
set_bond stick_radius, 0.5, all
spectrum b, rainbow, all, 1, 23

# write output
png out.png, 400, 400, ray=1

tanlongzhi commented 5 years ago

For Fig. 4C right, you'll need an mmCIF file colored by dip-c color -d3 (this may take a while to run for high-resolution structures). The only PyMol line you need to change (from the code above) is the spectrum line. You should use:

spectrumany b, gray90 red, all, 0.3, 1.9

For Fig. 1A, you'll need an mmCIF file colored by dip-c color -L. Again, the only PyMol line you need to change is the spectrum line. You should use:

spectrumany b, blue gray90 red, all, 0, 1

tarak77 commented 5 years ago

Thats Cool, thanks. I am actually more interested towards the quantification of the centromere-telomere organisation as well as intermingling index? I could not understand it from the supplementary text. Basically how to find out the values for the two axes?

tanlongzhi commented 5 years ago

I see. For Fig. 1A, to calculate the horizontal axis, you first need to get the 3D coordinates of all centromeres and telomeres with dip-c pos:

dip-c pos -l cen_tel.leg file.3dg

Then you can manually sum up all the telomere coordinates, and subtract by all the centromere coordinates times 2 (because for each chromosome, we want to add (tel1 - cen) + (tel2 - cen) = tel1 + tel2 - 2*cen).

Finally, you can calculate the vector length of the summed coordinates, and divide it by the total number of particles (basically wc -l file.3dg).

tanlongzhi commented 5 years ago

To calculate the vertical axis, you additionally need the 3D coordinate of the nuclear center of mass, which you can get by averaging all the coordinates in a .3dg file:

awk -F'\t' -v OFS='\t' '{x+=$3;y+=$4;z+=$5;n++}END{print x/n,y/n,z/n}' file.3dg

Then you need to calculate the distance from each centromere and telomere to the nuclear center of mass (namely, their radial positioning).

Finally, you can sum up all the telomere distances, and subtract by all the centromere distances times 2. The result will again be divided by the total number of particles.

tarak77 commented 5 years ago

For the horizontal axis, is there a way to get the cen_tel.leg file? How to define the centromeres and telomeres?

tanlongzhi commented 5 years ago

I just added a script in 15a9a1450e6d2b83a052d2f8294a8eda5747119d. You can now run:

scripts/cen_to_leg.sh color/hg19.chr.cen

The output (color/hg19.chr.cen.leg) will have telomere 1, centromere, telomere 2 for each chromosome.

Centromeres were defined as the midpoint of each listed centromere from the UCSC Table Browser. Telomeres were simply set to the two endpoints of each chromosome.

tarak77 commented 5 years ago

Thats awesome Tan! I will get that working for mm10 which will help understand its dynamics. To be clear, the file.3dg used in horizontal and vertical axis is the output coordinates .3dg file? And while calculating the vector length of summed coordinates, I don't understand the wc -l part? I don't see wc script? Lastly, In the paper you mentioned "total number of particles differ between human and mouse", the total number of particles are the total 3d coordinates?

tanlongzhi commented 5 years ago

Sounds great. Yes, that's the final 3D structure file.

wc is a command and should come with your operating system. Specially, wc -l counts the total number of lines in a file.

Yes, the total number of particles is the total number of lines in a .3dg file. This normalization is necessary because longer genomes will have bigger chromosomes.

tarak77 commented 5 years ago

Ah I thought it was some script. So after running dip-c pos, I know I can use the coordinates produced in terminal, or is there an explicit file produced? Wait, how to differentiate between centromeres and telomeres from the output coordinates?

tanlongzhi commented 5 years ago

Like most dip-c commands, you can redirect the result into a file:

dip-c pos -l cen_tel.leg file.3dg > file.pos

The output has the same ordering as the input .leg file.

tarak77 commented 5 years ago

Cool

tarak77 commented 5 years ago

A quick question, it may be trivial but what does it mean to calculate the vector length of summed coordinates? why does taking that make sense for differentiate between more parallel or more random configuration?

tanlongzhi commented 5 years ago

The vector length is just sqrt(x^2 + y^2 + z^2) for a vector (x, y, z).

More random configuration will lead to cancelation when vectors are summed, leading to a near-zero summed vector.

tarak77 commented 5 years ago

gotcha, thanks

tarak77 commented 5 years ago

Finally, when using dip-c color -L option for fig4A, there seems to be an error

(py27) wc-dhcp178d105:dip-c tarakshisode$ ./dip-c color -L ./color/mm10.chr.txt ./test_cells/deepvariant_testing/f_4cse-8/impute3.round4.clean.3dg | ./dip-c vis -c /dev/stdin ./test_cells/deepvariant_testing/f_4cse-8/impute3.round4.clean.3dg > ./test_cells/deepvariant_testing/f_4cse-8/impute3.round4.clean.n.cif
[M::color] read a 3D structure with 241471 particles at 20000 bp resolution
Traceback (most recent call last):
  File "./dip-c", line 122, in <module>
    main()
  File "./dip-c", line 73, in main
    return_value = color.color(sys.argv[1:])
  File "/Users/tarakshisode/dip-c/color.py", line 163, in color
[M::vis] read a 3D structure with 241471 particles at 20000 bp resolution
    ref_name, ref_len, ref_cen = color_file_line.strip().split("\t")
ValueError: need more than 1 value to unpack
[M::__main__] command: ./dip-c vis -c /dev/stdin ./test_cells/deepvariant_testing/f_4cse-8/impute3.round4.clean.3dg
[M::__main__] finished in 26.9 sec

tanlongzhi commented 5 years ago

You need to use dip-c color -L color/mm10.chr.cen file.3dg, according to the instructions in dip-c color.

tarak77 commented 5 years ago

Thank you very much for pointing that out. Though mentioned in color.py code, and since I am not getting a clear understanding, may I ask what do each thesecolor options mean h, i, I, S, d, r, C, D, s ?

Which option would serve best to show the intermingling between homologs of each chromosome?

tanlongzhi commented 5 years ago

Sure thing. Some additional explanations:

-h: color each locus (e.g. chr1(mat):520000) by its distance to its homologous locus (in this case, 3D distance between chr1(mat):520000 and chr1(pat):520000).
-i: for each locus, calculate the percentage of its 3D-neighboring particles (within a distance threshold) that are from the same chromosome (must be the same haplotype, too).
-I: similar to -i, but instead of percentage, calculate the number of particles.
-S: must be used with -i or -I, to further restrict the calculation (the numerator in the case of -i) to particles within a certain genomic distance (bp) from each locus.
-d: for each locus, calculate the Shannon diversity of chromosomes (two homologs count as two different chromosomes) within a distance threshold.
-r: similar to -d, but instead of Shannon diversity, calculate the total number of chromosomes.
-C: for each locus, calculate its distance to the whole nucleus' center of mass (namely, its radial positioning).
-D: for each locus, calculate its distance to a given locus.
-s: must be used with other options, to smooth the calculated color values by averaging 3D-neighboring particles (within a distance threshold).

tanlongzhi commented 5 years ago

I don't think there's an existing functionality for general intermingling between homologs of each chromosome, because -h only quantifies intermingling between each pair of specific homologous loci.

I would probably use dip-c reg3 to remove everything except the two homologs of a chromosome, for example, leaving only chr1(mat) and chr1(pat) by:

dip-c reg3 -i chr1.reg in.3dg > out.3dg

Then I would use dip-c color -I, which will calculate for each locus if it's near the homologous chromosome.

tarak77 commented 5 years ago

That is really a great help!

For fig 4C, you used spectrumany b, gray90 red, all, 0.3, 1.9. How does one define the minimum(0.3) and maximum(1.9)? or can we find out these ranges from mmcif files?

Similar with fig 4A, spectrumany b, blue gray90 red, all, 0, 1. Why did you select 0,1?

I ask that because, after finding the mmcif file from dip-c color -I, how to get the last two parameters for spectrumany from mmcif files? I want to color only the interacting regions between homologs for a chromosome.

tanlongzhi commented 5 years ago

No problem. I just played around with the numbers until the plot looks great.

For Shannon diversity (-d), you can find what the values mean on Wikipedia. For example, if around a certain locus, there's equal intermingling between 3 chromosomes, the diversity value will be 3*ln(3) = 3.3. If around a certain locus, there's no intermingling at all, the diversity value will be 0.

For centromeres and telomeres (-L), 0 means centromeres and 1 means telomeres. The midpoint between a centromere and a telomere is 0.5.

For your purpose (-I), 0 means no intermingling at all, and a higher value means more intermingling. So you may try 0,1 for a binary coloring.

tarak77 commented 5 years ago

Okay, also when doing -d3, it just means calculate the diversity within a distance of 3 particle radii? While doing-I, should we specify certain distance ?

Moreover, is it possible to determine the genomic location(or coordinates) of the interacting regions from mmcif file?

tanlongzhi commented 5 years ago

Yes and yes. Parameter requirements (such as -I FLOAT) can be found in the manual of the dip-c color command (namely, the text you see when typing only dip-c color).

Yes. Please refer to the lines starting with HETATM. For example, for the line below:

HETATM  .      32  chr1(mat)   003  1  140       22.0263      -27.7424      24.0552      0.0075

The locus chr1(mat):3140000 (the column 003 means Mb, and the column 140 means kb) has a color value of 0.0075.

Alternatively, you can also just redirect the output of dip-c color to a file, without using dip-c vis. Such output file will be much easier to read.

tarak77 commented 5 years ago

Cool!! I am a bit confused about how to use -S with -I. The FLOAT with -Idoes specify the distance. What does it mean to further restrict the calculations by using -S ?

tarak77 commented 5 years ago

Ah I see now, after dip-c reg3 to extract a particular chromosome, what I used is color -h. This gives me the color values(the smaller they are the closer the two homologs are).

Thanks again!

tarak77 commented 5 years ago

Hey @tanlongzhi , In the reply above https://github.com/tanlongzhi/dip-c/issues/24#issuecomment-457716977, it makes sense to take two telomeres for human chromosomes, but when we look at mouse chromosomes, isn't one end centromere and other end as telomere? The calculations might change in this case, right?

tanlongzhi commented 5 years ago

No, I think mouse chromosomes still have telomeres on both ends.

tarak77 commented 5 years ago

Are you sure? I searched online and found this:

tanlongzhi commented 5 years ago

You need to find a figure with telomeres labeled.

tarak77 commented 5 years ago

Its strange that figure like that is not easily found! But I anyhow did summed (tel-cen) vectors for all chromosomes and then took its length. Similar for y-axis. And the results were not different than the previous calculation because in the previous calculations, telomere1 coords equals the centromere coords and therefore one of the parenthesis vanishes.

Sorry for the confusion.

tanlongzhi / dip-c

Help with generating fig 4A,C from dip-c paper #24