tanlongzhi / dip-c

Tools to analyze Dip-C (or other 3C/Hi-C) data

Regarding chromatin compartments #25

Closed tarak77 closed 5 years ago

tarak77 commented 5 years ago

Hi @tanlongzhi , Hope you are doing good!

  1. From the notes, what does it mean to rank-normalise compartments from 0 to 1?

  2. Could you also explain how to obtain the plots in Fig. 4B, E of the paper?

From the computed cpg.txt files, dip-c color -c basically interpolates the CpG frequencies onto the genomic coordinates in each cell. I don't understand how to do cell typing based on the resulting .cif files. In the paper you mentioned:

Our conclusions held if compartments were defined on the basis of contacts

I might be overthinking it but I suppose you confirmed your PCA clustering based on genomic structure with the plots in fig S17 obtained from contacts?

Any help again will be really great!

tanlongzhi commented 5 years ago

Hi @tarak77, Hope all is well for you too!

  1. For each cell, I rank-normalized colors of its N particles with MATLAB's tiedrank. Normally, tiedrank produces values from 1 (lowest) to N (highest); I rescaled the output so that they are from 0 (lowest) to 1 (highest).
  2. Fig. 4B, E basically involved merging all cells with dip-c mgcolor. I'll explain them in more detail later.
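The rank normalization in point 1 can be sketched in Python as follows. This is only an illustration, not the code used for the paper: the function name `rank_normalize` is made up here, ties receive their average rank (as MATLAB's tiedrank does), and the rescaling `(rank - 1) / (N - 1)` is an assumption about how the 1..N ranks were mapped to 0..1.

```python
def rank_normalize(values):
    """Average-rank ties (tiedrank-style), then rescale ranks to [0, 1]."""
    n = len(values)
    if n < 2:
        # Assumption: with a single value there is nothing to rank, so use 0.
        return [0.0] * n
    # Indices of the values, sorted from lowest to highest value.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Extend the run [start, end] over all values tied with values[order[start]].
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        # 1-based average rank shared by every member of the tied run.
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    # Assumed rescaling: rank 1 maps to 0.0 and rank N maps to 1.0.
    return [(rank - 1) / (n - 1) for rank in ranks]


print(rank_normalize([0.3, 0.1, 0.2]))
print(rank_normalize([5, 5, 1, 9]))
```

Applied to one cell, `values` would be the colors of its N particles; the lowest color maps to 0 and the highest to 1.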

The main idea behind the PCA plot is to average CpG frequencies in 3D space, which is achieved with dip-c color -c color/hg19.cpg.20k.txt -s3 (-s3 sets the 3D distance threshold to 3.0 particle radii).
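To illustrate the idea of averaging in 3D space, here is a rough Python sketch. It is not dip-c's implementation: the function `smooth_in_3d` and its arguments are hypothetical, the threshold is expressed in the same units as the coordinates, and it assumes that each particle is included in its own neighborhood. How dip-c color treats those details is not covered here.

```python
import math


def smooth_in_3d(positions, values, max_distance):
    """Replace each particle's value with the mean over its 3D neighbors.

    positions: list of (x, y, z) tuples, one per particle.
    values: list of per-particle values (e.g. CpG frequencies), same order.
    max_distance: neighbors are particles within this Euclidean distance.
    """
    smoothed = []
    for center in positions:
        total = 0.0
        count = 0
        for position, value in zip(positions, values):
            # Assumption: the particle itself (distance 0) counts as a neighbor.
            if math.dist(center, position) <= max_distance:
                total += value
                count += 1
        smoothed.append(total / count)
    return smoothed


# Toy data: two nearby particles and one far away.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
values = [0.2, 0.4, 0.9]
print(smooth_in_3d(positions, values, max_distance=3.0))
```

The brute-force double loop is quadratic in the number of particles, which is fine for a toy example but would likely need a spatial index for whole-genome structures.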

The sentence you referred to meant confirming the PCA with colors produced with dip-c color2 (from raw contact files).

tarak77 commented 5 years ago

Thanks @tanlongzhi ! Sounds good. Really interested in seeing how to use dip-c mgcolor to cluster my data.

tarak77 commented 5 years ago

Heya, how do I use the -m option of mgcolor? I do have some particles missing from some files.

tanlongzhi commented 5 years ago

Hey @tarak77, below are some instructions on data clustering:

  1. For each sample, calculate the raw (un-normalized) chromatin compartment value along the diploid genome:

    dip-c color -c color/hg19.cpg.20k.txt -s3 sample.3dg > sample.cpg_s3.color

  2. Merge compartment values from all samples into a matrix:

    dip-c mgcolor *.cpg_s3.color > cpg_s3.colors

    The output file is now ready for downstream analysis. Each row is a genomic locus (e.g. a 20-kb bin), while each column is a sample. The file contains both row headers (two columns: homolog name, genomic coordinate) and column headers (one row: sample file name).

By default, any missing data (a locus present in one sample but not in another) is represented by -1.0; this value can be changed with -m, for example to -2.0 with -m -2.0.

  3. For most of my PCA analyses, I simply remove any rows with missing data.

  4. I will then rank-normalize each sample, for example with tiedrank (MATLAB) on each column.

  5. The matrix is now ready for PCA, for example with pca (MATLAB).
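For anyone not using MATLAB, removing rows with missing data might look like the Python sketch below. It assumes the merged matrix has already been read into memory as (homolog, coordinate, values) tuples; the helper name `drop_incomplete_rows`, the homolog labels, and the toy numbers are illustrative only, and parsing the output file is left out because its delimiter is not described here.

```python
def drop_incomplete_rows(rows, missing=-1.0):
    """Keep only loci that have a value in every sample.

    rows: list of (homolog, coordinate, values) tuples, where values holds
          one number per sample (one per column of the merged matrix).
    missing: placeholder used for missing data (-1.0 unless changed with -m).
    """
    return [row for row in rows if missing not in row[2]]


# Toy matrix with two samples; the second locus is missing in sample 1.
rows = [
    ("1(pat)", 0, [0.1, 0.2]),
    ("1(pat)", 20000, [-1.0, 0.3]),
    ("1(mat)", 0, [0.4, 0.5]),
]
kept = drop_incomplete_rows(rows)

# Transpose to one tuple per sample, ready for per-column rank normalization.
columns = list(zip(*(values for _, _, values in kept)))
print(len(kept), columns)
```

Each entry of `columns` is then one sample, which can be rank-normalized on its own before the matrix is passed to a PCA routine.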

tarak77 commented 5 years ago

Cool, what's the -d option in mgcolor for, and how do I use it?

tanlongzhi commented 5 years ago

Without -d, dip-c mgcolor basically treats the two homologs of each chromosome as two unrelated chromosomes. To analyze the relationship between the two homologs (for example, Fig. 3c, Fig. S12, Fig. S16), however, -d must be used.

With -d, you'll notice that the output matrix now has two columns for each cell: the paternal and maternal homologs. Otherwise, everything is the same.

tarak77 commented 5 years ago

Okay I see. I will work on it.