Closed tarak77 closed 5 years ago
Hi @tarak77, Hope all is well for you too!
Rank normalization was done with tiedrank (MATLAB). Normally, tiedrank produces values from 1 (lowest) to N (highest); I rescaled the output so that the values run from 0 (lowest) to 1 (highest).
Clustering uses dip-c mgcolor. I'll explain them in more detail later.
The main idea behind the PCA plot is to average CpG frequencies in 3D space, which is achieved with dip-c color -c color/hg19.cpg.20k.txt -s3 (-s3 sets the 3D distance threshold to 3.0 particle radii).
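To make the idea of averaging in 3D space concrete, here is a minimal Python sketch. It is not dip-c's actual code. It assumes each particle has (x, y, z) coordinates in units of particle radii and one CpG value, that the threshold is a plain Euclidean distance, and that a particle counts as its own neighbor. The function name average_in_3d is hypothetical.

```python
import math

def average_in_3d(coords, values, max_dist=3.0):
    """For each particle, average the values of all particles within max_dist.

    coords: list of (x, y, z) tuples, one per particle (assumed unit: particle radii).
    values: list of per-particle values, e.g. the CpG frequency of each 20-kb bin.
    A particle is treated as its own neighbor (an assumption of this sketch).
    """
    averaged = []
    for p in coords:
        total = 0.0
        count = 0
        for q, v in zip(coords, values):
            if math.dist(p, q) <= max_dist:
                total += v
                count += 1
        averaged.append(total / count)
    return averaged

# Toy input: two particles close together, one far away.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
cpg = [0.2, 0.4, 0.9]
smoothed = average_in_3d(coords, cpg)
```

The two nearby particles end up with the mean of their values, while the isolated particle keeps its own value. The double loop is quadratic in the number of particles, so it is only meant for illustration.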
The sentence you referred to meant confirming the PCA with colors produced with dip-c color2 (from raw contact files).
Thanks @tanlongzhi !
Sounds good. Really interested in seeing how to use dip-c mgcolor to cluster my data.
Heya,
How do I use the -m option of mgcolor? I do have some particles missing from some files.
Hey @tarak77, below are some instructions on data clustering:
For each sample, calculate the raw (un-normalized) chromatin compartment value along the diploid genome:
dip-c color -c color/hg19.cpg.20k.txt -s3 sample.3dg > sample.cpg_s3.color
Merge compartment values from all samples into a matrix:
dip-c mgcolor *.cpg_s3.color > cpg_s3.colors
The output file is now ready for downstream analysis. Each row is a genomic locus (e.g. a 20-kb bin), while each column is a sample. The file contains both row headers (two columns: homolog name, genomic coordinate) and column headers (one row: sample file name).
By default, any missing data (a locus present in one sample but not in another) is represented by -1.0; this value can be changed, for example to -2.0 with -m -2.0.
For most of my PCA analysis, I simply remove any rows with missing data.
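Removing rows with missing data could look like the following Python sketch. It assumes the merged matrix has already been parsed into a list of row headers and a list of numeric rows (the parsing itself is left out because the exact delimiter is not stated here). The function name drop_rows_with_missing and the toy values, including the homolog names, are placeholders.

```python
def drop_rows_with_missing(loci, matrix, missing=-1.0):
    """Keep only loci (rows) for which every sample has a value.

    loci:    list of row headers, e.g. (homolog name, genomic coordinate) tuples.
    matrix:  list of rows; each row holds one value per sample.
    missing: sentinel marking missing data (-1.0 by default, or the value given to -m).
    """
    kept_loci, kept_rows = [], []
    for locus, row in zip(loci, matrix):
        if all(value != missing for value in row):
            kept_loci.append(locus)
            kept_rows.append(row)
    return kept_loci, kept_rows

# Toy matrix: three loci, two samples; the second locus is missing in sample 2.
loci = [("1(pat)", 0), ("1(pat)", 20000), ("1(mat)", 0)]
matrix = [[0.31, 0.28], [0.35, -1.0], [0.30, 0.33]]
kept_loci, kept_rows = drop_rows_with_missing(loci, matrix)
```

If a different sentinel was chosen with -m, pass the same value as missing.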
I will then rank-normalize each sample, for example with tiedrank (MATLAB) on each column.
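For readers without MATLAB, the rank normalization of one column can be sketched in plain Python. Ties share their average rank, as tiedrank does. The linear mapping (rank - 1) / (N - 1) used to rescale to the 0 to 1 range is an assumption of this sketch, since the exact rescaling formula is not given in this thread. Both function names are hypothetical.

```python
def tied_rank(values):
    """Rank values from 1 (lowest) to N (highest); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks

def rank_normalize(values):
    """Rescale tied ranks linearly so that rank 1 maps to 0 and rank N maps to 1."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    return [(r - 1) / (n - 1) for r in tied_rank(values)]

# One sample (one column of the matrix) with a tie.
column = [0.30, 0.10, 0.20, 0.20, 0.50]
normalized = rank_normalize(column)
```

The two tied values receive the same normalized rank, and the column's lowest and highest values map to 0 and 1. Applying rank_normalize to every column gives the rank-normalized matrix.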
The matrix is now ready for PCA, for example with pca (MATLAB).
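As a rough illustration of what the PCA step computes, the Python sketch below finds only the first principal component by power iteration on the covariance matrix. It is a swapped-in toy method, not what MATLAB's pca does internally. Each input row is treated as one observation and each column as one variable, which is MATLAB's pca convention; whether cells or loci were used as observations is not stated in this thread, so transpose the matrix as needed. The function name is hypothetical.

```python
import math

def first_principal_component(data, iterations=200):
    """Return (direction, scores) of the first principal component.

    data: list of observations, each a list of feature values.
    Uses power iteration on the covariance matrix, which is only
    practical for small inputs such as this toy example.
    """
    n = len(data)
    d = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Fixed, normalized starting vector; it must not be orthogonal to the answer.
    vec = [1.0 / (j + 1) for j in range(d)]
    start_norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / start_norm for x in vec]
    for _ in range(iterations):
        new = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            break  # the data has no variance
        vec = [x / norm for x in new]
    # Project each centered observation onto the component direction.
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, scores

# Toy data lying exactly on a line: all variance is along the diagonal.
data = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
direction, scores = first_principal_component(data)
```

For real matrices with many loci, a library routine based on SVD is the practical choice; this sketch only shows what the first component and its scores represent.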
Cool, what's the use of the -d option in mgcolor? How do I use it?
Without -d, dip-c mgcolor basically treats the two homologs of each chromosome as two unrelated chromosomes. To analyze the relationship between the two homologs (for example, Fig. 3c, Fig. S12, Fig. S16), however, -d must be used.
With -d, you'll notice that the output matrix now has two columns for each cell: the paternal and maternal homologs. Otherwise, everything is the same.
Okay I see. I will work on it.
Hi @tanlongzhi , Hope you are doing good!
From the notes, what does it mean to rank normalise compartments from 0 to 1? Would it also be possible to explain how to obtain the plots for Fig. 4B,E from the paper?
From the computed cpg.txt files, dip-c color -c basically interpolates the CG frequencies with the genomic location coordinates in each cell. I don't understand how to do cell typing based on the resulting .cif files. In the paper you mentioned [...]
I might be overthinking it, but I suppose you confirmed your PCA clustering based on genomic structure with the plots in Fig. S17 obtained from contacts?
Any help again will be really great!