Some general questions...

tarak77 commented 6 years ago

Sorry for going back and forth between theories but I had some general questions:

The statistical property explained for inter-chromosomal and long-range intra-chromosomal contacts is definitely convincing, but can we show the same via a plot of contact probability vs. genomic distance? I know that for intra-chromosomal contacts we should expect a slope of -1 (the fractal globule case); for inter-chromosomal contacts, should we expect a slope of -2? How would one find the contact probability in this case?
Coming back to https://github.com/tanlongzhi/dip-c/issues/4#issuecomment-420307166 , you remove short-range contacts with unknown haplotypes on both legs. Wouldn't the "either" or "both" strategy help produce better 3D structures?
In Dip-C, imputation is done in 2D and then in 3D. In the supplementary material you mentioned:

Imputation would occur if the winning haplotype tuple (the shortest 3D distance) yielded a 3D distance ≤ 20 particle radii and ≤ 0.5 times the second shortest 3D distance.

I am a bit confused about the second shortest 3D distance. Could you explain the 3D imputation in general?

--> Thinking along the same lines, are there any limitations in the current methods that you think might be improved on? In the age of deep learning, can deep learning help current 3D/4D genome modeling and analysis even further?

Thanks again, I really appreciate it!

tanlongzhi commented 6 years ago

Thank you. What do you mean by "show the same"? This statistical property is conditional (on a given contact) in nature, while the traditional contact probability vs genomic distance plot is unconditional. I'm not sure how one can define a "genomic distance" from an interchromosomal contact, say between chr1:10Mb and chr5:20Mb.
That'll be a nice direction to explore. I tried a few times implementing either but couldn't get it to work robustly. I feel both may be overly confident, and therefore haven't explored it too much.
Sorry about the confusion. Here is an example: Suppose one detected a contact between the paternal allele of chr1 and an unknown allele of chr2. To figure out which allele of chr2, one can calculate the 3D distance between the paternal allele of chr1 and each of the two alleles of chr2. Say the 3D distance is 2.7 between the paternal chr1 and the paternal chr2, and 13.1 between the paternal chr1 and the maternal chr2. Since 2.7 (the shortest) is much shorter than 13.1 (the second shortest), we can impute that the given contact is between the paternal chr1 and the paternal chr2. Does this make sense?
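
The rule in this example can be sketched in a few lines of Python. This is only a minimal illustration of the two thresholds quoted from the supplementary material, not the Dip-C implementation; the function name, the dictionary layout, and the assumption that the example distances are in particle radii are all hypothetical.

```python
from typing import Dict, Hashable, Optional


def impute_by_3d_distance(
    distances: Dict[Hashable, float],
    max_distance: float = 20.0,  # "<= 20 particle radii" from the quoted rule
    max_ratio: float = 0.5,      # "<= 0.5 times the second shortest 3D distance"
) -> Optional[Hashable]:
    """Return the winning haplotype tuple, or None if the call is ambiguous.

    `distances` maps each candidate haplotype tuple to its 3D distance
    (assumed here to be expressed in particle radii).
    """
    if len(distances) < 2:
        # Assumption: with fewer than two candidates there is nothing to compare.
        return None
    ranked = sorted(distances.items(), key=lambda item: item[1])
    (winner, shortest), (_, second_shortest) = ranked[0], ranked[1]
    if shortest <= max_distance and shortest <= max_ratio * second_shortest:
        return winner
    return None


# The example from the comment: paternal chr1 vs. the two alleles of chr2.
candidates = {
    ("chr1_pat", "chr2_pat"): 2.7,
    ("chr1_pat", "chr2_mat"): 13.1,
}
print(impute_by_3d_distance(candidates))
```

Here the paternal-paternal tuple wins because 2.7 ≤ 20 and 2.7 ≤ 0.5 × 13.1 = 6.55. If the two distances were 2.7 and 4.0 instead, the sketch would return `None` and the contact would stay unimputed.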

Regarding the future, there're a few ways in which the chemistry might be improved. We and other labs are working on some.

On the computational front, I believe there's a lot of room for improvement:

For example, a large amount of SNP-dense mouse data (so far, the Fraser and Tanay labs published a few thousand cells) will provide ground truth haplotype info to train a better imputation algorithm. I think it's an exciting opportunity, although right now it's hard to say how much data must be learned given the extreme stochasticity of 3D genomes.
It might also be possible to devise an algorithm to impute haplotypes without any SNPs, which will be incredibly useful for cancers and mouse genetics.
The 3D modeling may also be tuned to capture more realistic chromatin shapes.

tarak77 commented 6 years ago

Oh, my bad. If we look at the Varoquaux model for generating Hi-C data from a 3D structure, the ij-th interaction count is linked to the ij-th pairwise distance via the probability model:

C_ij ~ Poisson(b ||x_i - x_j||^a)

Here we take a=-3 and b>0. The a value comes from the literature:

The contact count is inversely proportional to the genomic distance (c ~ s^−1), whereas the volume scales linearly with the subchain length (d^3 ~ s), from which we deduce a relationship between d and c of the form (d ~ c^−1/3).

I was wondering whether, for inter-chromosomal Hi-C data generation, we should use the same a, or instead consider:

The contact count is inversely proportional to the genomic distance (c ~ s^−2), whereas the volume scales linearly with the subchain length (d^3 ~ s), from which we deduce a relationship between d and c of the form (d ~ c^−1/6).
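
For reference, the algebra behind both exponents can be written out in general form. This is only a worked substitution under the two scaling assumptions quoted above (the symbol α is introduced here for illustration); it presupposes that a genomic separation s is defined between the two loci, and it says nothing about whether the α = 2 case actually holds for inter-chromosomal contacts.

```latex
\begin{aligned}
c \propto s^{-\alpha}, \quad d^{3} \propto s
  &\;\Longrightarrow\; s \propto d^{3} \\
  &\;\Longrightarrow\; c \propto d^{-3\alpha} \\
  &\;\Longrightarrow\; d \propto c^{-1/(3\alpha)} \\[4pt]
\alpha = 1: &\quad c \propto d^{-3}, \quad d \propto c^{-1/3} \\
\alpha = 2: &\quad c \propto d^{-6}, \quad d \propto c^{-1/6}
\end{aligned}
```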

Okay
Yes, it makes sense now. Thanks!

An algorithm for imputation without SNP data? How will one go about that? Any reference I could use would be awesome!

tanlongzhi commented 6 years ago

Again, how does one define ||x_i - x_j|| for interchromosomal contacts, for example between chr1:10Mb and chr5:20Mb?

I mentioned this paper to you in a different thread. The authors didn't use SNP data but managed to get structures for one chromosome. I imagine a similar algorithm might work for the whole genome.

tarak77 commented 6 years ago

I could be wrong, but in general, let's say we have genome-wide 3D coordinates from a haploid cell. We could compute the distance matrix using Euclidean distances, and then cut out the block of inter-chromosomal distances between chr1 and chr5. To reconstruct the contacts from this block, could we use the Varoquaux model with the exponent a set to -6 instead of -3?
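
To make the proposal concrete, here is a minimal NumPy sketch of the steps described: pairwise Euclidean distances, the chr1 × chr5 block, and Poisson counts with rate b·d^a. It is plain NumPy rather than the pastis API, and the toy coordinates, the chromosome labels, the value of b, and all function names are assumptions made up for illustration. Whether a = -6 is appropriate for inter-chromosomal pairs is exactly the open question, so both exponents are shown side by side.

```python
import numpy as np


def pairwise_distances(coords: np.ndarray) -> np.ndarray:
    """Euclidean distance matrix for an (n_beads, 3) coordinate array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def expected_rate(dist: np.ndarray, a: float, b: float) -> np.ndarray:
    """Poisson rate b * d^a; assumes every distance passed in is > 0."""
    return b * dist ** a


# Toy haploid structure (hypothetical): three beads per chromosome,
# chr1 along the x axis at z = 0 and chr5 along the x axis at z = 2.
coords = np.array(
    [[0, 0, 0], [1, 0, 0], [2, 0, 0], [0, 0, 2], [1, 0, 2], [2, 0, 2]],
    dtype=float,
)
chrom = np.array(["chr1"] * 3 + ["chr5"] * 3)

dist = pairwise_distances(coords)
# Cut out the inter-chromosomal block: rows are chr1 beads, columns chr5 beads.
inter_dist = dist[np.ix_(chrom == "chr1", chrom == "chr5")]

rng = np.random.default_rng(0)
b = 80.0  # arbitrary scale factor, b > 0
counts_a3 = rng.poisson(expected_rate(inter_dist, a=-3.0, b=b))
counts_a6 = rng.poisson(expected_rate(inter_dist, a=-6.0, b=b))
print(inter_dist)
print(counts_a3)
print(counts_a6)
```

Because the block contains no bead paired with itself, no zero distances arise here; a full genome-wide matrix would need the diagonal masked before raising distances to a negative power.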

tanlongzhi commented 6 years ago

You're talking about Euclidean distances (in nanometers) while those models are using genomic distances (in basepairs).

tarak77 commented 6 years ago

Sorry about the confusion. In the Varoquaux model ( https://github.com/hiclib/pastis/blob/master/examples/plot_generate_data.py ), Euclidean distances are raised to the power -3 to reconstruct the genome-wide contact matrix. I was wondering whether the exponent should be different for inter-chromosomal matrices, for example -6 (roughly based on the logic above)?

lh3 commented 6 years ago

Neither hickit nor nuc_dynamics uses a Poisson model. If you like a Poisson model, try another tool. There are plenty of them.

tanlongzhi commented 6 years ago

I see what you meant. You want to infer 3D distance from the number of contacts (from either bulk or single-cell Hi-C) between two particles. This power law is assumed to be -1/3 in this line of code of nuc_dynamics, regardless of intra or inter.
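
As a generic illustration of such a power law (not the nuc_dynamics code itself; the function name and the absence of any scale factor are assumptions), the conversion from a contact count to a relative target distance could look like this:

```python
def count_to_target_distance(count: float, exponent: float = -1.0 / 3.0) -> float:
    """Map a contact count to a relative target distance via d ~ count^exponent.

    The same exponent is applied to intra- and inter-chromosomal pairs,
    mirroring the "regardless of intra or inter" behaviour described above.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    return count ** exponent


for n in (1, 8, 27):
    print(n, count_to_target_distance(n))
```

More contacts give a shorter target distance: a pair with 8 contacts is placed at roughly half the distance of a pair with a single contact.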

I don't have a strong opinion on this matter, or power laws in general. There have been several experimental studies on the relationship between 3D distance and the number of intrachromosomal contacts in bulk Hi-C: Wang et al. 2016 and Fudenberg & Imakaev 2017, to name a few. The relationship between 3D distance and the number of interchromosomal contacts in bulk Hi-C (which would be a good research project), or the number of any contacts in single-cell Hi-C (which of course would require imaging and doing Hi-C on the same cell), remains an open problem.

tarak77 commented 6 years ago

Right, I understand now.

I was looking at Fig. 2 of the paper by Lieberman-Aiden et al. (2009), which shows the existence of chromosome territories (2A) and the inter-chromosomal contacts (2B), and at the corresponding supplementary text:

Presence of Chromosome Territories. The total number of possible interactions at a given genomic distance was computed explicitly for each chromosome and compared to the actual number of interactions at that distance. (The possible number of pairs of genomic positions separated by d on a given chromosome is L_c - d, where L_c is the length of the chromosome.) To obtain the interchromosomal averages, the number of observed interactions between loci on a pair of chromosomes was divided by the number of possible interactions between the two chromosomes (the product of the number of loci on each chromosome). When multiple chromosome pairings were being averaged, such as in the computation of I_n(s), the numerators and denominators were summed independently. The genome wide average, I(s), is therefore the result of dividing the total number of interactions at a distance s by the number of possible interactions at distance s summed over all chromosomes.
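
One way to read the intrachromosomal part of this paragraph is sketched below. It assumes loci are equal-size bins indexed from 0, so a chromosome of length L bins has L - s possible pairs at separation s; the function name and the input layout are hypothetical, and this is an interpretation of the quoted text rather than the authors' code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def genome_wide_contact_probability(
    contacts: Iterable[Tuple[str, int, int]],
    chrom_lengths: Dict[str, int],
) -> Dict[int, float]:
    """I(s): observed contacts at separation s divided by possible pairs at s.

    `contacts` holds intrachromosomal contacts as (chromosome, bin_i, bin_j).
    Numerators and denominators are each summed over all chromosomes
    before dividing, as the quoted text describes.
    """
    observed = Counter(abs(i - j) for _, i, j in contacts)
    result = {}
    for s in range(1, max(chrom_lengths.values())):
        # A chromosome of length L contributes L - s possible pairs at separation s.
        possible = sum(length - s for length in chrom_lengths.values() if length > s)
        if possible > 0:
            result[s] = observed[s] / possible
    return result


toy_lengths = {"chrA": 4, "chrB": 3}
toy_contacts = [("chrA", 0, 1), ("chrA", 1, 2), ("chrB", 0, 1), ("chrA", 0, 3)]
print(genome_wide_contact_probability(toy_contacts, toy_lengths))
```

In the toy data, separation 1 has 3 observed contacts out of (4 - 1) + (3 - 1) = 5 possible pairs, giving I(1) = 0.6.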

Proximity of Chromosome Territories. The expected number of interchromosomal interactions for each chromosome pair i,j was computed by multiplying the fraction of interchromosomal reads containing i with the fraction of interchromosomal reads containing j and multiplying by the total number of interchromosomal reads. The enrichment was computed by taking the actual number of interactions observed between i and j and dividing it by the expected value.
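
The observed/expected computation in this paragraph can be sketched as follows. The input is assumed to be a table of interchromosomal read counts keyed by unordered chromosome pair; the function name and the toy numbers are made up for illustration, and the code is a literal reading of the quoted text rather than the authors' implementation.

```python
from collections import Counter
from typing import Dict, Tuple

Pair = Tuple[str, str]


def interchromosomal_enrichment(counts: Dict[Pair, int]) -> Dict[Pair, float]:
    """Observed / expected interchromosomal reads for each chromosome pair.

    expected(i, j) = (fraction of interchromosomal reads containing i)
                   * (fraction of interchromosomal reads containing j)
                   * (total number of interchromosomal reads)
    """
    total = sum(counts.values())
    reads_containing = Counter()
    for (chrom_i, chrom_j), n in counts.items():
        reads_containing[chrom_i] += n
        reads_containing[chrom_j] += n
    enrichment = {}
    for (chrom_i, chrom_j), n in counts.items():
        frac_i = reads_containing[chrom_i] / total
        frac_j = reads_containing[chrom_j] / total
        enrichment[(chrom_i, chrom_j)] = n / (frac_i * frac_j * total)
    return enrichment


toy_counts = {("chrA", "chrB"): 10, ("chrA", "chrC"): 30, ("chrB", "chrC"): 60}
for pair, value in interchromosomal_enrichment(toy_counts).items():
    print(pair, round(value, 3))
```

In the toy table, chrA appears in 40 of 100 reads and chrB in 70, so the expected chrA-chrB count is 0.4 × 0.7 × 100 = 28 and the enrichment is 10 / 28 ≈ 0.357.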

I don't quite follow the inter-chromosomal contact probability computation.

With regard to single cells, Fig. 5C from https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005292 shows that the single-chromosome model has a more complicated packing than the fractal globule. I wonder how they generated that plot?

Thanks again!

tanlongzhi commented 6 years ago

Glad to hear.

Your two additional questions may be best answered by the authors of the two papers, respectively.

tarak77 commented 6 years ago

Yes I will do that, thanks!
