tanjimin / C.Origami

C.Origami, a prediction and screening framework for cell type-specific 3D chromatin structure.
66 stars 9 forks source link

How to makr trans interaction contacts prediction? #40

Closed linfanxiao closed 6 months ago

linfanxiao commented 6 months ago

Dear developers,

Thanks for your work to develop the state-of-art tool.I have tried the example and my own data for intra chromosomal contacts, looking OK. However, how to predict intra chromosomal like you paper Fig. 4. If there are some example or tutorial to make the inter chromosomal contacts?

Thanks ! https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41587-022-01612-8/MediaObjects/41587_2022_1612_Fig4_HTML.png?as=webp

tanjimin commented 6 months ago

Hi I think those are examples of translocation event which mostly happen in cancer settings. If you don't know then it might not be what you want. C.Origami currently don't have the capacity to predict interactions beyond 2Mb regions.

If you do want to predict results of translocation, you will extract sequence and ATAC/CTCF bigwig tracks and append the two junction points from different chromosomes. Then you can run prediction on the fused inputs.

linfanxiao commented 6 months ago

Thanks, Jimin, I will try it as your suggestion!

linfanxiao commented 6 months ago

Hi, Jimin,

I have a question regarding the comparison of real and predicted HiC data. To resize the image, I used the preserve_range=True parameter. Is this the correct approach? The reason I used it is that the values in the resized image were very small, around 2e-18, whereas my original data ranged from 1 to 50. After setting preserve_range=True, the resized image now has a similar range to the original data.

This is my code below

cooler dump cell0041GLUT1.mcool::/resolutions/10000 --fill-lower -t pixels -r chr2:500,000-2,597,152 -b --no-balance --join --one-based-starts| awk '{print $2-1, $5-1, $7}' > test/cell0041GLUT1.txt

Rscript -e 'library(reshape2); df <- data.table::fread("test/cell0041GLUT1.txt"); df1 <- dcast(df, V1 ~ V2, value.var = "V3", fill = 0);df1=df1[,-1] ;write.table(df1, "test/cell0041GLUT1_wide.txt", sep = "\t",col.names=F,row.names = F, quote = F)'

The next part is processed in Python,plot_utils.py is the developer's script


import numpy as np
import os
import pandas as pd
from skimage.transform import resize
from plot_utils import MatrixPlot

# Open the text file in read mode
file_path = 'cell0041GLUT1_wide.txt'  # Replace with the actual file path
with open(file_path, 'r') as file:
    # Read the contents of the file
    lines = file.readlines()

# Remove any leading or trailing whitespace from each line
lines = [line.strip() for line in lines]

# Convert the lines to a nested list of integers
matrix = [[int(num) for num in line.split()] for line in lines]

# Convert the nested list to a numpy array
mat = np.array(matrix)

# Print the contents of the file
print(mat)
# Resize the matrix to the desired resolution
resized_mat = resize(mat, (256, 256), anti_aliasing=True,preserve_range=True)
print(resized_mat)
bin_size = 8192

# Generate interaction lines and write to file
with open('realcell0041GLUT1_chr2_500000-2597152.txt', 'w') as f:
    for i in range(256):
        for j in range(i, 256):
            start_coord = (i) * bin_size + 500000
            reads = resized_mat[i, j]
            end_coord = (j + 1) * bin_size + 500000
            interaction_line = f"{start_coord} {end_coord} {reads:.8f}"
            f.write(interaction_line + '\n')

# Instantiate MatrixPlot object
image=mat = np.log(resized_mat + 1)
output_path = "./"
prefix = "matrix_plot"
celltype = "realDGLUT1"
chr_name = "chr2"
start_pos = 500000

matrix_plot = MatrixPlot(output_path, image, prefix, celltype, chr_name, start_pos)

# Plot and save the matrix
matrix_plot.plot()
tanjimin commented 6 months ago

Hi @linfanxiao yes you should use preserve_range because skimage is usually used for image processing which is uint8 (0-255). Here these are float tensors.

Also when you compare real and predicted Hi-C data keep in mind that the intensity of predicted matrix is always in the log space (range around 0-10). To get the read counts (range 0 - around 2000), you need to do the exp(x) - 1 transformation.

linfanxiao commented 6 months ago

Thanks, Jimin!