vaquerizaslab / chess

Comparison of Hi-C Experiments using Structural Similarity.
Other
26 stars 6 forks source link

Add coordinate conversion helpers #43

Closed cmdoret closed 3 years ago

cmdoret commented 3 years ago

Hello,

I struggled with converting the feature coordinates from CHESS coordinates to genomic coordinates, and it seems other people had the same problem.

In this pull request, I add two helper functions to help users perform this conversion:

Currently, I put these functions in chess.get_structures, but I'm not sure is this is appropriate. I tried to document the code as much as possible.

cmdoret commented 3 years ago

The goal here is to address concerns raised in #10

liz-is commented 3 years ago

Hi Cyril,

Thanks for looking into this and for the pull request! These sort of helper functions would definitely be useful and we're happy to have others contribute. A couple of questions/suggestions:

cmdoret commented 3 years ago

Hi @liz-is

Thanks for your answer, indeed, I missed the zooming part, and that makes the helper function wrong.

I updated my helper function to have a zoom factor added after rotating coordinates. The rotate_feature function rotates a single (x,y) coordinates relative to a window center. It works as follows:

  1. subtract window (i.e. rotation) center from x and y
  2. use a rotation matrix to rotate the coordinates (angle set to -45, to revert CHESS rotation)
  3. zoom the resulting coordinates (default to 1/0.7 to revert CHESS zoom)
  4. add the center back

The other function, get_feature_coords simply relies on the first one to convert an entire feature (4 coords) and performs the conversion to basepairs.

I used a pair of symmetric features reported by CHESS in real world data to test rotate_feature.

# Coordinates as reported in lost_features.tsv
xmin1, xmax1, ymin1, ymax1 = 187, 192, 88, 95
xmin2, xmax2, ymin2, ymax2 = 187,192,106,113
# Format coords to a matplotlib friendly format
coords_chess = [
    [xmin1, xmin1, xmax1, xmax1, xmin2, xmin2, xmax2, xmax2],
    [ymin1, ymax1, ymin1, ymax1, ymin2, ymax2, ymin2, ymax2],
]

# Rotate coords using my helper function
coords_rotated = list(map(
    lambda c: rotate_feature(c[0], c[1], demo_win.shape[0]),
    zip(*coords_chess)
))
coords_rotated = [
    [c[0] for c in coords_rotated],
    [c[1] for c in coords_rotated],
]

# Region of the matrix from CHESS window
clr = cooler.Cooler('mysample.cool')
demo_win = clr.matrix(sparse=False, balance=True).fetch("chr14:52000001-62000001")
demo_win = np.nan_to_num(demo_win)
# Rotate + zoom matrix using CHESS method
zm1 = clipped_zoom(demo_win, 0.7)
rot1 = ndi.rotate(zm1, 45, reshape=False)

# Visualize original and rotated coordinates with their matrix
fig, ax = plt.subplots(1, 2)
ax[0].imshow(demo_win, vmin=demo_feat.min(), vmax=demo_feat.max())
ax[0].scatter(coords_rotated[0], coords_rotated[1], c='r')
ax[1].imshow(rot1, vmin=demo_feat.min(), vmax=demo_feat.max())
ax[1].scatter(coords_chess[0], coords_chess[1], c='r')
ax[0].set_title('Original window\nManually rotated coords')
ax[1].set_title('CHESS-rotated window\nCHESS coords')

image

If I zoom to see the coordinates better: image

Now if I plot the values of the first features (additional columns of lost_features):

# demo_feat = np.array([many values])
# demo_feat = demo_feat.reshape(xmax1 - xmin1, ymax1 - ymin1)
plt.imshow(demo_feat)

image

The feature image looks similar to the area circled by red points, not exactly the same but I imagine it may have undergone some processing steps. The details of the image seem to be distorted a bit by the rotation also.

These functions would be handy for downstream analysis of regions from CHESS features, but ultimately I think you're right it would definitely be much better if there was no rotation / zoom at all. Thanks for working on that ! :)

liz-is commented 3 years ago

Hi Cyril, sorry for the slow reply and many thanks for the detailed example! Now I understand much better. Part of my confusion was that I was expecting it to return two sets of coordinates for each feature (i.e. the coordinates of the edges of the bounding box), rather than four (the coordinates of each corner of the bounding box). But either is fine, as long as it's clear what it does.

This looks good to me, and I think with this thread as an example/explanation others would be able to use it. However, the original devs should agree before I can merge this in -- @kaukrise @nickmachnik @sgalan what do you think?

nickmachnik commented 3 years ago

This looks good to me, thanks for submitting this @cmdoret ! I think it would be good if @sgalan took a look and approved this, Ill contact her.