higlass / clodius

Clodius is a tool for breaking up large data sets into smaller tiles that can subsequently be displayed using an appropriate viewer.
MIT License

More accurate generation of tiles that span across chromosomes #125

Closed alexander-veit closed 4 years ago

alexander-veit commented 4 years ago

Description

What was changed in this pull request?

This PR addresses a pixel shift issue that can be observed here: http://54.88.112.8:8170/l/?d=feRVfVIHS8iFojiyh1j9KQ In the example above, the entire Chr 2 has state 2 and the entire Chr 3 has state 1. However, the tile between chr3:40 and chr3:240 has state 2. It should have state 1.

The PR modifies the tile generation as follows:

Current behavior:

Screen Shot 2020-03-13 at 11 50 05 AM

Proposed behavior:

Screen Shot 2020-03-13 at 11 52 42 AM

Why is it necessary?

Improves accuracy of tile generation across chromosomes

alexpreynolds commented 4 years ago

I am testing today and will report back how this affects alignment.

alexpreynolds commented 4 years ago

Unfortunately, this patch did not seem to work on my end.

I ran a version of clodius with this patch on a sample dataset, copied the test dataset onto the higlass-server instance and flushed the cache, and there still appear to be offset issues at chromosomes past chromosome 1.

Chromosome 1 is a good example of how the 200-base bins should align. In this example link, you can see that the 200-base bins align along 200-base units to the chromosome track:

http://higlass.io/l/?d=KABGmS9FS8SVKsvTGrZrxw

Downstream of chr1, at the border with chr2, is where the offset issue begins. From the start position of chr2, 200-base bins no longer line up with 200-base units:

http://higlass.io/l/?d=BCIDdiE4TXiVxarrw11bFw

Chromosome 3 and those further downstream are affected with different amounts of offset error.

I suspect that discarding one tile that spans a chromosome just brings the next tile upstream, so that the next tile spans the true start of a chromosome and thus would not usually align correctly.

Taking the boundary between chr1 and chr2 as an example, whether the overlapping tile comes from chr1 ("bad case") or chr2 ("good case"), the "good case" tile that spans the start of chr2 does not appear to align to the beginning of chr2. Here is a link to visually demonstrate this, showing how the first chr2 bin starts inside chr1:

http://higlass.io/l/?d=ObUb_bagTlKQcmdZM_ycww

(For this last link, I have added a bar border around epilogos track bar components, to make it easier to discern the start and end of a bin at the start of a chromosome. Most of the data around the edges of a chromosome is low information and can be difficult to read.)

Perhaps the issue is not with tiling in clodius, but with how the tiles are presented in higlass. One idea I proposed on Slack was to add data to the multivec container representing this overlap per chromosome. On the visualization side, the tiles/sprites would then be translated by some number of pixels, determined by this overlap value and by which chromosome or chromosomes are in view. Is there anything that other groups using epilogos tracks may have done to solve this?
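To make the idea concrete, here is a minimal sketch of what such a per-chromosome overlap value and the resulting pixel translation could look like. Everything in it is hypothetical: the function names, the toy coordinates, and the assumption that the overlap is simply the chromosome's absolute start modulo the bin size. It is not clodius or higlass code, and the direction in which a sprite would need to move is left open.

```python
def boundary_overlap(chrom_abs_start, bin_size):
    """Base pairs of the straddling grid bin that lie before the chromosome start.

    chrom_abs_start is the chromosome's start in concatenated-genome
    coordinates (hypothetical input, not read from a multivec file).
    """
    return chrom_abs_start % bin_size


def bp_to_pixels(n_bp, view_start_bp, view_end_bp, view_width_px):
    """Convert a length in base pairs to pixels for a linear view."""
    return n_bp * view_width_px / (view_end_bp - view_start_bp)


# Toy numbers: a chromosome starting at absolute position 1050, 200 bp bins,
# and a 1000 px wide view showing absolute positions 0..2000.
overlap_bp = boundary_overlap(1050, 200)
shift_px = bp_to_pixels(overlap_bp, 0, 2000, 1000)
print(overlap_bp, shift_px)
```

With these toy numbers the overlap is 50 bp, which corresponds to 25 px at that zoom level. The shift scales with zoom, so it would have to be recomputed whenever the view changes.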

alexpreynolds commented 4 years ago

Thinking on this further, this could be an issue for any generic multivec track with data organized in units greater than one base in width (not just epilogos tracks).

alexander-veit commented 4 years ago

Thank you for the comments. I will have a closer look tomorrow, but I just wanted to mention that the purpose of this PR is not to get a perfect alignment beyond chr 1. It won't do that. The purpose was to get rid of offset errors that are larger than 200 bp. Can you confirm that the errors across chromosome boundaries are always within that range?

alexpreynolds commented 4 years ago

I made a second test track with a more clearly colored first bin for each chromosome. This appears to show that the bin is actually positioned after the start of the chromosome, not before. For example:

http://higlass.io/l/?d=eT7F8Kh7R7mCQaCN4KOyNQ

My result doesn't agree with what your figure suggests the patch should do. On re-reading the code in the patch and the methods being changed, I'm wondering whether this patches tile generation (i.e., creating multivec files) or tile retrieval (i.e., modifying the result of a query to a higlass-server instance).

alexander-veit commented 4 years ago

Your example is consistent with what I am proposing and with how it currently works. The first bin of a chromosome will always be at or after the start of the chromosome in the visual result. This PR just makes sure that it is not more than 200 bp after the start of the chromosome (see my example in the first post). If you look at the pictures above you can see that the first visible green box is always after the start of chromosome 3. That's also the case without this patch.

The offset errors are due to the fact that tiles are not aware of chromosome boundaries. They are just the regular (in this case 200 bp) subdivisions of a long interval that spans all chromosomes. The bins for each chromosome however are always starting at position 0 of the corresponding chromosome. That means that the bins will line up perfectly with the tile boundaries in the first chromosome but not with the second one since the chromosome boundary does not coincide with a tile boundary.
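The arithmetic behind this can be sketched in a few lines. The chromosome lengths below are toy values chosen for readability, not real assembly sizes, and the helper names are made up for illustration. The sketch computes, for each chromosome, how far its start lies from the next boundary of the global 200 bp grid.

```python
BIN_SIZE = 200  # bp; the resolution used in this thread

# Toy chromosome lengths (hypothetical, not a real assembly).
CHROM_LENGTHS = {"chr1": 1050, "chr2": 730, "chr3": 400}


def chrom_starts(lengths):
    """Absolute start of each chromosome in the concatenated genome."""
    starts = {}
    pos = 0
    for name, length in lengths.items():
        starts[name] = pos
        pos += length
    return starts


def grid_offset(abs_start, bin_size=BIN_SIZE):
    """Distance from abs_start to the next global grid boundary (0 if on one)."""
    return (-abs_start) % bin_size


starts = chrom_starts(CHROM_LENGTHS)
offsets = {name: grid_offset(start) for name, start in starts.items()}
for name in CHROM_LENGTHS:
    print(name, starts[name], offsets[name])
```

In this toy genome chr1 starts on the grid, so its offset is 0, while chr2 and chr3 start 150 bp and 20 bp before the next grid boundary. The offset is always smaller than the bin size, which matches the bound of at most 200 bp described above, and it differs from chromosome to chromosome because it depends on the summed lengths of all preceding chromosomes.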

Please note that the generation of the multivec files is not affected by this PR. It does change how we generate the tiles that are requested by higlass-server.