iobio / iobio-charts

Handle coverage for sparse data, like exome files #30

Closed anderspitman closed 2 months ago

anderspitman commented 3 months ago

I believe the read depth chart is currently calculating the average coverage over the entire genome. This will yield a very small Y axis, so even if you zoom into an area where you know there's coverage, you can't tell what the value is. We need to figure out a solution to this.

One thing that would help would be to add a popup that shows what the coverage of a specific bar is.

YangQi007 commented 3 months ago

For whole-genome mode, yes, the average coverage is calculated across the entire genome. When the user selects a specific chromosome, the average coverage is recalculated for that chromosome. If the user zooms into a region, the mean coverage is updated dynamically based on the zoomed region.
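
For illustration, recalculating the mean for a zoomed region could look roughly like the sketch below. This is not the actual iobio-charts code: the bin shape `{ start, end, depth }` and the function name `meanCoverageInWindow` are assumptions made for the example.

```javascript
// Hypothetical bin shape: { start, end, depth }, with half-open
// coordinates [start, end). Not the real iobio-charts data model.
function meanCoverageInWindow(bins, windowStart, windowEnd) {
  let weightedSum = 0;
  let basesCounted = 0;
  for (const bin of bins) {
    // Length of the part of this bin that lies inside the visible window.
    const overlap =
      Math.min(bin.end, windowEnd) - Math.max(bin.start, windowStart);
    if (overlap > 0) {
      weightedSum += bin.depth * overlap;
      basesCounted += overlap;
    }
  }
  // Return 0 when no bin overlaps the window, to avoid dividing by zero.
  return basesCounted > 0 ? weightedSum / basesCounted : 0;
}
```

Weighting each bin by its overlap keeps the mean stable when the zoom boundary cuts through the middle of a bin.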

For the scenario you described, do you have a specific chromosome I can look at? The closest case I found is the Y chromosome, which has only 4 bars. When zooming into one of them, the mean coverage is too small to show on the Y-axis.

YangQi007 commented 3 months ago

I want to correct my comment above. When I tried to reproduce the scenario, I found that the mean coverage is not too small to show; it is too big. It exceeds the Y-axis boundary when zooming into a small region of data, e.g. on chromosome Y.

This reminds me that my first solution made the Y-axis change dynamically. That approach would fix the problem of the average coverage label falling outside the Y-axis boundary.

anderspitman commented 3 months ago

Dynamically setting the Y axis in some cases might be a good idea, but it's not going to solve this problem. Here's an example file you can use to see what I'm talking about:

https://iobio.s3.amazonaws.com/samples/bam/NA12878.exome.bam

When I load that file in bam2, it shows the Y-axis from 0x to about 3x. But if you look at the data, you can see there are many spikes going through the top of the chart. These spikes actually represent 30x coverage in those areas. This is caused by having an alignment file that only has data covering the exome, which is where genes are.

What we want to see here is the Y-axis showing 0x to about 60x, so those spikes are about halfway up the chart. One thing that might work would be to throw away any 0 values when calculating the average. I think we should try that first.
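
A minimal sketch of the zero-skipping idea, assuming the depths arrive as a plain array of numbers. The function names and the `headroom` factor of 2 (chosen so that the mean sits about halfway up the chart) are assumptions for illustration, not existing iobio-charts code.

```javascript
// Mean of the non-zero depth values only; returns 0 if there are none.
function meanNonZero(depths) {
  let sum = 0;
  let count = 0;
  for (const depth of depths) {
    if (depth > 0) {
      sum += depth;
      count += 1;
    }
  }
  return count > 0 ? sum / count : 0;
}

// Upper bound for the Y-axis. With headroom = 2, the non-zero mean
// lands roughly in the middle of the chart.
function yAxisMax(depths, headroom = 2) {
  return meanNonZero(depths) * headroom;
}

// Example: 80% of the values are empty, as in an exome file.
const exampleDepths = [0, 0, 0, 0, 0, 0, 0, 0, 30, 30];
console.log(meanNonZero(exampleDepths)); // 30 (the plain mean would be 6)
console.log(yAxisMax(exampleDepths)); // 60
```

One caveat with this approach: positions inside targeted regions that genuinely have zero coverage are discarded too, so the mean is biased upward.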

Another option is to use a BED file to limit the regions considered. This is what bam1 does, but it looks like it currently only recalculates the average. It doesn't actually update the chart to reflect a new Y axis.

anderspitman commented 3 months ago

You can load that sample file into bam.iobio.io and click "GRCh37 Exonic Regions" to see what I mean.

anderspitman commented 3 months ago

@YangQi007 per our Slack discussion, let's plan on addressing this by implementing BED file support, similar to bam1.0

anderspitman commented 3 months ago

@YangQi007 you can expect data in the following format for region selection:

{
  regions: [
    {
      rname: String(), // reference sequence name, aka refName, chromosome name
      start: Number(), // start index, 0-based
      end: Number(), // end index, not inclusive
    },
    {
      rname: String(),
      start: Number(),
      end: Number(),
    },
  ]
}
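
As a rough sketch of how BED text could be turned into this shape (this is not how bam1 or the data broker actually does it, and `parseBed` is a hypothetical helper): BED coordinates are already 0-based with an exclusive end, so the first three columns map directly onto `rname`, `start`, and `end`.

```javascript
// Parse BED text into { regions: [{ rname, start, end }] }.
// Only the first three columns are used; extra columns are ignored.
function parseBed(text) {
  const regions = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Skip blank lines, comments, and "track"/"browser" header lines.
    if (line === '' || line.startsWith('#') || /^(track|browser)\b/.test(line)) {
      continue;
    }
    const [rname, startStr, endStr] = line.split(/\s+/);
    const start = Number.parseInt(startStr, 10);
    const end = Number.parseInt(endStr, 10);
    // Silently drop malformed lines rather than throwing.
    if (!rname || Number.isNaN(start) || Number.isNaN(end)) {
      continue;
    }
    regions.push({ rname, start, end });
  }
  return { regions };
}
```

Dropping malformed lines silently is a simplification; a real implementation would probably want to report them.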

YangQi007 commented 2 months ago

@anderspitman I believe we can only get the region information from the BED file. How do we get the actual read depth at each position from the BAM file, given that the endpoint only gives us the read depth for 16384 base pairs?

anderspitman commented 2 months ago

I think I see what you're saying. The data broker probably needs to give the read depth chart data that's already filtered according to the BED file. I'll look into this.
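
To illustrate the kind of filtering meant here, below is a sketch that keeps only the depth bins overlapping at least one BED region. The bin shape `{ rname, start, end, depth }` and the function name are assumptions for the example; the data broker's real implementation may look quite different.

```javascript
// Keep only the bins that overlap at least one region.
// Both bins and regions use half-open coordinates [start, end).
function filterDepthByRegions(bins, regions) {
  return bins.filter((bin) =>
    regions.some(
      (region) =>
        region.rname === bin.rname &&
        bin.start < region.end &&
        region.start < bin.end
    )
  );
}

// Example with 16384 bp bins on a chromosome named "1".
const exampleBins = [
  { rname: '1', start: 0, end: 16384, depth: 0 },
  { rname: '1', start: 16384, end: 32768, depth: 30 },
  { rname: '1', start: 32768, end: 49152, depth: 0 },
];
const exampleRegions = [{ rname: '1', start: 20000, end: 20100 }];
console.log(filterDepthByRegions(exampleBins, exampleRegions).length); // 1
```

This compares every bin against every region, which is fine for a sketch but slow for a large BED file; sorting the regions or using an interval index would scale better.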

anderspitman commented 2 months ago

I believe this is pretty much working at this point.