What is the right way to collapse a BED file?

mccalluc commented 7 years ago

Right now, when I look at the BED file, there are lines like this:

chr1    323891  328581  LOC100132287    1000    +       328581  328581  .       3       169,58,4143     0,396,547
chr1    323891  328581  LOC100132062    1000    +       328581  328581  .       3       169,58,4143     0,396,547

where the start and end addresses are repeated, and the only thing that differs are the names. What are these? Is it ever useful to display them separately, or should they always be collapsed, and given a name like "LOC100132287 / LOC100132062"

There are also cases like:

chr1    762970  778984  LINC01128       1000    +       778984  778984  .       3       185,102,2405    0,1412,13609
chr1    762970  794826  LINC01128       1000    +       794826  794826  .       6       185,102,153,184,96,6056 0,1412,20063,24336,25080,25800
chr1    762970  794826  LINC01128       1000    +       794826  794826  .       5       185,102,184,96,6056     0,1412,24336,25080,25800
chr1    762970  794826  LINC01128       1000    +       794826  794826  .       4       185,102,184,6056        0,1412,24336,25800
chr1    762970  794826  LINC01128       1000    +       794826  794826  .       4       185,102,96,6056 0,1412,25080,25800
chr1    763177  794826  LINC01128       1000    +       794826  794826  .       5       52,102,184,96,6056      0,1205,24129,24873,25593

which I think are alternate transcriptions. Would it work for the output to look like this?:

chr1    762970  794826  LINC01128       1000    +       794826  794826  .       1       1856      0

... or would it be good to union the exon sets, to preserve some sense of the internal structure?

mccalluc commented 7 years ago

Shannan writes:

Hi Chuck, in the first case, these are two regions that have different names but the same coordinates. This happens sometimes when looking at individual transcripts that may have arisen in different experiments or even in the same experiment. Years ago, I used to look at these for evidence of real transcription rather than a spurious event. You're correct about the second case being alternate isoforms/transcripts of a single gene. For visualization, it's fine to collapse to the smallest start and largest end coordinates

mccalluc commented 7 years ago

In the first case, I think the rows to be combined will be adjacent, because the whole list is sorted by begin/end addresses.

In the second case, this is not true:

$ cut -f 4 ~/Downloads/refGene.hg19.bed | head
DDX11L1
WASH7P
MIR6859-1
MIR6859-2
MIR6859-3
MIR6859-4
MIR1302-2
MIR1302-9
MIR1302-10
MIR1302-11
$ cut -f 4 ~/Downloads/refGene.hg19.bed | wc -l
   61788
$ cut -f 4 ~/Downloads/refGene.hg19.bed | uniq | wc -l
   30004
$ cut -f 4 ~/Downloads/refGene.hg19.bed | sort | uniq | wc -l
   27109

mccalluc commented 7 years ago

Looks like there's another case intermediate between the other two:

chr1    323891  328581  LOC100132*      1000    +       328581  328581  .       3       169,58,4143     0,396,547
chr1    323891  328581  LOC100133331    1000    +       328581  328581  .       4       169,58,2500,1546        0,396,547,3144

LOC100133331 happens to have the same start and end, but it has different exons. It's clearly related to the others, but the logic currently demands that all fields except for the name match. Should that be loosened to just checking the start and the end? I'm not sure that it should: The earliest-start and the latest-end depend on the particular exons covered, so I don't think we could rely on it for identifying duplicates like this.

mccalluc commented 7 years ago

Another odd case:

chr1    323891  328581  LOC100133331    1000    +   328581  328581  .   4   169,58,2500,1546    0,396,547,3144
chr1    661138  665731  LOC100133331    1000    -   665731  665731  .   3   4046,58,169 0,4139,4424

I had assumed that the direction would be the same on all records that share a name, but that is not the case. What should be the output in a case like this?

For the moment, I'll choose one arbitrarily, but not sure what is best for the long term.

mccalluc commented 7 years ago

@gmnelson : The thread here is a little confusing, but the end result is at https://refinery-platform.github.io/get-reference-genomes/: What I need is feedback from someone with a biological background that the approach I've taken to condensing makes sense, and then I'll run it on s3, and point the new igv wrapper at it.

mccalluc commented 7 years ago

@gmnelson : The examples are all in terms of BED files, which I think is a standard? ... but I wanted to be able to be able to start from any genome on UCSC, and there, instead of havinging BED files available, they only have a dump from their database with a different column order: refgene-ucsc-to-bed.py makes that translation.

As far as the particular BED file I was looking at, I think it was hg19 from Broad.

Does this help? Grab me in person if this still isn't the info you need.

refinery-platform / get-reference-genomes

What is the right way to collapse a BED file? #9