Open mccalluc opened 7 years ago
Shannan writes:
Hi Chuck, in the first case, these are two regions that have different names but the same coordinates. This happens sometimes when looking at individual transcripts that may have arisen in different experiments or even in the same experiment. Years ago, I used to look at these for evidence of real transcription rather than a spurious event. You're correct about the second case being alternate isoforms/transcripts of a single gene. For visualization, it's fine to collapse to the smallest start and largest end coordinates.
In the first case, I think the rows to be combined will be adjacent, because the whole list is sorted by begin/end addresses.
In the second case, this is not true:
$ cut -f 4 ~/Downloads/refGene.hg19.bed | head
DDX11L1
WASH7P
MIR6859-1
MIR6859-2
MIR6859-3
MIR6859-4
MIR1302-2
MIR1302-9
MIR1302-10
MIR1302-11
$ cut -f 4 ~/Downloads/refGene.hg19.bed | wc -l
61788
$ cut -f 4 ~/Downloads/refGene.hg19.bed | uniq | wc -l
30004
$ cut -f 4 ~/Downloads/refGene.hg19.bed | sort | uniq | wc -l
27109
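Since `uniq` only collapses adjacent repeats, the gap between 30004 and 27109 means some names come back later in the file after other names have intervened. A minimal Python sketch of the same check, in case it helps to see which names are affected (the function names and the toy list are made up for illustration, not part of the repo):

```python
from itertools import groupby

def name_counts(names):
    """Return (total rows, runs of adjacent equal names, distinct names)."""
    names = list(names)
    runs = sum(1 for _ in groupby(names))  # comparable to `uniq | wc -l`
    distinct = len(set(names))             # comparable to `sort | uniq | wc -l`
    return len(names), runs, distinct

def scattered_names(names):
    """Names that start more than one run, i.e. whose rows are not all adjacent."""
    seen, scattered = set(), set()
    for name, _ in groupby(names):
        if name in seen:
            scattered.add(name)
        seen.add(name)
    return scattered

# Toy input: "A" shows up again after "B", so it cannot be merged
# by looking only at neighbouring rows.
example = ["A", "A", "B", "A", "C", "C"]
print(name_counts(example))      # (6, 4, 3)
print(scattered_names(example))  # {'A'}
```

On the real file, `names` would be column 4 of each tab-separated line.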
Looks like there's another case intermediate between the other two:
chr1 323891 328581 LOC100132* 1000 + 328581 328581 . 3 169,58,4143 0,396,547
chr1 323891 328581 LOC100133331 1000 + 328581 328581 . 4 169,58,2500,1546 0,396,547,3144
LOC100133331 happens to have the same start and end, but it has different exons. It's clearly related to the others, but the logic currently demands that all fields except for the name match. Should that be loosened to just checking the start and the end? I'm not sure that it should: The earliest-start and the latest-end depend on the particular exons covered, so I don't think we could rely on it for identifying duplicates like this.
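For reference, the strict rule (every field except the name must match) can be sketched roughly as below. This is an illustration under assumptions, not the code in the repo: rows are assumed to be already split into the 12 BED columns, the combined name uses the " / " separator floated elsewhere in this thread, and the gene names and coordinates are toy values.

```python
def collapse_identical(rows):
    """Merge BED rows that differ only in the name column (index 3).

    rows: list of lists of BED fields (strings).
    Returns new rows in first-seen order, with merged names joined by " / ".
    """
    merged = {}
    for fields in rows:
        key = tuple(fields[:3] + fields[4:])  # everything except the name
        merged.setdefault(key, []).append(fields[3])
    return [
        list(key[:3]) + [" / ".join(names)] + list(key[3:])
        for key, names in merged.items()
    ]

# Toy rows: geneA and geneB agree on every field except the name;
# geneC has the same start and end but different exon blocks, so it stays separate.
rows = [
    ["chr1", "100", "200", "geneA", "1000", "+", "200", "200", ".", "1", "100", "0"],
    ["chr1", "100", "200", "geneB", "1000", "+", "200", "200", ".", "1", "100", "0"],
    ["chr1", "100", "200", "geneC", "1000", "+", "200", "200", ".", "2", "40,50", "0,50"],
]
for row in collapse_identical(rows):
    print("\t".join(row))
```

Loosening the rule to start and end only would amount to shrinking `key` to the first three fields, which would also fold geneC into the merged row and lose its distinct exon structure.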
Another odd case:
chr1 323891 328581 LOC100133331 1000 + 328581 328581 . 4 169,58,2500,1546 0,396,547,3144
chr1 661138 665731 LOC100133331 1000 - 665731 665731 . 3 4046,58,169 0,4139,4424
I had assumed that the direction would be the same on all records that share a name, but that is not the case. What should be the output in a case like this?
For the moment, I'll choose one arbitrarily, but I'm not sure what is best for the long term.
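One alternative to picking arbitrarily, sketched here only as a possibility and not as what the script currently does: key the collapse on chromosome, name, and strand together, so records that share a name but differ in strand stay as separate output rows. The function name and tuple layout are assumptions for illustration, and the third row is a made-up record added to show the merge.

```python
def collapse_by_name(rows):
    """Collapse to smallest start / largest end per (chrom, name, strand).

    rows: iterable of (chrom, start, end, name, strand) tuples.
    Returns one tuple per key, in first-seen order.
    """
    spans = {}
    for chrom, start, end, name, strand in rows:
        key = (chrom, name, strand)
        if key in spans:
            lo, hi = spans[key]
            spans[key] = (min(lo, start), max(hi, end))
        else:
            spans[key] = (start, end)
    return [
        (chrom, lo, hi, name, strand)
        for (chrom, name, strand), (lo, hi) in spans.items()
    ]

rows = [
    ("chr1", 323891, 328581, "LOC100133331", "+"),  # from the example above
    ("chr1", 661138, 665731, "LOC100133331", "-"),  # from the example above
    ("chr1", 324000, 329000, "LOC100133331", "+"),  # hypothetical extra transcript
]
for row in collapse_by_name(rows):
    print(row)
```

This still has a weak spot: two copies on the same strand that sit far apart would be merged into one very long span, so some distance cutoff might be needed as well.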
@gmnelson : The thread here is a little confusing, but the end result is at https://refinery-platform.github.io/get-reference-genomes/: What I need is feedback from someone with a biological background that the approach I've taken to condensing makes sense, and then I'll run it on s3, and point the new igv wrapper at it.
@gmnelson : The examples are all in terms of BED files, which I think is a standard? ... but I wanted to be able to start from any genome on UCSC, and there, instead of having BED files available, they only have a dump from their database with a different column order: refgene-ucsc-to-bed.py makes that translation.
As far as the particular BED file I was looking at, I think it was hg19 from Broad.
Does this help? Grab me in person if this still isn't the info you need.
Right now, when I look at the BED file, there are lines like this:
where the start and end addresses are repeated, and the only things that differ are the names. What are these? Is it ever useful to display them separately, or should they always be collapsed, and given a name like "LOC100132287 / LOC100132062"?
There are also cases like:
which I think are alternate transcriptions. Would it work for the output to look like this?:
... or would it be good to union the exon sets, to preserve some sense of the internal structure?
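To make the union option concrete, here is a rough sketch of merging the exon blocks of several transcripts into one BED-style block list. It assumes all transcripts are on the same chromosome and strand, and that block starts are relative to each record's own start, as in BED; the function name is hypothetical.

```python
def union_exons(transcripts):
    """Union the exon blocks of several transcripts.

    transcripts: iterable of (chrom_start, block_sizes, block_starts),
    with block_starts relative to chrom_start.
    Returns (start, end, block_sizes, block_starts) for the merged record.
    """
    # Convert every block to absolute (begin, end) coordinates and sort.
    intervals = sorted(
        (chrom_start + rel, chrom_start + rel + size)
        for chrom_start, sizes, starts in transcripts
        for size, rel in zip(sizes, starts)
    )
    # Standard sweep: extend the last merged block while the next one
    # overlaps or touches it, otherwise start a new block.
    merged = []
    for begin, end in intervals:
        if merged and begin <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([begin, end])
    start, end = merged[0][0], merged[-1][1]
    sizes = [e - b for b, e in merged]
    rel_starts = [b - start for b, _ in merged]
    return start, end, sizes, rel_starts

# The two chr1:323891-328581 records quoted in this thread.
print(union_exons([
    (323891, [169, 58, 4143], [0, 396, 547]),
    (323891, [169, 58, 2500, 1546], [0, 396, 547, 3144]),
]))
# (323891, 328581, [169, 58, 4143], [0, 396, 547])
```

For that pair the union is the same as the three-exon record, because its last exon covers both of the other record's last two exons. Internal structure only survives a union where the transcripts share gaps.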