hoffmangroup / genomedata

The Genomedata format for storing large-scale functional genomics data.
https://genomedata.hoffmanlab.org/
GNU General Public License v2.0

Full BED File Support #11

Closed EricR86 closed 10 years ago

EricR86 commented 10 years ago

Original report (archived issue) by Coby Viner (Bitbucket: cviner2, GitHub: cviner).


This is marked as an enhancement due to the following line in the documentation: "BED3+1 format is interpreted the same ways as bedGraph, except that the track definition line is not required." Still, it might be nice if this behavior were more visibly documented in the interim.

BED files must currently contain a fourth column which must be a floating point number. If either no fourth column is provided or one is provided that is non-numeric, the following error occurs:

#!text
unexpected non-newline character after bedGraph dataValue

Examples of error-inducing BED files:

#!text
chrY    23838831    23838832
chrY    28266706    28266707
chrY    28266740    28266741

#!text
chrY    23838831    23838832    K562_Rep4_RRBS
chrY    28266706    28266707    K562_Rep3_RRBS
chrY    28266740    28266741    K562_Rep3_RRBS

Example of valid BED file:

#!text
chrY    23838831    23838832    4
chrY    28266706    28266707    3
chrY    28266740    28266741    3

It would be ideal if Genomedata fully supported the UCSC BED Specification (by simply ignoring extraneous columns). Otherwise, it would be nice to allow 3-column BED files and BED3+1 files where the 4th column can be a string.
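The lenient parsing requested above might look roughly like the following Python sketch. It is not Genomedata's loader, and `parse_bed_line` is a hypothetical helper introduced only for illustration. It assumes whitespace-separated fields, keeps the three required columns, treats an optional fourth column as a free-form string, and ignores any columns after it.

```python
def parse_bed_line(line):
    """Hypothetical helper: return (chrom, start, end, name) from one BED line.

    name is None for a 3-column line; columns after the fourth are ignored.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"BED line needs at least 3 columns: {line!r}")
    chrom = fields[0]
    start, end = int(fields[1]), int(fields[2])
    # The fourth column is kept as a string here; it may or may not be numeric.
    name = fields[3] if len(fields) > 3 else None
    return chrom, start, end, name
```

Under these assumptions, both of the error-inducing examples above would parse, as would a full BED line with score and strand columns.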

EricR86 commented 10 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


What is the desired behavior when there is no valid number in the fourth column?

EricR86 commented 10 years ago

Original comment by Coby Viner (Bitbucket: cviner2, GitHub: cviner).


Ideally, no numeric value would be set for such entries. This would be equivalent to saying that there is some X (of unspecified quantity or score) at the given position. Failing that, a constant value (1, for example) could be used for all such entries, or they could all be set to NaN.
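As a rough illustration of the fallback options, here is a small Python sketch. `fourth_column_value` is a hypothetical helper, not part of Genomedata, and its `default` parameter is an assumption standing in for whichever placeholder (NaN or a constant such as 1) is chosen.

```python
import math


def fourth_column_value(field, default=math.nan):
    """Hypothetical helper: float value of a BED fourth column, or default.

    field is the fourth column as a string, or None when the line has
    only three columns.
    """
    if field is None:
        return default
    try:
        return float(field)
    except ValueError:
        # Non-numeric entries (e.g. a sample name) fall back to the placeholder.
        return default
```

Calling `fourth_column_value(field, default=1.0)` would give the "value of 1" variant instead of NaN.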

Alternatively, chromosome.name could be overridden to return the corresponding strings (i.e. the entries' names), or the empty string (or strings) for a 3-column BED. chromosome.name would then continue to return the name of the chromosome itself when called on the chromosome (i.e. scalar context), but would return a list of names when invoked on any chromosomal interval (i.e. list context). This list would contain the names of all BED entries within the given interval. For BED3+1 with valid numbers in the fourth column, chromosome[s,e].name could either return the empty string or return the numbers in the fourth column (i.e. the same values as chromosome[s,e]). While the latter would be superfluous, it would most closely mimic the concept of the fourth column in BED files.
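To make the "list context" idea concrete, here is a standalone Python sketch of the lookup. It does not use or extend the Genomedata API; `names_in_interval` is a hypothetical function, and treating "within the given interval" as "overlapping it" under BED's zero-based, half-open coordinates is an assumption.

```python
def names_in_interval(entries, start, end):
    """Hypothetical helper: names of BED entries overlapping [start, end).

    entries is an iterable of (start, end, name) tuples for one chromosome;
    name is None for entries that came from a 3-column BED line.
    """
    # Entries without a name contribute the empty string, as proposed above.
    return [name or "" for s, e, name in entries if s < end and e > start]
```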

EricR86 commented 10 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


This would require big changes to the current design concept that I'm not prepared for.

EricR86 commented 10 years ago

Original comment by Coby Viner (Bitbucket: cviner2, GitHub: cviner).


Fair enough.