gorpipe / gor

GORpipe is a tool based on a genomic ordered relational architecture. It allows analysis of large sets of genomic and phenotypic tabular data using a declarative query language in a parallel execution engine.
GNU Affero General Public License v3.0

Segspan for contiguous regions #10

Open hemingur opened 4 years ago

hemingur commented 4 years ago

Dear GOR team,

I would like to suggest a new option for the SEGSPAN command to allow for joining intervals where both endpoints are included. For example, if we call this option -c and have the data:

Chr Begin End
chr1 10 19
chr1 20 29

in file foo, then the command

gorpipe "gor foo | segspan -c"

would return:

Chr bpStart bpStop segCount
chr1 10 29 2

Thanks, Gunnar

vidarhr commented 4 years ago

Thank you @hemingur for the suggestion. We will evaluate.

gorfather commented 4 years ago

SEGSPAN does actually combine segments that are adjacent. In the example above, however, the segments are not adjacent, because segment notation in GOR is zero-based (to be compatible with UCSC data). Segment 1 covers bases 11-19 and segment 2 covers bases 21-29, so base 20 is not part of either segment.

From the GOR help (https://docs.gorpipe.org/joiningTables.html?highlight=ucsc): "Segment ranges in GOR are zero-based UCSC style, e.g. (start,stop)=(100,200) denotes a genomic segment including bases 101-200, i.e. of length 100bp."
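To make the convention concrete, here is a minimal Python sketch (not GOR code) that lists which bases a zero-based (start, stop) segment covers, following the definition quoted above. The helper name `covered_bases` is made up for illustration.

```python
def covered_bases(start, stop):
    """Bases (as 1-based positions) covered by a zero-based segment.

    Follows the quoted rule: (start, stop) = (100, 200) covers bases 101-200.
    """
    return range(start + 1, stop + 1)


example = covered_bases(100, 200)  # the example from the GOR help
seg1 = covered_bases(10, 19)       # first row of foo: bases 11..19
seg2 = covered_bases(20, 29)       # second row of foo: bases 21..29

# Bases between the first and last covered base that belong to neither segment.
gap = sorted(set(range(seg1[0], seg2[-1] + 1)) - set(seg1) - set(seg2))

print(len(example))  # 100
print(gap)           # [20]
```

Under this reading the two rows in foo leave base 20 uncovered, which is why they are not treated as adjacent.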

One could ask whether we should support one-based segments. At present we don't, but it might be worth having such an option in a few segment-oriented commands (like SEGSPAN), or a configuration parameter to turn off zero-based interpretation.

hemingur commented 4 years ago

Thanks for looking into this. I was aware that GOR is zero-based, but I was probably not clear enough in my earlier comment: by "both endpoints are included" I meant that SEGSPAN should behave as if the segments were one-based.

gorfather commented 4 years ago

Yes - we should look into having an option to treat segments as one-based.
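For reference, the requested behaviour could look roughly like the Python sketch below. This is only an illustration of the idea, not GOR's implementation: the function `segspan` and its `one_based` flag are hypothetical stand-ins for SEGSPAN and the proposed -c option, and the input is assumed to be sorted by chromosome and start position.

```python
def segspan(segments, one_based=False):
    """Merge overlapping or adjacent segments per chromosome.

    segments:  iterable of (chrom, begin, end), sorted by chrom, then begin.
    one_based: hypothetical flag; if True, begin and end are both included
               (one-based, closed interval) instead of zero-based.
    Returns a list of (chrom, bpStart, bpStop, segCount) tuples.
    """
    merged = []
    for chrom, begin, end in segments:
        # A one-based closed interval [b, e] covers the same bases as the
        # zero-based segment (b - 1, e), so shift the start for comparison.
        start = begin - 1 if one_based else begin
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Touches or overlaps the previous span: extend it.
            merged[-1][2] = max(merged[-1][2], end)
            merged[-1][3] += 1
        else:
            merged.append([chrom, start, end, 1])
    if one_based:
        # Report starts in the same one-based convention as the input.
        for row in merged:
            row[1] += 1
    return [tuple(row) for row in merged]


foo = [("chr1", 10, 19), ("chr1", 20, 29)]
print(segspan(foo))                  # [('chr1', 10, 19, 1), ('chr1', 20, 29, 1)]
print(segspan(foo, one_based=True))  # [('chr1', 10, 29, 2)]
```

With the flag set, the two rows from the original example collapse into the single row requested in the first comment; without it they stay separate because of the one-base gap.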