Open ivirshup opened 3 months ago
order.by argument
Definitely default to genomic location (chr+start), with an option to use another column?
Some of these tables don't have loci information. E.g. ibis.common.exceptions.IbisTypeError: Column 'chrom' is not found in table. Existing columns: 'gene_id', 'gene_name', 'gene_biotype', 'gene_seq_start', 'gene_seq_end', 'seq_name', 'seq_strand', 'seq_coord_system', 'description', 'gene_id_version', 'canonical_transcript'.
@thomas-reimonn Ah, I think chrom should be seq_name.
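One way to avoid an error like the one above is to build the sort keys only from columns that the table actually has. Here is a minimal sketch in plain Python. The helper name default_order_columns and its preferred argument are hypothetical and not part of genomic-features; the column names are the ones listed in the error message.

```python
def default_order_columns(existing, preferred=("seq_name", "gene_seq_start")):
    """Return the preferred sort columns that are present in the table.

    Hypothetical helper: `preferred` is an assumed default, not the
    package's actual configuration.
    """
    present = set(existing)
    return [col for col in preferred if col in present]


# Columns copied from the IbisTypeError message in this thread.
gene_columns = [
    "gene_id", "gene_name", "gene_biotype", "gene_seq_start", "gene_seq_end",
    "seq_name", "seq_strand", "seq_coord_system", "description",
    "gene_id_version", "canonical_transcript",
]

order_cols = default_order_columns(gene_columns)
# Asking for a column that is absent ("chrom") yields no sort key instead of an error.
missing = default_order_columns(gene_columns, preferred=("chrom",))
```

An empty result would signal that the table has no location columns to sort on, so the caller could fall back to another key.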
@emdann, for something like exons, where "gene_id" or "transcript_id" has been selected, do we still sort by chromosome and start? Or should we sort by the gene/transcript start?
And then, do we consider strand for sorting exons within a transcript?
It looks like tx2exon has exon_idx, which I believe is the order of the exon inside the transcript.
I also agree that by default it should be chrom + start. It seems like this is also what GenomicFeatures does, and I think they have a separate function that returns ranges sorted by another value, e.g. transcriptsBy(txdb, "gene"), and then the exons are sorted how they would appear in the transcript.
In the manual on transcriptsBy/exonsBy:
These functions return a GRangesList object where the ranges within each of the elements are ordered according to the following rule: When using exonsBy or cdsBy with by="tx", the returned exons or CDS parts are ordered by ascending rank for each transcript, that is, by their position in the transcript. In all other cases, the ranges will be ordered by chromosome, strand, start, and end values.
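The rule quoted above could be sketched in plain Python roughly as follows. This is only an illustration: the function name order_ranges, the by_tx flag, and the row layout (dicts with tx_id, exon_idx, seq_name, strand, start, end) are assumptions made for the example, and encoding strand as "+"/"-" strings so that "+" sorts first is also an assumption.

```python
from operator import itemgetter


def order_ranges(rows, by_tx=False):
    """Order exon-like rows following the rule quoted from the manual.

    Hypothetical sketch: when grouping by transcript, order by rank
    (exon_idx) within each transcript; otherwise order by chromosome,
    strand, start, and end.
    """
    if by_tx:
        return sorted(rows, key=itemgetter("tx_id", "exon_idx"))
    return sorted(rows, key=itemgetter("seq_name", "strand", "start", "end"))


# Made-up rows: tx1 is on the minus strand, so its first exon has the larger start.
exons = [
    {"tx_id": "tx1", "exon_idx": 2, "seq_name": "1", "strand": "-", "start": 100, "end": 200},
    {"tx_id": "tx1", "exon_idx": 1, "seq_name": "1", "strand": "-", "start": 500, "end": 600},
    {"tx_id": "tx2", "exon_idx": 1, "seq_name": "1", "strand": "+", "start": 300, "end": 400},
]

starts_by_tx = [row["start"] for row in order_ranges(exons, by_tx=True)]
starts_by_locus = [row["start"] for row in order_ranges(exons)]
```

With these rows the two orderings differ for the minus-strand transcript, which is the case the strand question above is about.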
I think the bioc packages go by chrom + start for the "primary" thing being queried (e.g. genes for .genes). But it looks like there is some strand-specific behaviour in both if you grab exons within a gene query. Here's where I got to:
Description of feature
Continuing from https://github.com/scverse/genomic-features/pull/59#issuecomment-2034149437
In the mentioned PR, we found out that duckdb is inconsistent with what order it returns results in. This by and large seems fine from a duckdb point of view, but would be frustrating for users of this package.
To solve this @thomas-reimonn added an order_by statement here. Right now the order returned is based on the order of columns in the user input (I believe). Is there a better/more canonical way to do this? Off the top of my head I would assume we'd generally want to sort by chrom and start. We should also check what the bioc packages do here.
@nvictus, @emdann, @lauradmartens any thoughts?
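To illustrate the idea of sorting by location, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for duckdb. The table and its rows are made up for the example, and gene_id is added as a final tie-breaker (an assumption, not something decided in this thread) so that the order is fully determined.

```python
import sqlite3

# In-memory database standing in for the annotation database.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE gene (gene_id TEXT, seq_name TEXT, gene_seq_start INTEGER)"
)
con.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [("g3", "2", 50), ("g1", "1", 900), ("g2", "1", 100)],
)

# Without ORDER BY, SQL makes no promise about row order.
# Sorting by chromosome and start, with gene_id as a tie-breaker, pins it down.
rows = con.execute(
    "SELECT gene_id FROM gene ORDER BY seq_name, gene_seq_start, gene_id"
).fetchall()
ordered_ids = [row[0] for row in rows]
con.close()
```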