scverse / genomic-features

Genomic Features in Python from BioConductor's AnnotationHub
https://genomic-features.readthedocs.io
BSD 3-Clause "New" or "Revised" License
18 stars 5 forks source link

What order should results be in? #62

Open ivirshup opened 3 months ago

ivirshup commented 3 months ago

Description of feature

Continuing from https://github.com/scverse/genomic-features/pull/59#issuecomment-2034149437

In the mentioned PR, we found out that duckdb is inconsistent with what order it returns results in. This by and large seems fine from a duckdb point of view, but would be frustrating for users of this package.

To solve this @thomas-reimonn added an order_by statement here

Right now the order returned is based on the order of columns in the user input (I believe). Is there a better/ more canonical way to do this? Off the top of my head I would assume we'd generally want to sort by chrom and start.

We should also check what the bioc packages do here.

@nvictus, @emdann, @lauradmartens any thoughts?

ivirshup commented 3 months ago
emdann commented 3 months ago

Definitely default to genomic location (chr+start), option to use another column?

thomas-reimonn commented 3 months ago

Some of these tables don't have loci information. E.g. ibis.common.exceptions.IbisTypeError: Column 'chrom' is not found in table. Existing columns: 'gene_id', 'gene_name', 'gene_biotype', 'gene_seq_start', 'gene_seq_end', 'seq_name', 'seq_strand', 'seq_coord_system', 'description', 'gene_id_version', 'canonical_transcript'.

ivirshup commented 3 months ago

@thomas-reimonn Ah, I think chrom should be seq_name

@emdann, for something like exons, where "gene_id" or "transcript_id" has been selected, so we still sort by chromosome and start? Or should we sort by the gene/ transcript start?

And then, do we consider strand for sorting exons within a transcript?

ivirshup commented 3 months ago

It looks like tx2exon has exon_idx which I believe is the order of the exon inside the transcript

lauradmartens commented 3 months ago

I also agree by default it should be chrom + start. It seems like this is also what GenomicFeatures does and they have a separate function that returns ranges sorted by another value I think: e.g. transcriptsBy(txdb, "gene") and then the exons are sorted how they would appear in the transcript: In the manual on transcriptsBy/exonsBy:

These functions return a GRangesList object where the ranges within each of the elements are ordered according to the following rule: When using exonsBy or cdsBy with by="tx", the returned exons or CDS parts are ordered by ascending rank for each transcript, that is, by their position in the transcript. In all other cases, the transcriptsBy ranges will be ordered by chromosome, strand, start, and end values.

ivirshup commented 3 months ago

I think the bioc packages go by chrom + start for the "primary" thing being queried (e.g. genes for .genes). But it looks like there is some strand specific behaviour in both if you grab exons within a gene query. Here's where I got to:

Demo ## Setup ```r library(ensembldb) library(EnsDb.Hsapiens.v86) edb = EnsDb.Hsapiens.v86 library(GenomicFeatures) samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(samplefile) ``` ## `ensembldb` getting genes but also returning exons. ### Only `-` strand: ```r > head(genes( edb, columns=c("exon_seq_start", "exon_seq_end", "exon_id"), filter=SeqStrandFilter("-") )) GRanges object with 6 ranges and 4 metadata columns: seqnames ranges strand | exon_seq_start exon_seq_end exon_id gene_id | ENSG00000227232 1 14404-29570 - | 29534 29570 ENSE00001890219 ENSG00000227232 ENSG00000227232 1 14404-29570 - | 24738 24891 ENSE00003507205 ENSG00000227232 ENSG00000227232 1 14404-29570 - | 18268 18366 ENSE00003477500 ENSG00000227232 ENSG00000227232 1 14404-29570 - | 17915 18061 ENSE00003565697 ENSG00000227232 ENSG00000227232 1 14404-29570 - | 17606 17742 ENSE00003475637 ENSG00000227232 ENSG00000227232 1 14404-29570 - | 17233 17368 ENSE00003502542 ENSG00000227232 ------- seqinfo: 308 sequences (1 circular) from GRCh38 genome ``` ### `+ strand` ```r > head(genes( edb, columns=c("exon_seq_start", "exon_seq_end", "exon_id"), filter=SeqStrandFilter("+") )) GRanges object with 6 ranges and 4 metadata columns: seqnames ranges strand | exon_seq_start exon_seq_end exon_id gene_id | ENSG00000223972 1 11869-14409 + | 11869 12227 ENSE00002234944 ENSG00000223972 ENSG00000223972 1 11869-14409 + | 12613 12721 ENSE00003582793 ENSG00000223972 ENSG00000223972 1 11869-14409 + | 13221 14409 ENSE00002312635 ENSG00000223972 ENSG00000223972 1 11869-14409 + | 12010 12057 ENSE00001948541 ENSG00000223972 ENSG00000223972 1 11869-14409 + | 12179 12227 ENSE00001671638 ENSG00000223972 ENSG00000223972 1 11869-14409 + | 12613 12697 ENSE00001758273 ENSG00000223972 ------- seqinfo: 335 sequences (1 circular) from GRCh38 genome ``` ## `GenomicFeatures` ```r > head(genes( txdb, columns=c("exon_start", "exon_end", "exon_id"), filter=list(tx_strand=c("-")) )) 1 gene was dropped because it has exons located on both strands of the same reference sequence or on more than one reference sequence, so cannot be represented by a single genomic range. Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList object, or use suppressMessages() to suppress this message. GRanges object with 6 ranges and 3 metadata columns: seqnames ranges strand | exon_start exon_end exon_id | 1 chr19 58858172-58874214 - | 58864770,58864658,58864294,... 58864865,58864693,58864563,... 461,460,459,... 10186 chr13 39917029-40177356 - | 40177020,40174969,39952565,... 40177356,40175527,39952663,... 347,346,345,... 131669 chr3 126200008-126236616 - | 126236437,126229507,126228428,... 126236616,126229637,126228521,... 170,169,168,... 139716 chrX 153903527-153979858 - | 153979229,153944301,153941482,... 153979348,153944604,153941698,... 564,563,561,... 201725 chr4 159587827-159593407 - | 159592768,159587827,159593149 159593202,159590920,159593407,... 180,179,181,... 221150 chr13 21727734-21750741 - | 21750514,21746759,21746478,... 21750741,21746820,21746643,... 343,342,341,... ------- seqinfo: 93 sequences (1 circular) from hg19 genome ```