Open jorainer opened 8 years ago
I'm asking because GRanges or GRangesList (which could be thought of as being somewhat equivalent to the Proteins object, just for RNA/DNA features) don't:
Indeed, our aim is to provide a similar structure for Proteins/Peptides as GRanges provides for DNA/RNA.
I have a conceptual question: the Proteins object requires unique seqnames and names of the pranges. Was there any specific reason for that?
To be honest, I didn't know that it is possible to assign the same names to GRanges/GRangesList. IMHO there was no specific reason. I just thought that something like non-unique seqnames could not happen or would be error-prone. @lgatto do you remember any other reason?
I stumbled across this because I was creating a Proteins object based on all proteins that are defined in Ensembl. Usually, the transcript-to-protein mapping in Ensembl is 1:1, but not for LRG (Locus Reference Genomic) records, where multiple different transcripts seem to be associated with the same protein ID.
Another problem is the n:m mapping between UniProt IDs and Ensembl protein IDs, which will also lead to non-unique names. In the end I think it would be more flexible to allow non-unique names, although subsetting by name will be tricky.
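To make the n:m problem concrete, here is a minimal base R sketch. The mapping table and its IDs are hypothetical placeholders, not real UniProt or Ensembl accessions; the point is only that naming entries by either column produces duplicated names.

```r
# Hypothetical n:m mapping table; the IDs are placeholders, not real accessions.
map <- data.frame(
  uniprot = c("U1", "U1", "U2", "U3"),
  ensembl = c("P1", "P2", "P2", "P3"),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns the index of the first duplicate, or 0 if none,
# so a value above 0 means the column cannot serve as a unique name.
dup_uniprot <- anyDuplicated(map$uniprot) > 0
dup_ensembl <- anyDuplicated(map$ensembl) > 0

# Entries per ID; counts above 1 mark the ambiguous IDs.
per_uniprot <- table(map$uniprot)
```

With a uniqueness requirement, either column would be rejected as names for such a table.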
Just picking this up again since I started hacking a little in my Pbase fork again. @lgatto what's your opinion on that? Are unique seqnames for the Proteins object really required? If possible I would like to drop that requirement.
Unique names are sometimes a good restriction, but due to mapping ambiguities, I agree that it might be a good thing to change.
How would things work if I have an object with proteins A, B, A and C, and I subset by name with x["A"]? Do I get items 1 and 3?
No, subsetting by name will only return the first occurrence:
> gr["a"]
GRanges object with 1 range and 1 metadata column:
    seqnames    ranges strand |       tx_id
       <Rle> <IRanges>  <Rle> | <character>
  a        1   [4, 12]      * |        tx_1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
To get all you have to use:
> gr[names(gr) == "a"]
GRanges object with 2 ranges and 1 metadata column:
    seqnames    ranges strand |       tx_id
       <Rle> <IRanges>  <Rle> | <character>
  a        1  [ 4, 12]      * |        tx_1
  a        1  [20, 34]      * |        tx_1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
It's somewhat similar to matrix, which also allows redundant rownames and returns only the first match if subsetted by name.
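The matrix comparison can be checked in base R. A minimal sketch, with made-up row names and values chosen only for illustration:

```r
# Toy matrix with a duplicated row name ("A" appears in rows 1 and 3).
m <- matrix(1:8, nrow = 4,
            dimnames = list(c("A", "B", "A", "C"), c("x", "y")))

# Character subsetting matches only the first row named "A".
first_only <- m["A", , drop = FALSE]

# A logical comparison on the row names returns every matching row.
all_matches <- m[rownames(m) == "A", , drop = FALSE]

nrow(first_only)   # 1
nrow(all_matches)  # 2
```

The same pattern, names(x) == "A", is what the GRanges example above uses to retrieve all matches.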
I find that a bit ugly. I suppose one could get around that with a filterBySeqname function and keep the default behaviour. (Although I don't like having filterBy functions replacing default operators like [, [[, ...)
Sorry, I will keep this open for the filterBySeqname method.
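One way such a filterBySeqname helper could look is sketched below. This is an assumption for illustration, not the Pbase implementation: the function body, its arguments, and the toy list standing in for a Proteins object are all hypothetical. It relies only on names() and logical subsetting, so `[` keeps its default first-match behaviour.

```r
# Hypothetical helper: keep every element whose name is in `name`,
# instead of only the first match that x["A"] would return.
filterBySeqname <- function(x, name) {
  x[names(x) %in% name]
}

# Toy stand-in for a Proteins object: a named list with a duplicated name.
p <- list(A = "seq1", B = "seq2", A = "seq3", C = "seq4")

n_default  <- length(p["A"])                   # 1: first match only
n_filtered <- length(filterBySeqname(p, "A"))  # 2: items 1 and 3
```

Using %in% rather than == also allows filtering by several names at once, e.g. filterBySeqname(p, c("A", "C")).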
@sgibb @lgatto : I have a conceptual question: the Proteins object requires unique seqnames and names of the pranges. Was there any specific reason for that? I'm asking because GRanges or GRangesList (which could be thought of as being somewhat equivalent to the Proteins object, just for RNA/DNA features) don't: