Is unique seqnames required for Proteins?

jorainer commented 8 years ago

@sgibb @lgatto: I have a conceptual question: the Proteins object requires unique seqnames and unique names of the pranges. Was there any specific reason for that? I'm asking because GRanges and GRangesList (which could be thought of as somewhat equivalent to the Proteins object, just for RNA/DNA features) don't:

library(GenomicFeatures)
ir <- IRanges(start = c(4, 5, 20), end = c(12, 35, 34))
gr <- GRanges(seqnames = rep(1, 3), ranges = ir)
mcols(gr) <- DataFrame(tx_id = c("tx_1", "tx_2", "tx_1"))
## Setting non-unique names:
names(gr) <- c("a", "b", "a")
gr
GRanges object with 3 ranges and 1 metadata column:
    seqnames    ranges strand |       tx_id
       <Rle> <IRanges>  <Rle> | <character>
  a        1  [ 4, 12]      * |        tx_1
  b        1  [ 5, 35]      * |        tx_2
  a        1  [20, 34]      * |        tx_1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

## And for GRangesList:
grL <- split(gr, gr$tx_id)
names(grL) <- c("tx_3", "tx_3")
names(grL)
[1] "tx_3" "tx_3"

sgibb commented 8 years ago

> I'm asking because GRanges or GRangesList (which could be thought as being somewhat equivalent to the Proteins object, just for RNA/DNA features) don't:

Indeed, our aim is to provide a structure for proteins/peptides similar to what GRanges provides for DNA/RNA.

> I have a conceptual question: the Proteins object requires unique seqnames and names of the pranges. Was there any specific reason for that?

To be honest I didn't know that it is possible to assign the same names to GRanges/GRangesList. IMHO there was no specific reason. I just thought that something like non-unique seqnames could not happen or would be error-prone. @lgatto do you remember any other reason?

jorainer commented 8 years ago

I stumbled across this because I was creating a Proteins object based on all proteins that are defined in Ensembl. Usually, the transcript-to-protein mapping in Ensembl is 1:1, but not for LRG (Locus Reference Genomic) records, where multiple different transcripts seem to be associated with the same protein ID.

Another problem is the n:m mapping between UniProt IDs and Ensembl protein IDs - these will also produce duplicated names. In the end I think it would be more flexible to allow non-unique names - subsetting by name will be tricky though.
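To make the n:m case concrete, here is a minimal base R sketch. The identifiers are invented for the example (they are not real UniProt or Ensembl accessions), and the table only stands in for whatever mapping is actually used:

```r
## Hypothetical mapping table; all identifiers are invented for illustration.
map <- data.frame(
    uniprot = c("P1", "P1", "P2", "P3"),
    ensembl_protein = c("ENSP_A", "ENSP_B", "ENSP_B", "ENSP_C"),
    stringsAsFactors = FALSE
)

## Naming entries by either ID column gives duplicated names.
dup_uniprot <- unique(map$uniprot[duplicated(map$uniprot)])
dup_ensembl <- unique(map$ensembl_protein[duplicated(map$ensembl_protein)])

## make.unique() would force uniqueness, at the cost of altering the IDs.
forced <- make.unique(map$uniprot)
```

Forcing uniqueness with something like make.unique() changes the identifiers themselves, which is one argument for allowing duplicates instead.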

jorainer commented 7 years ago

Just picking this up again since I started hacking a little in my Pbase fork again - @lgatto what's your opinion on that? Are unique seqnames for Proteins objects really required? If possible I would like to drop that requirement.

lgatto commented 7 years ago

Unique names are sometimes a good restriction, but due to mapping ambiguities, I agree that it might be a good thing to change.

How would things work if I have an object with proteins A, B, A and C, and I subset by name with x["A"]? Do I get items 1 and 3?

jorainer commented 7 years ago

No, subsetting by name will only return the first occurrence:

> gr["a"]
GRanges object with 1 range and 1 metadata column:
    seqnames    ranges strand |       tx_id
       <Rle> <IRanges>  <Rle> | <character>
  a        1   [4, 12]      * |        tx_1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

To get all you have to use:

> gr[names(gr) == "a"]
GRanges object with 2 ranges and 1 metadata column:
    seqnames    ranges strand |       tx_id
       <Rle> <IRanges>  <Rle> | <character>
  a        1  [ 4, 12]      * |        tx_1
  a        1  [20, 34]      * |        tx_1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

It's somewhat similar to a matrix, which also allows redundant rownames and returns only the first match when subset by name.
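The matrix behaviour can be checked in base R with a small sketch (the object and dimnames are made up for the example):

```r
## Base R matrix with a repeated rowname.
m <- matrix(1:6, nrow = 3, dimnames = list(c("a", "b", "a"), c("x", "y")))

## Character subsetting matches only the first "a" row.
first <- m["a", ]

## A logical index on the rownames returns every matching row.
all_a <- m[rownames(m) == "a", , drop = FALSE]
```

Here `first` holds just the first "a" row, while `all_a` keeps both, mirroring the `gr["a"]` versus `gr[names(gr) == "a"]` difference above.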

lgatto commented 7 years ago

I find that a bit ugly. I suppose one could get around that with a filterBySeqname function and keep the default behaviour. (Although I don't like having filterBy functions replacing default operators like [, [[, ...)
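A rough sketch of what such a helper could look like, written in base R against a plain named list. This is only an illustration: the function body and the `prots` stand-in are assumptions, not existing Pbase code, and a real method would have to be defined for the Proteins class.

```r
## Hypothetical helper, not part of Pbase: keep every element whose name
## matches one of `seqnames`, instead of only the first match.
filterBySeqname <- function(x, seqnames) {
    stopifnot(!is.null(names(x)))
    x[names(x) %in% seqnames]
}

## Stand-in for a Proteins object: a named list with a repeated name.
prots <- list(A = "MKV", B = "MTT", A = "MKL", C = "MAA")

hits <- filterBySeqname(prots, "A")   # both "A" entries
default <- prots["A"]                 # default `[`: first match only
```

With this, `x["A"]` keeps its default first-match behaviour, and returning items 1 and 3 is an explicit opt-in.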

sgibb commented 7 years ago

sry, I will keep this open for the filterBySeqname method.

lgatto / Pbase

Is unique seqnames required for Proteins? #28