lawremi / rtracklayer

R interface to genome annotation files and the UCSC genome browser

import() does not handle multiple tag attributes #54

Open pcantalupo opened 3 years ago

pcantalupo commented 3 years ago

Hello, I'm trying to import the following very simple one-line GTF file (taken from the Gencode v38 GTF) that has multiple tag attributes:

$ cat multipletag.gtf 
chr6    HAVANA  transcript  10723070    10731127    .   +   .   gene_id "ENSG00000111843.14"; transcript_id "ENST00000229563.6"; gene_type "protein_coding"; gene_name "TMEM14C"; transcript_type "protein_coding"; transcript_name "TMEM14C-201"; level 2; protein_id "ENSP00000229563.5"; transcript_support_level "1"; hgnc_id "HGNC:20952"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS4514.1"; havana_gene "OTTHUMG00000014242.2"; havana_transcript "OTTHUMT00000039829.2";

When I import it into R, only the last of the five tag attributes, CCDS, is kept:

> library(rtracklayer)
> gtf = import("~/tmp/multipletag.gtf")
> gtf
GRanges object with 1 range and 18 metadata columns:
      seqnames            ranges strand |   source       type     score     phase
         <Rle>         <IRanges>  <Rle> | <factor>   <factor> <numeric> <integer>
  [1]     chr6 10723070-10731127      + |   HAVANA transcript        NA      <NA>
                 gene_id     transcript_id      gene_type   gene_name transcript_type
             <character>       <character>    <character> <character>     <character>
  [1] ENSG00000111843.14 ENST00000229563.6 protein_coding     TMEM14C  protein_coding
      transcript_name       level        protein_id transcript_support_level
          <character> <character>       <character>              <character>
  [1]     TMEM14C-201           2 ENSP00000229563.5                        1
          hgnc_id         tag      ccdsid          havana_gene    havana_transcript
      <character> <character> <character>          <character>          <character>
  [1]  HGNC:20952        CCDS  CCDS4514.1 OTTHUMG00000014242.2 OTTHUMT00000039829.2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> 

How do I get rtracklayer to preserve all tag attributes for each GTF line?

sigven commented 2 years ago

Hi, I'm experiencing the same issue (see my related issue). I used read_gtf() from valr, which in turn depends on functionality from rtracklayer.
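One possible workaround, offered as a rough sketch and not as anything the rtracklayer maintainers have confirmed in this thread: parse the attribute column yourself in base R and keep every value of a repeated key. The helper name `parse_gtf_attributes()` is hypothetical and not part of rtracklayer or valr. It also assumes that no attribute value contains a `;` inside its quotes.

```r
# Hypothetical helper (NOT part of rtracklayer): turn one GTF attribute
# string into a named list, keeping all values of a repeated key such as "tag".
# Assumption: attribute values never contain ";" inside the quotes.
parse_gtf_attributes <- function(attr_string) {
  # Split into 'key value' fields, trim whitespace, drop empty fields
  fields <- trimws(strsplit(attr_string, ";", fixed = TRUE)[[1]])
  fields <- fields[nzchar(fields)]

  # Key: everything before the first whitespace character
  keys <- sub("[[:space:]].*$", "", fields)

  # Value: everything after the key, with the double quotes removed
  values <- sub("^[^[:space:]]+[[:space:]]+", "", fields)
  values <- gsub('"', "", values, fixed = TRUE)

  # Group values by key, preserving the order in which keys first appear
  split(values, factor(keys, levels = unique(keys)))
}

# Shortened version of the attribute column from the example line above
attrs <- paste0(
  'gene_id "ENSG00000111843.14"; tag "basic"; tag "Ensembl_canonical"; ',
  'tag "MANE_Select"; tag "appris_principal_1"; tag "CCDS"; ',
  'ccdsid "CCDS4514.1";'
)

parsed <- parse_gtf_attributes(attrs)
parsed$tag  # character vector holding all five tags

# Collapse to a single string if one value per feature is needed
tag_string <- paste(parsed$tag, collapse = ",")
```

For a whole file, the ninth tab-separated column could be read with `read.delim(path, header = FALSE, comment.char = "#", quote = "")` and passed through the helper row by row. `quote = ""` matters here because the attribute values are themselves wrapped in double quotes.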