Multiple tag pairs from same tag missed in read_gtf()

sigven commented 2 years ago

Hi,

I just tried your read_gtf() function, and it works really nicely! I noticed one minor issue when reading GENCODE transcripts that I thought I should mention. In the attribute column, the same key (tag) can appear several times, each time with a different value. E.g.

tag "basic"; tag "Ensembl_canonical";

With your read_gtf() function, only a single value is returned in such cases, which is not ideal. I'd rather have the multiple values concatenated in the tag column of the resulting tibble. Using the example above, rather than listing only Ensembl_canonical in the tag column, I'd like basic&Ensembl_canonical (or any other separator of choice).
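To make the request concrete, here is a minimal base R sketch of the desired result for a single attribute string. The gene_id value is made up for illustration, and the regular expression is just one possible way to pull out the values, not how read_gtf() or rtracklayer work internally.

```r
# Toy attribute string modelled on a GENCODE record (gene_id is hypothetical).
attr_str <- 'gene_id "ENSG00000000001"; tag "basic"; tag "Ensembl_canonical";'

# Find every tag "..." pair, then strip the key and the surrounding quotes.
hits <- regmatches(attr_str, gregexpr('tag "[^"]+"', attr_str))[[1]]
tags <- sub('^tag "([^"]+)"$', "\\1", hits)

# Collapse all values into one string, using "&" as the separator.
tag_combined <- paste(tags, collapse = "&")
print(tag_combined)
#> [1] "basic&Ensembl_canonical"
```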

Hope this makes some sense.

kind regards, Sigve

kriemo commented 2 years ago

Hi,
read_gtf() is a simple function that uses rtracklayer::import() to import the GTF, then formats the output to be compatible with valr. This issue has been reported in the GitHub repo for rtracklayer (issue #54), so perhaps it will be fixed upstream in their code base. In the meantime, you could use a regular expression to extract the duplicated tag fields.

Something like this could work and doesn't take too long to run. (not thoroughly tested, use at your own risk)

library(purrr)
library(stringr)
library(valr)

gtf_fn <- tempfile(fileext = ".gtf.gz")
download.file("https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/gencode.v39.basic.annotation.gtf.gz",
              gtf_fn)
# read in lines without formatting
gtf_lines <- readLines(gtf_fn)

# remove header lines
gtf_lines <- gtf_lines[!startsWith(gtf_lines, "##")]

# extract and concatenate tag fields
# (the last column of each match matrix holds the captured tag value)
tag_cols <- str_match_all(gtf_lines, 'tag "(\\S+)";') %>%
  map_chr(~str_c(.x[, ncol(.x)], collapse = "&"))

# read in gtf as data.frame
gtf <- read_gtf(gtf_fn)

# add tags as new col
gtf$tag_cols <- tag_cols

gtf[1:10, c("tag", "tag_cols")] 
#> # A tibble: 10 × 2
#>    tag               tag_cols                 
#>    <chr>             <chr>                    
#>  1 <NA>              ""                       
#>  2 basic             "basic"                  
#>  3 basic             "basic"                  
#>  4 basic             "basic"                  
#>  5 basic             "basic"                  
#>  6 Ensembl_canonical "basic&Ensembl_canonical"
#>  7 Ensembl_canonical "basic&Ensembl_canonical"
#>  8 Ensembl_canonical "basic&Ensembl_canonical"
#>  9 Ensembl_canonical "basic&Ensembl_canonical"
#> 10 Ensembl_canonical "basic&Ensembl_canonical"

Created on 2021-12-15 by the reprex package (v2.0.0)
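The same idea can be tried without downloading the full annotation. Below is a rough sketch using base R only: collapse_tags() is a hypothetical helper (not part of valr or rtracklayer), and the toy lines contain only made-up attribute strings rather than complete GTF records. Unlike the snippet above, it returns NA instead of "" for records without a tag, which matches how the existing tag column represents missing values.

```r
# Hypothetical helper: collapse all tag "..." values found on each line
# into a single string joined by `sep`.
collapse_tags <- function(lines, sep = "&") {
  # One character vector of matches per line (empty if there is no tag)
  hits <- regmatches(lines, gregexpr('tag "[^"]+"', lines))
  out <- vapply(hits, function(h) {
    # Strip the key and quotes, then join the remaining values
    paste(sub('^tag "([^"]+)"$', "\\1", h), collapse = sep)
  }, character(1))
  # Lines without any tag collapse to "", so convert those to NA
  out[out == ""] <- NA_character_
  out
}

# Toy attribute strings (identifiers are made up for illustration)
toy_lines <- c(
  'gene_id "G1"; gene_type "protein_coding";',
  'gene_id "G1"; transcript_id "T1"; tag "basic";',
  'gene_id "G1"; transcript_id "T2"; tag "basic"; tag "Ensembl_canonical";'
)

toy_tags <- collapse_tags(toy_lines)
print(toy_tags)
```

For the toy input this gives NA for the first line, "basic" for the second, and "basic&Ensembl_canonical" for the third. When attaching the result to the output of read_gtf(), it is probably worth checking first that the number of lines equals the number of rows in the tibble, since the approach relies on both being in the same order.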

sigven commented 2 years ago

Hi,

Thanks for the swift response and clarification! I did not catch the previously reported issue, sorry about that. I already have a large, ugly chunk of code to work around this, but figured that using your read_gtf() would make my life a lot easier. I'll keep an eye on rtracklayer updates.

kind regards, Sigve

rnabioco / valr

Multiple tag pairs from same tag missed in read_gtf() #384