sigven closed 2 years ago
Hi,
read_gtf is a simple function that uses rtracklayer::import for importing the GTF, then formats the output to be compatible with valr. This issue has been mentioned in the GitHub repo for rtracklayer (issue #54), so perhaps it will be fixed upstream in their code base. In the meantime, you could use a regular expression to extract the duplicated tag fields.
Something like this could work and doesn't take too long to run (not thoroughly tested, use at your own risk):
library(purrr)
library(stringr)
library(valr)
gtf_fn <- tempfile(fileext = ".gtf.gz")
download.file(
  "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/gencode.v39.basic.annotation.gtf.gz",
  gtf_fn
)
# read in lines without formatting
gtf_lines <- readLines(gtf_fn)
# remove header lines
gtf_lines <- gtf_lines[!startsWith(gtf_lines, "##")]
# extract and concatenate tag fields (one string per line, "" if no tag)
tag_cols <- str_match_all(gtf_lines, "tag \\\"(\\S+)\\\";") %>%
  map_chr(~str_c(.x[, ncol(.x)], collapse = "&"))
# read in gtf as data.frame
gtf <- read_gtf(gtf_fn)
# add tags as new col
gtf$tag_cols <- tag_cols
gtf[1:10, c("tag", "tag_cols")]
#> # A tibble: 10 × 2
#>    tag               tag_cols
#>    <chr>             <chr>
#>  1 <NA>              ""
#>  2 basic             "basic"
#>  3 basic             "basic"
#>  4 basic             "basic"
#>  5 basic             "basic"
#>  6 Ensembl_canonical "basic&Ensembl_canonical"
#>  7 Ensembl_canonical "basic&Ensembl_canonical"
#>  8 Ensembl_canonical "basic&Ensembl_canonical"
#>  9 Ensembl_canonical "basic&Ensembl_canonical"
#> 10 Ensembl_canonical "basic&Ensembl_canonical"
Created on 2021-12-15 by the reprex package (v2.0.0)
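If you want to try the tag extraction without downloading the full GENCODE file, here is a minimal base R sketch of the same idea. The attribute strings are made up for illustration, and collapse_tags is a hypothetical helper name, not part of valr or rtracklayer. It also uses a slightly different pattern from the one above (a word boundary before tag, and [^"]+ for the value), which is an assumption about what the attribute column looks like.

```r
# Toy attribute strings mimicking column 9 of a GENCODE GTF
# (IDs and values are invented for illustration)
attrs <- c(
  'gene_id "ENSG00000000001.1"; gene_type "lncRNA";',
  'gene_id "ENSG00000000002.1"; tag "basic";',
  'gene_id "ENSG00000000003.1"; tag "basic"; tag "Ensembl_canonical";'
)

# Hypothetical helper: collect every tag value in each string and
# join them with `sep`; strings without a tag give ""
collapse_tags <- function(x, sep = "&") {
  # \\b keeps keys that merely end in "tag" (e.g. some_tag) from matching
  hits <- regmatches(x, gregexpr('\\btag "[^"]+";', x, perl = TRUE))
  vapply(hits, function(h) {
    vals <- sub('^tag "([^"]+)";$', "\\1", h, perl = TRUE)
    paste(vals, collapse = sep)
  }, character(1))
}

tags <- collapse_tags(attrs)
# tags is c("", "basic", "basic&Ensembl_canonical")
print(tags)
```

The same helper could be applied to gtf_lines in place of the str_match_all step, with a different sep if "&" is not the separator you want.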
Hi,
Thanks for the swift response and clarification! I did not catch the previously reported issue, sorry about that. I already have some large, ugly code in place to work around this, but figured that using your read_gtf would make my life a lot easier. I'll keep an eye on updates to rtracklayer.
kind regards, Sigve
Hi,
I just tried your read_gtf() function, and it works really nicely! I just observed one minor issue when reading GENCODE transcripts that I thought I should mention. In the attribute column, there are multiple tag-value pairs coming from the same tag, e.g. tag "basic"; tag "Ensembl_canonical";
From your read_gtf() function, only a single value is returned when such cases occur, and that is not ideal. I'd rather see the multiple values concatenated in the tag column of the resulting tibble. Using the example above, rather than listing only Ensembl_canonical in the tag column, I'd like basic&Ensembl_canonical (or any other separator of preference).
Hope this makes some sense.
kind regards, Sigve