gmbecker / genbankr

http://bioconductor.org/packages/devel/bioc/html/genbankr.html
14 stars 9 forks source link

error with makeTxDbFromGenBank #15

Open jayoung opened 2 years ago

jayoung commented 2 years ago

hi there,

I just found your package - looks really useful: thank you!

I had no trouble reading in my genbank sequence using readGenBank, but when I tried to use makeTxDbFromGenBank I encountered some errors. I got a little way down the road in trying to hack a solution before giving up - I've included notes below.

It suspect it is something about the specific Genbank entry I'm trying to use: it's a viral genome, so most genes are intronless, and most genes lack a transcript annotation and lack exon annotation (most have only the gene and CDS). I don't know how common this situation is, so don't know how worthwhile it will be for you to fix it.

But as I started looking at the code, I thought it might not be too difficult to fix it, and I would guess it is might be common enough for genomes like viruses and simple eukaryotes like S.cerevisiae and friends that it could be good motivation.

library(genbankr)

# this works
ref <- readGenBank(GBAccession("JQ673480.1"))

##### makeTxDbFromGenBank error #1:
ref_txdb <- makeTxDbFromGenBank(ref, reassign.ids = FALSE)
# Error in data.frame(tx_id = as.integer(factor(txgr$transcript_id)), gene_id = as.character(txgr$gene_id),  : 
#  arguments imply differing number of rows: 3, 0

I started playing around with the makeTxDbFromGenBank function and was able to get past this error by changing line 87 of txdb.R as follows:

gene_id = as.character(txgr$gene_id)
# changed to to 
gene_id = as.character(txgr$gene)

After fixing that I tried again:

###### error # 2
ref_txdb <- makeTxDbFromGenBank_JY(ref, reassign.ids = FALSE)
##  Error in .check_foreign_key(splicings$tx_id, "integer", "splicings$tx_id",  : 'splicings$tx_id' cannot contain NAs 

I was able to get past that one by changing line 71 of txdb.R

splicedf = data.frame(tx_id = chartxidToInt_JY(cdsgr$transcript_id, txgr$transcript_id), ...
## to 
tx_id = chartxidToInt_JY( cdsgr$transcript_id ), ...

and on trying again:

###### error # 3
makeTxDbFromGenBank_JY(ref, reassign.ids = FALSE)
#Error in .check_foreign_key(splicings$tx_id, "integer", "splicings$tx_id",  : 
#      all the values in 'splicings$tx_id' must be present in #'transcripts$tx_id' 

I can see that the problem is stemming from the fact I have many CDSs without a corresponding transcript. The help page ?readGenBank sheds some light on why that is, for this genome: In files where transcripts are not present, 'approximate transcripts' defined by the ranges spanned by groups of exons are used. Currently, we do not support generating approximate transcripts from CDSs in files that contain actual transcript annotations, even if those annotations do not cover all genes with CDS/exon annotations.

Would you be able to implement that support to help in cases like mine? thanks for considering it!

all the best,

Janet Young

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] genbankr_1.20.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7                  lattice_0.20-45             prettyunits_1.1.1          
 [4] png_0.1-7                   Rsamtools_2.8.0             Biostrings_2.60.2          
 [7] assertthat_0.2.1            digest_0.6.28               utf8_1.2.2                 
[10] BiocFileCache_2.0.0         R6_2.5.1                    GenomeInfoDb_1.28.4        
[13] stats4_4.1.1                RSQLite_2.2.8               httr_1.4.2                 
[16] pillar_1.6.3                zlibbioc_1.38.0             rlang_0.4.11               
[19] GenomicFeatures_1.44.2      progress_1.2.2              curl_4.3.2                 
[22] rstudioapi_0.13             rentrez_1.2.3               blob_1.2.2                 
[25] S4Vectors_0.30.2            Matrix_1.3-4                BiocParallel_1.26.2        
[28] stringr_1.4.0               RCurl_1.98-1.5              bit_4.0.4                  
[31] biomaRt_2.48.3              DelayedArray_0.18.0         compiler_4.1.1             
[34] rtracklayer_1.52.1          pkgconfig_2.0.3             BiocGenerics_0.38.0        
[37] SummarizedExperiment_1.22.0 tidyselect_1.1.1            KEGGREST_1.32.0            
[40] tibble_3.1.5                GenomeInfoDbData_1.2.6      matrixStats_0.61.0         
[43] IRanges_2.26.0              XML_3.99-0.8                fansi_0.5.0                
[46] crayon_1.4.1                dplyr_1.0.7                 dbplyr_2.1.1               
[49] GenomicAlignments_1.28.0    bitops_1.0-7                rappdirs_0.3.3             
[52] grid_4.1.1                  jsonlite_1.7.2              lifecycle_1.0.1            
[55] DBI_1.1.1                   magrittr_2.0.1              stringi_1.7.5              
[58] cachem_1.0.6                XVector_0.32.0              xml2_1.3.2                 
[61] ellipsis_0.3.2              filelock_1.0.2              generics_0.1.0             
[64] vctrs_0.3.8                 rjson_0.2.20                restfulr_0.0.13            
[67] tools_4.1.1                 bit64_4.0.5                 BSgenome_1.60.0            
[70] Biobase_2.52.0              glue_1.4.2                  purrr_0.3.4                
[73] MatrixGenerics_1.4.3        hms_1.1.1                   parallel_4.1.1             
[76] fastmap_1.1.0               yaml_2.2.1                  AnnotationDbi_1.54.1       
[79] GenomicRanges_1.44.0        memoise_2.0.0               VariantAnnotation_1.38.0   
[82] BiocIO_1.2.0