marbl / CHM13

The complete sequence of a human genome
Other
920 stars 99 forks source link

CHM13 liftover chain is incompatible with the import.chain function in R package rtracklayer #80

Closed zhangjy859 closed 1 year ago

zhangjy859 commented 1 year ago

Hi, I am noted that CHM13 liftover chain is incompatible with the import.chain function in R package rtracklayer

# download https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chain/v1_nflo/chm13v2-hg19.chain
library(rtracklayer) 
chain <- import.chain('chm13v2-hg19.chain')
Error in .local(con, format, text, ...) : 
  expected 11 elements in header, got 3, on line 4

I have checked import.chain is work well with UCSC liftover chain, eg hg38ToHg19.over.chain.gz (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz)

Regrads

Zhang

jamesdalg commented 1 year ago

I wish to report that this is an issue as well. If anyone is familiar with a workaround, that would be incredibly helpful. It happens with the latest version of rtracklayer and on both chains.

> hg38_to_chm13_chain<-rtracklayer::import.chain("grch38-chm13v2.chain")
Error in .local(con, format, text, ...) : 
  expected 11 elements in header, got 1, on line 1
> hg38_to_chm13_chain<-rtracklayer::import.chain("chm13v2-grch38.chain")
Error in .local(con, format, text, ...) : 
  expected 11 elements in header, got 3, on line 4
> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] magrittr_2.0.3                     StructuralVariantAnnotation_1.12.0 rtracklayer_1.56.1                 VariantAnnotation_1.42.1          
 [5] Rsamtools_2.12.0                   Biostrings_2.64.1                  XVector_0.36.0                     regioneR_1.28.0                   
 [9] GenomicInteractions_1.32.0         InteractionSet_1.26.1              SummarizedExperiment_1.26.1        Biobase_2.56.0                    
[13] MatrixGenerics_1.8.1               matrixStats_0.63.0                 GenomicRanges_1.48.0               GenomeInfoDb_1.32.4               
[17] IRanges_2.30.1                     S4Vectors_0.34.0                   BiocGenerics_0.42.0               

loaded via a namespace (and not attached):
  [1] colorspace_2.1-0         rjson_0.2.21             deldir_1.0-9             biovizBase_1.46.0        htmlTable_2.4.1          base64enc_0.1-3         
  [7] dichromat_2.0-0.1        rstudioapi_0.14          bit64_4.0.5              AnnotationDbi_1.58.0     fansi_1.0.4              xml2_1.3.4              
 [13] codetools_0.2-19         cachem_1.0.8             knitr_1.42               Formula_1.2-5            cluster_2.1.4            dbplyr_2.3.2            
 [19] png_0.1-8                compiler_4.2.3           httr_1.4.6               backports_1.4.1          Matrix_1.5-4.1           fastmap_1.1.1           
 [25] lazyeval_0.2.2           cli_3.6.1                htmltools_0.5.5          prettyunits_1.1.1        tools_4.2.3              igraph_1.4.2            
 [31] gtable_0.3.3             glue_1.6.2               GenomeInfoDbData_1.2.8   dplyr_1.1.2              rappdirs_0.3.3           Rcpp_1.0.10             
 [37] vctrs_0.6.2              xfun_0.39                stringr_1.5.0            lifecycle_1.0.3          restfulr_0.0.15          GS3DKViz_1.0.2          
 [43] ensembldb_2.22.0         XML_3.99-0.14            zlibbioc_1.42.0          scales_1.2.1             BSgenome_1.64.0          hms_1.1.3               
 [49] ProtGenerics_1.30.0      parallel_4.2.3           AnnotationFilter_1.22.0  RColorBrewer_1.1-3       yaml_2.3.7               curl_5.0.0              
 [55] memoise_2.0.1            gridExtra_2.3            ggplot2_3.4.2            biomaRt_2.52.0           rpart_4.1.19             latticeExtra_0.6-30     
 [61] stringi_1.7.12           RSQLite_2.3.1            BiocIO_1.6.0             checkmate_2.2.0          GenomicFeatures_1.48.4   filelock_1.0.2          
 [67] BiocParallel_1.30.4      rlang_1.1.1              pkgconfig_2.0.3          bitops_1.0-7             evaluate_0.21            lattice_0.21-8          
 [73] GenomicAlignments_1.32.1 htmlwidgets_1.6.2        bit_4.0.5                tidyselect_1.2.0         R6_2.5.1                 generics_0.1.3          
 [79] Hmisc_5.1-0              DelayedArray_0.22.0      DBI_1.1.3                pillar_1.9.0             foreign_0.8-84           KEGGREST_1.36.3         
 [85] RCurl_1.98-1.12          nnet_7.3-19              tibble_3.2.1             crayon_1.5.2             interp_1.1-4             utf8_1.2.3              
 [91] BiocFileCache_2.4.0      rmarkdown_2.21           jpeg_0.1-10              progress_1.2.2           grid_4.2.3               data.table_1.14.8       
 [97] blob_1.2.4               digest_0.6.31            munsell_0.5.0            Gviz_1.42.1     
jamesdalg commented 1 year ago

I realize I had an issue similar to this last year. One workaround that was suggested was to use crossmap. https://github.com/marbl/CHM13/issues/61 This will work for bed files, but not for SVs in VCF format. I can't think of an easy way to work around this other than to fix the chain file. In any case, hopefully this will help someone else until the chain file can be fixed.

jamesdalg commented 1 year ago

I fixed one of them. https://github.com/lawremi/rtracklayer/issues/23 Essentially, you can fix the chm13v2 to hg38 with sed. rtracklayer doesn't work with spaces instead of tabs.

However, the more important direction (migrating to CHM13) does not work.

sed -r 's/^([0-9]+) ([0-9]+) ([0-9]+)$/\1\t\2\t\3/' chm13v2-grch38.chain > chm13v2-grch38-tabs.chain
sed -r 's/^([0-9]+) ([0-9]+) ([0-9]+)$/\1\t\2\t\3/' grch38-chm13v2.chain > grch38-chm13v2-tabs.chain 
chm13_to_hg38_chain<-rtracklayer::import.chain("chm13v2-grch38-tabs.chain")
> hg38_to_chm13_chain<-rtracklayer::import.chain("grch38-chm13v2-tabs.chain")
Error in .local(con, format, text, ...) : 
  expected 11 elements in header, got 1, on line 1
diekhans commented 1 year ago

This is caused by the chains not having chainIds. This was fixed, but apparently never updated on S3. I put the fixed chains here:

https://hgwdev.gi.ucsc.edu/~markd/t2t/CHM13-fixed-chains/

jamesdalg commented 1 year ago

Seems to work!

> hg38_to_chm13_chain<-rtracklayer::import.chain("hg38-chm13v2.over.chain")
> chm13_to_hg38_chain<-rtracklayer::import.chain("chm13v2-hg38.over.chain")

no error messages.

zhangjy859 commented 1 year ago

Hi, @diekhans,

Thanks for updated liftover chain, but There are some issues here.

In fact, I failed to fix the problems in grch38-chm13v2.chain at that time, so as a bypass solution, I used grch38-chm13v2.chain and UCSC liftover to complete liftover and generate a large amount of data. It didn't report an error, but I was wondering if there was an underlying problem and if I needed to rerun my related process under fix's grch38-chm13v2.chain.

zhangjy859 commented 1 year ago

From my testing with limited data, there doesn't seem to be a significant difference in output, but I'm not sure if there is a potential issue, especially if I plan to use the modified file later in the process without updating the previous data

diekhans commented 1 year ago

The problem with the original chains was that the chain id column was set to zero on all the chains. This causes liftOver to generate an error. However, I don't believe that the chain ids are used by leftover. We just ran a program to add chain ids. I don't know what your bypass solution was, so I can't comment if it would make a difference.

zhangjy859 commented 1 year ago

Because the old chain file can be accpect by UCSC liftover, so I just write a ucsc liftover cmd warpper with R, its worked by convert my Granges to bed file, and use ucsc liftover to liftover the bed, and read the converted (liftover) bed as results, so if the old and new chain not effect ucsc liftover, its ok for me.

I have run a quick check:

# old
../liftOver hg38_bed ../grch38-chm13v2.chain test/hg38_high_to_chm13.old.bed test/hg38_to_chm13.old.unmapped -bedPlus=3
# new
../liftOver hg38_bed hg38-chm13v2.over.chain test/hg38_to_chm13.new.bed test/hg38_high_to_chm13.new.unmapped -bedPlus=3

md5sum hg38_to_chm13.old.bed
d500b2c86eac071a49edb865ae7c02e6  hg38_to_chm13.old.bed
md5sum hg38_to_chm13.new.bed 
d500b2c86eac071a49edb865ae7c02e6  hg38_to_chm13.new.bed
njspix commented 2 months ago

Just wanted to add, in case it is helpful for anyone, the following snippet that worked for me. Replaces tabs with spaces in the chain header and creates unique chain ids:

awk '/^chain/ {$1=$1; OFS=" "; $NF = "chain" sprintf("%03d", ++count)} 1' original.chain > fixed.chain