gustaveroussy / EaCoN

Easy Copy Number !
MIT License
20 stars 14 forks source link

BED file problems #20

Closed l0ka closed 3 years ago

l0ka commented 4 years ago

Hi, thank you for this great (and very easy to use) package. I successfully run EaCoN on SNP array data and now I am trying to analyze WES data. I am currently experiencing some problems at the very first step of the analysis. The command I used is: BINpack.Maker(bed.file = "S06588914_Regions_coord_only.bed", bin.size = 50, genome.pkg = "BSgenome.Hsapiens.UCSC.hg19") I tried using both a bed file with and without the "chr" notation on chromosomes, however I always get the following:

Without "chr":

Loading BSgenome.Hsapiens.UCSC.hg19 ...
Checking BED ...
Loading BSgenome.Hsapiens.UCSC.hg19 ...
Error: BED file contains chromosome(s) not defined in the BSgenome.Hsapiens.UCSC.hg19 genome !

With "chr":

Loading BSgenome.Hsapiens.UCSC.hg19 ...
Checking BED ...
Loading BSgenome.Hsapiens.UCSC.hg19 ...
 Removing inproper coordinates (start >= end) ...
 Removing replicates ...
 Looking for overlaps ...
Error in validObject(object) : 
  invalid class "GRanges" object: 1: invalid object for slot "ranges" in class "GRanges": got class "CompressedIRangesList", should be or extend class "IRanges"
  invalid class "GRanges" object: 2: 'names(x)' must be NULL or have the length of 'x'

The BED file looks like this:

1       65509   65625
1       65831   65973
1       69481   69600
1       721381  721519
1       721530  721806
1       721851  721942
1       752916  753035
1       762095  762275
1       762280  762414
1       762420  762565

And the chromosomes are:

sort -u -k1,1 S06588914_Regions_coord_only.bed
1       65509   65625
10      95122   95214
11      193051  193278
12      176060  176177
13      19239785        19239905
14      19489123        19489302
15      20169986        20170101
16      79236   79596
17      6003    6197
18      82681   82802
19      107093  107209
2       41459   41665
20      68259   68454
21      9825763 9825854
22      16157525        16157634
3       239350  239636
4       53238   53359
5       140384  140841
6       203388  203765
7       193156  193342
8       182713  182858
9       154730  154822
X       200849  200991
Y       150849  150991

Session info:

R version 3.4.3 (2017-11-30)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /services/tools/anaconda2/4.0.0/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19.masked_1.3.99  
 [2] BSgenome.Hsapiens.UCSC.hg18_1.3.1000       
 [3] BiocInstaller_1.20.3                       
 [4] data.table_1.12.8                          
 [5] BSgenome.Hsapiens.UCSC.hg19_1.4.0          
 [6] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
 [7] BSgenome_1.38.0                            
 [8] rtracklayer_1.30.4                         
 [9] Biostrings_2.38.4                          
[10] XVector_0.10.0                             
[11] GenomicRanges_1.22.4                       
[12] GenomeInfoDb_1.6.3                         
[13] IRanges_2.4.8                              
[14] S4Vectors_0.8.11                           
[15] BiocGenerics_0.24.0                        
[16] EaCoN_0.3.4-1                              
[17] nvimcom_0.9-87                             

loaded via a namespace (and not attached):
 [1] mclust_5.4.5               Rcpp_1.0.4                
 [3] ASCAT_2.5.2                lattice_0.20-41           
 [5] Rsamtools_1.22.0           zoo_1.8-7                 
 [7] assertthat_0.2.1           digest_0.6.25             
 [9] foreach_1.5.0              R6_2.4.1                  
[11] futile.options_1.0.1       aroma.light_3.8.0         
[13] pillar_1.4.3               zlibbioc_1.24.0           
[15] rlang_0.4.5                R.utils_2.9.2             
[17] R.oo_1.23.0                DT_0.13                   
[19] BiocParallel_1.4.3         htmlwidgets_1.5.1         
[21] RCurl_1.98-1.1             compiler_3.4.3            
[23] pkgconfig_2.0.3            htmltools_0.4.0           
[25] tidyselect_1.0.0           SummarizedExperiment_1.0.2
[27] tibble_3.0.0               codetools_0.2-16          
[29] matrixStats_0.56.0         XML_3.98-1.3              
[31] changepoint_2.2.2          fansi_0.4.1               
[33] crayon_1.3.4               dplyr_0.8.5               
[35] MASS_7.3-51.5              GenomicAlignments_1.6.3   
[37] bitops_1.0-6               R.methodsS3_1.8.0         
[39] grid_3.4.3                 lifecycle_0.2.0           
[41] magrittr_1.5               formatR_1.7               
[43] cli_2.0.2                  doParallel_1.0.15         
[45] limma_3.34.9               seqinr_3.6-1              
[47] futile.logger_1.4.3        ellipsis_0.3.0            
[49] vctrs_0.2.4                lambda.r_1.2.4            
[51] RColorBrewer_1.1-2         iterators_1.0.12          
[53] tools_3.4.3                iotools_0.3-1             
[55] ade4_1.7-15                Biobase_2.30.0            
[57] glue_1.4.0                 purrr_0.3.3               
[59] rhdf5_2.14.0               BiocManager_1.30.10       
[61] affxparser_1.50.0

What should I do? Thank you in advance!

[UPDATE]: the chromosome error apparently is due to missing "chr" notation and "chrM" coordinates. The other error is still present.

aoumess commented 4 years ago

Hi,

If your bed natively misses "chr" but corresponds to the hg19 genome build, it's probably that it refers to the 1000genomes HS37D5 build (that is available as a BSgenome under bioconductor, package name 'BSgenome.Hsapiens.1000genomes.hs37d5'). That's the genome build used by the Legacy TCGA project, by example. You may have more luck using this genome build ? Could you please give me some feedback then ? I currently wonder what causes the second error :S

l0ka commented 4 years ago

Hi @aoumess, luckily it was just a problem of genome version. Using the original bed file (without "chr" notation and the mt chromosome coordinates) and the "BSgenome.Hsapiens.1000genomes.hs37d5" solved both issues. Thank you very much!

P.S.: for some samples, using facets I get this error: https://github.com/mskcc/facets/issues/46 Almost nobody seems to care about it but maybe it could be nice to adjust EaCoN code to prevent this annoying issue.