lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
127 stars 32 forks source link

Missing chromosomename prefix in DNAcopy output file after PSCBS #331

Closed ericblanc20 closed 11 months ago

ericblanc20 commented 11 months ago

Describe the issue We are using GRCh38.d1.vd1 (which has chr prefix in their chromosome names), and when we select the PSCBS segmentation, we notice that this prefix is missing in the <sample>_dnacopy.seg file, and in this file only.

The <sample>_dnacopy.seg file looks like:

$ head output/<sample>_dnacopy.seg 
ID  chrom   loc.start   loc.end num.mark    seg.mean    C
<sample>    1   925932  122026458   11983   -0.0481143805497686 2
<sample>    1   124932725   152060910   887 0.0376277667132479  2
<sample>    1   152060910   152929951   165 0.269045736247666   3
<sample>    1   152929951   248918032   8992    0.0376277667132479  2
<sample>    2   41547   27520613    1806    -0.0475065381774023 2
<sample>    2   27520613    27582676    47  0.226940040709251   3
<sample>    2   27582676    73448316    2993    -0.0475065381774023 2
<sample>    2   73448317    73454437    17  0.320080262369761   4
<sample>    2   73454437    73625221    22  0.0920044179441789  2

To Reproduce Command issued using the docker container:

Rscript /opt/PureCN/PureCN.R \
    --genome hg38 \
    --sampleid $sample \
    --tumor /input/coverage/$sample/bwa.${sample}_coverage_loess.txt.gz \
    --vcf /input/vcfs/bwa.mutect2.$sample.vcf.gz \
    --normaldb /input/panel_of_normals/normalDB_SureSelectv8_hg38.rds \
    --intervals /input/extra/intervals.list \
    --fun-segmentation PSCBS --model betabin --post-optimize --seed 1234567 \
    --out /output/$sample --out-vcf

The same behaviour is observed using the code from github.

Expected behavior The first lines of <sample>_dnacopy.seg should look like

ID  chrom   loc.start   loc.end num.mark    seg.mean    C
<sample>    chr1    925932  122026458   11983   -0.0481143805497686 2
<sample>    chr1    124932725   152060910   887 0.0376277667132479  2
<sample>    chr1    152060910   152929951   165 0.269045736247666   3
<sample>    chr1    152929951   248918032   8992    0.0376277667132479  2
<sample>    chr2    41547   27520613    1806    -0.0475065381774023 2
<sample>    chr2    27520613    27582676    47  0.226940040709251   3
<sample>    chr2    27582676    73448316    2993    -0.0475065381774023 2
<sample>    chr2    73448317    73454437    17  0.320080262369761   4
<sample>    chr2    73454437    73625221    22  0.0920044179441789  2

Session Info

Docker label: RELEASE_3_17

For the run based on the github code:

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] PureCN_2.7.12               VariantAnnotation_1.46.0   
 [3] Rsamtools_2.16.0            Biostrings_2.68.1          
 [5] XVector_0.40.0              SummarizedExperiment_1.30.2
 [7] Biobase_2.60.0              GenomicRanges_1.52.0       
 [9] GenomeInfoDb_1.36.3         IRanges_2.34.1             
[11] S4Vectors_0.38.2            MatrixGenerics_1.12.3      
[13] matrixStats_1.0.0           BiocGenerics_0.46.0        
[15] DNAcopy_1.74.1             

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0         dplyr_1.1.3              blob_1.2.4              
 [4] filelock_1.0.2           bitops_1.0-7             fastmap_1.1.1           
 [7] RCurl_1.98-1.12          BiocFileCache_2.8.0      GenomicAlignments_1.36.0
[10] XML_3.99-0.14            digest_0.6.33            lifecycle_1.0.3         
[13] KEGGREST_1.40.0          RSQLite_2.3.1            magrittr_2.0.3          
[16] compiler_4.3.1           rlang_1.1.1              progress_1.2.2          
[19] tools_4.3.1              utf8_1.2.3               yaml_2.3.7              
[22] data.table_1.14.8        rtracklayer_1.60.1       lambda.r_1.2.4          
[25] prettyunits_1.2.0        S4Arrays_1.0.6           mclust_6.0.0            
[28] bit_4.0.5                curl_5.0.2               DelayedArray_0.26.7     
[31] xml2_1.3.5               RColorBrewer_1.1-3       abind_1.4-5             
[34] BiocParallel_1.34.2      grid_4.3.1               fansi_1.0.4             
[37] colorspace_2.1-0         Rhdf5lib_1.22.1          ggplot2_3.4.3           
[40] scales_1.2.1             biomaRt_2.56.1           cli_3.6.1               
[43] crayon_1.5.2             generics_0.1.3           httr_1.4.7              
[46] rjson_0.2.21             rhdf5_2.44.0             DBI_1.1.3               
[49] cachem_1.0.8             stringr_1.5.0            zlibbioc_1.46.0         
[52] splines_4.3.1            parallel_4.3.1           AnnotationDbi_1.62.2    
[55] formatR_1.14             restfulr_0.0.15          vctrs_0.6.3             
[58] Matrix_1.6-0             VGAM_1.1-9               hms_1.1.3               
[61] bit64_4.0.5              GenomicFeatures_1.52.2   glue_1.6.2              
[64] codetools_0.2-19         gtable_0.3.4             stringi_1.7.12          
[67] futile.logger_1.4.3      BiocIO_1.10.0            munsell_0.5.0           
[70] tibble_3.2.1             pillar_1.9.0             rhdf5filters_1.12.1     
[73] rappdirs_0.3.3           GenomeInfoDbData_1.2.10  BSgenome_1.68.0         
[76] R6_2.5.1                 dbplyr_2.3.4             lattice_0.21-8          
[79] futile.options_1.0.1     png_0.1-8                memoise_2.0.1           
[82] gridExtra_2.3            pkgconfig_2.0.3         

Additional context

We have tried to add the chromosome names to .PSCBSoutput2DNAcopy, giving the function chr.hash as an extra argument. It appears to be working in our limited tests: the chromosome names are with prefix in the *_dnacopy.seg file, and there are no change to the 5 *.csv files, to the vcf file, and no apparent changes to the pdf files.

For completeness, the diff of the changes to file segmentationPSCBS are below:

$ diff PureCN_orig/R/segmentationPSCBS.R PureCN_modif/R/segmentationPSCBS.R 
136c136
<             segot <- .PSCBSoutput2DNAcopy(segPSCBSot, sampleid)
---
>             segot <- .PSCBSoutput2DNAcopy(segPSCBSot, sampleid, chr.hash)
150c150
<         seg <- .PSCBSoutput2DNAcopy(segPSCBS, sampleid)
---
>         seg <- .PSCBSoutput2DNAcopy(segPSCBS, sampleid, chr.hash)
156c156
<             segusd <- .PSCBSoutput2DNAcopy(segusd, sampleid)
---
>             segusd <- .PSCBSoutput2DNAcopy(segusd, sampleid, chr.hash)
243c243
< .PSCBSoutput2DNAcopy <- function(seg, sampleid) {
---
> .PSCBSoutput2DNAcopy <- function(seg, sampleid, chr.hash) {
249a250
>   sx$chrom <- .add.chr.name(sx$chrom, chr.hash)
lima1 commented 11 months ago

Hi @ericblanc20 , DNAcopy only supports numbers as chromosome names. Guess that's why I just kept the numbers for this output. The LOH file should provide all the information and more. Let me know any issues with the LOH file, I'd rather add more features there for now.

ericblanc20 commented 11 months ago

Hi, Sorry about that, I didn't notice chromosomes were numbered instead of named. The LOH file is fine. Thanks