jianhong / ATACseqQC

ATAC-seq Quality Control
https://jianhong.github.io/ATACseqQC/articles/ATACseqQC.html
23 stars 12 forks source link

Fix wrong index assgined in gal1 #26

Closed ycl6 closed 3 years ago

ycl6 commented 4 years ago

Originally splitBam will take bamfile index and pass it to shiftGAlignmentsList and splitGAlignmentsByCut. Because meta$index is not empty, the wrong index (for the original bam) will get store into gal1 which has shifted.bam assgined to meta$file but not shifted.bam to its meta$index. Then splitGAlignmentsByCut halted with below error message:

Error in splitGAlignmentsByCut(gal1, breaks = breaks, labels = labels, : no reads in the obj.

By removing index param from splitBam, both shiftGAlignmentsList and splitGAlignmentsByCut will get the index from the ifelse from the respective meta$file

index <- ifelse(length(meta$index)>0, meta$index, meta$file)

Additionally, a line to include the bam index in gal1 was added in shiftGAlignmentsList.R to complete the metadata of gal1

jianhong commented 4 years ago

Thank you for reporting. I will fix that.

Best!

Your sincerely,

Jianhong Ou

On May 24, 2020, at 8:08 AM, I-Hsuan Lin notifications@github.com wrote:

 Originally splitBam will take bamfile index and pass it to shiftGAlignmentsList and splitGAlignmentsByCut. Because meta$index is not empty, the wrong index (for the original bam) will get store into gal1 which has shifted.bam assgined to meta$file but not shifted.bam to its meta$index. Then splitGAlignmentsByCut halted with below error message:

Error in splitGAlignmentsByCut(gal1, breaks = breaks, labels = labels, : no reads in the obj.

By removing index param from splitBam, both shiftGAlignmentsList and splitGAlignmentsByCut will get the index from the ifelse from the respective meta$file

index <- ifelse(length(meta$index)>0, meta$index, meta$file)

Additionally, a line to include the bam index in gal1 was added in shiftGAlignmentsList.R to complete the metadata of gal1

You can view, comment on, or merge this pull request online at:

https://github.com/jianhong/ATACseqQC/pull/26

Commit Summary

fix incorrect index assigned to gal Store outbam index in gal1 File Changes

M R/shiftGAlignmentsList.R (1) M R/splitBam.R (6) Patch Links:

https://github.com/jianhong/ATACseqQC/pull/26.patch https://github.com/jianhong/ATACseqQC/pull/26.diff — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

ycl6 commented 4 years ago

Thanks @jianhong

Another thing I found is that although gal1 is being added to objs before returning in splitBam,

  objs$all <- gal1
  return(invisible(objs))

objs$all does not contain all the information as that in gal1, hence although one can use splitBam to perform both shiftGAlignmentsList and splitGAlignmentsByCut as some kind of wrapper. but functions such as PTscore, NFRscore and TSSEscore cannot be performed because objs$all ≠ gal1. This then defeats the purpose of using splitBam

Another thing I am not quite sure of is why setting meta$asMates to FALSE for shifted.bam in gal1 inside shiftGAlignmentsList, where shifted.bam is still paired-end when the original bam is paired-end?

      if(!missing(outbam)){
        file.copy(from=mergedfile, to=outbam)
        file.copy(from=paste0(mergedfile, ".bai"), 
                  to=paste0(outbam, ".bai"))
        gal1 <- GAlignments()
        meta$file <- outbam
        meta$asMates <- FALSE
        meta$mpos <- mpos
        metadata(gal1) <- meta
jianhong commented 4 years ago

After carefully review the code, I decide not merge the pull. I still want to keep the index parameter. And I double checked the function by multiple files, I can not repeat your error.

About the PTscore, NFRscore and TSSEscore, all require GAlignments as input. If objs is the input, it will throw an error.

I set the asMate to FALSE, it does not mean it is not paired-end. It just mean it will not be read in as GAlignmentsList.

Hope this explanation will help.

Jianhong.

ycl6 commented 4 years ago

Hi @jianhong Thanks for getting back to me. I am not sure why I am getting the error with splitBam, so I include my script below and shared the bam and index file with you through dropbox (through your duke.edu email), maybe you can take a look at it. Thanks!

library(ATACseqQC)
library(Rsamtools)
library(GenomicScores)
library(BSgenome.Mmusculus.UCSC.mm10)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)

bamFiles = "ATACseq_IPSC_chr19.bam"
bamFiles.labels = "ATACseq_IPSC_chr19"

outPath = bamFiles.labels
dir.create(outPath)
shiftedBamfile = file.path(outPath, "shifted.bam")

chr = "chr19"
possibleTag = combn(LETTERS, 2)
possibleTag = c(paste0(possibleTag[1, ], possibleTag[2, ]), paste0(possibleTag[2, ], possibleTag[1, ]))

genome = Mmusculus
txs = transcripts(TxDb.Mmusculus.UCSC.mm10.knownGene)
sub.txs = txs[seqnames(txs) %in% chr,]

phast = getGScores("phastCons60way.UCSC.mm10")

print(bamFiles.labels)

bamTop100 = scanBam(BamFile(bamFiles, yieldSize = 100), param = ScanBamParam(tag = possibleTag))[[1]]$tag
tags = names(bamTop100)[lengths(bamTop100) == 100]

objs = splitBam(bamFiles, tags = tags, outPath = outPath, txs = sub.txs, genome = genome, conservation = phast, seqlev = chr)

Error in splitGAlignmentsByCut(gal1, breaks = breaks, labels = labels, : no reads in the obj.

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS/LAPACK: /home/ihsuan/miniconda3/lib/libopenblasp-r0.3.9.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0 GenomicFeatures_1.38.2                    AnnotationDbi_1.48.0                      Biobase_2.46.0                           
 [5] BSgenome.Mmusculus.UCSC.mm10_1.4.0        BSgenome_1.54.0                           rtracklayer_1.46.0                        GenomicScores_1.10.0                     
 [9] Rsamtools_2.2.3                           Biostrings_2.54.0                         XVector_0.26.0                            GenomicRanges_1.38.0                     
[13] GenomeInfoDb_1.22.1                       IRanges_2.20.2                            ATACseqQC_1.10.4                          S4Vectors_0.24.4                         
[17] BiocGenerics_0.32.0                      

loaded via a namespace (and not attached):
  [1] colorspace_1.4-1              seqinr_3.6-1                  grImport2_0.2-0               ellipsis_0.3.1                futile.logger_1.4.3           base64enc_0.1-3              
  [7] rGADEM_2.34.1                 ChIPpeakAnno_3.20.1           bit64_0.9-7                   interactiveDisplayBase_1.24.0 splines_3.6.3                 motifStack_1.30.0            
 [13] ade4_1.7-15                   polynom_1.4-0                 seqLogo_1.52.0                GO.db_3.10.0                  dbplyr_1.4.3                  png_0.1-7                    
 [19] graph_1.64.0                  shiny_1.4.0.2                 BiocManager_1.30.10           compiler_3.6.3                httr_1.4.1                    assertthat_0.2.1             
 [25] Matrix_1.2-18                 fastmap_1.0.1                 lazyeval_0.2.2                limma_3.42.2                  later_1.0.0                   formatR_1.7                  
 [31] htmltools_0.4.0               prettyunits_1.1.1             tools_3.6.3                   gtable_0.3.0                  glue_1.4.1                    GenomeInfoDbData_1.2.2       
 [37] dplyr_0.8.5                   rappdirs_0.3.1                Rcpp_1.0.4.6                  vctrs_0.3.0                   multtest_2.42.0               stringr_1.4.0                
 [43] mime_0.9                      lifecycle_0.2.0               ensembldb_2.10.2              XML_3.99-0.3                  idr_1.2                       AnnotationHub_2.18.0         
 [49] edgeR_3.28.1                  zlibbioc_1.32.0               MASS_7.3-51.6                 scales_1.1.1                  hms_0.5.3                     promises_1.1.0               
 [55] ProtGenerics_1.18.0           SummarizedExperiment_1.16.1   RBGL_1.62.1                   AnnotationFilter_1.10.0       lambda.r_1.2.4                yaml_2.2.1                   
 [61] curl_4.3                      memoise_1.1.0                 ggplot2_3.3.0                 MotIV_1.42.0                  biomaRt_2.42.1                stringi_1.4.6                
 [67] RSQLite_2.2.0                 BiocVersion_3.10.1            randomForest_4.6-14           BiocParallel_1.20.1           rlang_0.4.6                   pkgconfig_2.0.3              
 [73] matrixStats_0.56.0            bitops_1.0-6                  lattice_0.20-41               purrr_0.3.4                   htmlwidgets_1.5.1             GenomicAlignments_1.22.1     
 [79] bit_1.1-15.2                  tidyselect_1.1.0              magrittr_1.5                  R6_2.4.1                      DelayedArray_0.12.3           DBI_1.1.0                    
 [85] preseqR_4.0.0                 pillar_1.4.4                  survival_3.1-12               RCurl_1.98-1.2                tibble_3.0.1                  crayon_1.3.4                 
 [91] futile.options_1.0.1          KernSmooth_2.23-17            BiocFileCache_1.10.2          jpeg_0.1-8.1                  progress_1.2.2                locfit_1.5-9.4               
 [97] grid_3.6.3                    blob_1.2.1                    digest_0.6.25                 xtable_1.8-4                  VennDiagram_1.6.20            httpuv_1.5.2                 
[103] regioneR_1.18.1               openssl_1.4.1                 munsell_0.5.0                 askpass_1.1                  

Regarding objs, I was thinking how one can use splitBam to accomplish both shift and split with just one function but still able to calculate the three QC scores, because if it cannot, then it would seems splitBam is not that practical to use after all.

Given that gal1 is assigned to objs$all, I thought I can use objs$all to calculate the score, but found it doesn't contain the same information as gal1 that was assigned. I suppose the content that can be store by then is limited to the information that were already present in other elements in the GAlignmentsList.

I manage to print the structures of gal1 and objs$all when running splitBam:

This is str(gal1)

Formal class 'GAlignments' [package "GenomicAlignments"] with 9 slots
  ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. ..@ values         : Factor w/ 0 levels: 
  .. .. ..@ lengths        : int(0) 
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ start          : int(0) 
  ..@ cigar          : chr(0) 
  ..@ strand         :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. ..@ values         : Factor w/ 3 levels "+","-","*": 
  .. .. ..@ lengths        : int(0) 
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ NAMES          : NULL
  ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : NULL
  .. .. ..@ nrows          : int 0
  .. .. ..@ listData       : Named list()
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ seqinfo        :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
  .. .. ..@ seqnames   : chr(0) 
  .. .. ..@ seqlengths : int(0) 
  .. .. ..@ is_circular: logi(0) 
  .. .. ..@ genome     : chr(0) 
  ..@ elementType    : chr "ANY"
  ..@ metadata       :List of 5
  .. ..$ file   : chr "ATACseq_IPSC_chr19/shifted.bam"
  .. ..$ param  :Formal class 'ScanBamParam' [package "Rsamtools"] with 8 slots
  .. .. .. ..@ flag             : Named int [1:2] 4095 1275
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "keep0" "keep1"
  .. .. .. ..@ simpleCigar      : logi FALSE
  .. .. .. ..@ reverseComplement: logi FALSE
  .. .. .. ..@ tag              : chr [1:10] "AS" "MD" "XG" "NM" ...
  .. .. .. ..@ tagFilter        : list()
  .. .. .. ..@ what             : chr [1:7] "qname" "flag" "mapq" "isize" ...
  .. .. .. ..@ which            :Formal class 'CompressedIRangesList' [package "IRanges"] with 5 slots
  .. .. .. .. .. ..@ unlistData     :Formal class 'IRanges' [package "IRanges"] with 6 slots
  .. .. .. .. .. .. .. ..@ start          : int 1
  .. .. .. .. .. .. .. ..@ width          : int 61431566
  .. .. .. .. .. .. .. ..@ NAMES          : chr "chr19"
  .. .. .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. ..@ elementType    : chr "IRanges"
  .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. ..@ partitioning   :Formal class 'PartitioningByEnd' [package "IRanges"] with 5 slots
  .. .. .. .. .. .. .. ..@ end            : int 1
  .. .. .. .. .. .. .. ..@ NAMES          : chr "chr19"
  .. .. .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. ..@ mapqFilter       : int NA
  .. ..$ asMates: logi FALSE
  .. ..$ which  :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  .. .. .. ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. .. ..@ values         : Factor w/ 1 level "chr19": 1
  .. .. .. .. .. ..@ lengths        : int 1
  .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. ..@ metadata       : list()
  .. .. .. ..@ ranges         :Formal class 'IRanges' [package "IRanges"] with 6 slots
  .. .. .. .. .. ..@ start          : int 1
  .. .. .. .. .. ..@ width          : int 61431566
  .. .. .. .. .. ..@ NAMES          : chr "chr19"
  .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. ..@ metadata       : list()
  .. .. .. ..@ strand         :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. .. .. .. ..@ values         : Factor w/ 3 levels "+","-","*": 3
  .. .. .. .. .. ..@ lengths        : int 1
  .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. ..@ metadata       : list()
  .. .. .. ..@ seqinfo        :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
  .. .. .. .. .. ..@ seqnames   : chr "chr19"
  .. .. .. .. .. ..@ seqlengths : int 61431566
  .. .. .. .. .. ..@ is_circular: logi FALSE
  .. .. .. .. .. ..@ genome     : chr "mm10"
  .. .. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. .. .. .. ..@ rownames       : NULL
  .. .. .. .. .. ..@ nrows          : int 1
  .. .. .. .. .. ..@ listData       : Named list()
  .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. ..@ metadata       : list()
  .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. ..@ metadata       : list()
  .. ..$ mpos   : Named int [1:961764] 41005997 41005994 36933961 36934385 7586559 7586542 5795907 5795937 57855236 57855400 ...
  .. .. ..- attr(*, "names")= chr [1:961764] "SRR5870471.10000055 41005994" "SRR5870471.10000055 41005997" "SRR5870471.10000068 36934385" "SRR5870471.10000068 36933961" ...

This is str(objs$all)

Formal class 'GAlignments' [package "GenomicAlignments"] with 9 slots
  ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. ..@ values         : Factor w/ 66 levels "chr1","chr2",..: 
  .. .. ..@ lengths        : int(0) 
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ start          : int(0) 
  ..@ cigar          : chr(0) 
  ..@ strand         :Formal class 'Rle' [package "S4Vectors"] with 4 slots
  .. .. ..@ values         : Factor w/ 3 levels "+","-","*": 
  .. .. ..@ lengths        : int(0) 
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ NAMES          : chr(0) 
  ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : NULL
  .. .. ..@ nrows          : int 0
  .. .. ..@ listData       :List of 17
  .. .. .. ..$ qname: chr(0) 
  .. .. .. ..$ flag : int(0) 
  .. .. .. ..$ mapq : int(0) 
  .. .. .. ..$ isize: int(0) 
  .. .. .. ..$ seq  :Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots
  .. .. .. .. .. ..@ pool           :Formal class 'SharedRaw_Pool' [package "XVector"] with 2 slots
  .. .. .. .. .. .. .. ..@ xp_list                    : list()
  .. .. .. .. .. .. .. ..@ .link_to_cached_object_list: list()
  .. .. .. .. .. ..@ ranges         :Formal class 'GroupedIRanges' [package "XVector"] with 7 slots
  .. .. .. .. .. .. .. ..@ group          : int(0) 
  .. .. .. .. .. .. .. ..@ start          : int(0) 
  .. .. .. .. .. .. .. ..@ width          : int(0) 
  .. .. .. .. .. .. .. ..@ NAMES          : NULL
  .. .. .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. ..@ elementType    : chr "DNAString"
  .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. ..@ metadata       : list()
  .. .. .. ..$ qual :Formal class 'PhredQuality' [package "Biostrings"] with 5 slots
  .. .. .. .. .. ..@ pool           :Formal class 'SharedRaw_Pool' [package "XVector"] with 2 slots
  .. .. .. .. .. .. .. ..@ xp_list                    : list()
  .. .. .. .. .. .. .. ..@ .link_to_cached_object_list: list()
  .. .. .. .. .. ..@ ranges         :Formal class 'GroupedIRanges' [package "XVector"] with 7 slots
  .. .. .. .. .. .. .. ..@ group          : int(0) 
  .. .. .. .. .. .. .. ..@ start          : int(0) 
  .. .. .. .. .. .. .. ..@ width          : int(0) 
  .. .. .. .. .. .. .. ..@ NAMES          : NULL
  .. .. .. .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. .. .. ..@ metadata       : list()
  .. .. .. .. .. ..@ elementType    : chr "BString"
  .. .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. .. ..@ metadata       : list()
  .. .. .. ..$ mrnm : Factor w/ 66 levels "chr1","chr2",..: 
  .. .. .. ..$ AS   : int(0) 
  .. .. .. ..$ XG   : int(0) 
  .. .. .. ..$ NM   : int(0) 
  .. .. .. ..$ XM   : int(0) 
  .. .. .. ..$ XN   : int(0) 
  .. .. .. ..$ XO   : int(0) 
  .. .. .. ..$ XS   : int(0) 
  .. .. .. ..$ YS   : int(0) 
  .. .. .. ..$ YT   : chr(0) 
  .. .. .. ..$ mpos : Named int(0) 
  .. .. .. .. ..- attr(*, "names")= chr(0) 
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ seqinfo        :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
  .. .. ..@ seqnames   : chr [1:66] "chr1" "chr2" "chr3" "chr4" ...
  .. .. ..@ seqlengths : int [1:66] 195471971 182113224 160039680 156508116 151834684 149736546 145441459 129401213 124595110 130694993 ...
  .. .. ..@ is_circular: logi [1:66] NA NA NA NA NA NA ...
  .. .. ..@ genome     : chr [1:66] NA NA NA NA ...
  ..@ elementType    : chr "ANY"
  ..@ metadata       : list()

objs is a GAlignmentsList, whereas objs$all is a GAlignments and potentially an acceptable input to calculate QC scores

> class(objs)
[1] "GAlignmentsList"
attr(,"package")
[1] "GenomicAlignments"

> class(objs$all)
[1] "GAlignments"
attr(,"package")
[1] "GenomicAlignments"