Error in rcpp_normalization

wong-ziyi commented 2 years ago

Hi,

When I run the srnadiffExp() I got warning message as below:

.. 1141080 elements found...
... Annotation step done.
Constructing object...
... done.
Warning message:
In rcpp_normalization(cvg, librarySize) :
  NAs introduced by coercion to integer range

And the srnaExp object looks like this: Since there are all NaN in normaFactors of srnaExp, srnadiff(srnaExp,nThreads=64) can not run appropriately:

Starting IR step...
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'args' in selecting a method for function 'do.call': subscript contains NAs

Below is the session info:

R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.4.0   dplyr_1.0.7     srnadiff_1.12.2

loaded via a namespace (and not attached):
  [1] backports_1.2.1             Hmisc_4.5-0                 BiocFileCache_2.0.0         lazyeval_0.2.2              splines_4.1.1               BiocParallel_1.26.2        
  [7] usethis_2.0.1               GenomeInfoDb_1.28.4         ggplot2_3.3.5               digest_0.6.27               ensembldb_2.16.4            htmltools_0.5.2            
 [13] fansi_0.5.0                 magrittr_2.0.1              checkmate_2.0.0             memoise_2.0.0               BSgenome_1.60.0             cluster_2.1.2              
 [19] limma_3.48.3                remotes_2.4.0               Biostrings_2.60.2           annotate_1.70.0             matrixStats_0.60.1          prettyunits_1.1.1          
 [25] jpeg_0.1-9                  colorspace_2.0-2            blob_1.2.2                  rappdirs_0.3.3              xfun_0.26                   callr_3.7.0                
 [31] crayon_1.4.1                RCurl_1.98-1.4              genefilter_1.74.0           survival_3.2-13             VariantAnnotation_1.38.0    glue_1.4.2                 
 [37] gtable_0.3.0                zlibbioc_1.38.0             XVector_0.32.0              DelayedArray_0.18.0         pkgbuild_1.2.0              BiocGenerics_0.38.0        
 [43] abind_1.4-5                 scales_1.1.1                DBI_1.1.1                   edgeR_3.34.1                Rcpp_1.0.7                  xtable_1.8-4               
 [49] progress_1.2.2              htmlTable_2.2.1             foreign_0.8-81              bit_4.0.4                   Formula_1.2-4               stats4_4.1.1               
 [55] baySeq_2.26.0               htmlwidgets_1.5.4           httr_1.4.2                  RColorBrewer_1.1-2          ellipsis_0.3.2              pkgconfig_2.0.3            
 [61] XML_3.99-0.7                Gviz_1.36.2                 nnet_7.3-16                 dbplyr_2.1.1                locfit_1.5-9.4              utf8_1.2.2                 
 [67] tidyselect_1.1.1            rlang_0.4.11                AnnotationDbi_1.54.1        munsell_0.5.0               tools_4.1.1                 cachem_1.0.6               
 [73] cli_3.0.1                   generics_0.1.0              RSQLite_2.2.8               devtools_2.4.2              evaluate_0.14               fastmap_1.1.0              
 [79] yaml_2.2.1                  processx_3.5.2              knitr_1.34                  bit64_4.0.5                 fs_1.5.0                    purrr_0.3.4                
 [85] KEGGREST_1.32.0             AnnotationFilter_1.16.0     xml2_1.3.2                  biomaRt_2.49.3              BiocStyle_2.20.2            compiler_4.1.1             
 [91] rstudioapi_0.13             filelock_1.0.2              curl_4.3.2                  png_0.1-7                   testthat_3.0.4              tibble_3.1.4               
 [97] geneplotter_1.70.0          stringi_1.7.4               ps_1.6.0                    GenomicFeatures_1.44.2      desc_1.3.0                  lattice_0.20-44            
[103] ProtGenerics_1.24.0         Matrix_1.3-4                vctrs_0.3.8                 pillar_1.6.2                lifecycle_1.0.0             BiocManager_1.30.16        
[109] data.table_1.14.0           bitops_1.0-7                rtracklayer_1.52.1          GenomicRanges_1.44.0        R6_2.5.1                    BiocIO_1.2.0               
[115] latticeExtra_0.6-29         gridExtra_2.3               IRanges_2.26.0              sessioninfo_1.1.1           dichromat_2.0-0             assertthat_0.2.1           
[121] pkgload_1.2.2               SummarizedExperiment_1.22.0 DESeq2_1.32.0               rprojroot_2.0.2             rjson_0.2.20                withr_2.4.2                
[127] GenomicAlignments_1.28.0    Rsamtools_2.8.0             S4Vectors_0.30.0            GenomeInfoDbData_1.2.6      parallel_4.1.1              hms_1.1.0                  
[133] grid_4.1.1                  rpart_4.1-15                rmarkdown_2.11              MatrixGenerics_1.4.3        biovizBase_1.40.0           Biobase_2.52.0             
[139] base64enc_0.1-3             restfulr_0.0.13

The .bam files were generated using STAR software with the parameters below:

        STAR \
        --runMode alignReads \
        --genomeDir ./GenomeIndex_ENSEMBL_STAR_ReadLength$((${READ_LENGTH}-1)) \
        --readFilesIn ${name1} ${name2} \
        --outFileNamePrefix ${parentdir}.STAR. \
        --alignEndsType Local \
        --outSAMtype BAM SortedByCoordinate \
        --outFilterMultimapNmax 20 \
        --outFilterMismatchNmax 999 \
        --outFilterMismatchNoverLmax 0.04 \
        --alignIntronMin 70 \
        --alignIntronMax 1000000 \
        --alignMatesGapMax 1000000 \
        --alignSJoverhangMin 8 \
        --alignSJDBoverhangMin 1 \
        --outSAMstrandField intronMotif \
        --outFilterType BySJout \
        --twopassMode Basic \
        --outSAMmapqUnique 60 \
        --limitGenomeGenerateRAM ${Mem} \
        --runThreadN ${core} \
        --readFilesCommand zcat

sample information: paths to the BAM files: annotation:

GRanges object with 1141080 ranges and 20 metadata columns:
                    seqnames              ranges strand |   source            type     phase            gene_id gene_version gene_source   gene_biotype      transcript_id
                       <Rle>           <IRanges>  <Rle> | <factor>        <factor> <integer>        <character>  <character> <character>    <character>        <character>
        annot.1            1   67821954-67828966      - |  ensembl      gene            <NA> ENSOCUG00000038447            1     ensembl         lncRNA               <NA>
        annot.2            1   67821954-67828966      - |  ensembl      transcript      <NA> ENSOCUG00000038447            1     ensembl         lncRNA ENSOCUT00000060984
        annot.3            1   67828582-67828966      - |  ensembl      exon            <NA> ENSOCUG00000038447            1     ensembl         lncRNA ENSOCUT00000060984
        annot.4            1   67821954-67822812      - |  ensembl      exon            <NA> ENSOCUG00000038447            1     ensembl         lncRNA ENSOCUT00000060984
        annot.5            1 146132020-146132970      - |  ensembl      gene            <NA> ENSOCUG00000026335            1     ensembl protein_coding               <NA>
            ...          ...                 ...    ... .      ...             ...       ...                ...          ...         ...            ...                ...
  annot.1141076 AAGW02084006           2549-3515      + |  ensembl exon                 <NA> ENSOCUG00000022430            2     ensembl protein_coding ENSOCUT00000008112
  annot.1141077 AAGW02084006           2549-2745      + |  ensembl CDS                     2 ENSOCUG00000022430            2     ensembl protein_coding ENSOCUT00000008112
  annot.1141078 AAGW02084006           2746-2748      + |  ensembl stop_codon              0 ENSOCUG00000022430            2     ensembl protein_coding ENSOCUT00000008112
  annot.1141079 AAGW02084006           2339-2399      + |  ensembl five_prime_utr       <NA> ENSOCUG00000022430            2     ensembl protein_coding ENSOCUT00000008112
  annot.1141080 AAGW02084006           2749-3515      + |  ensembl three_prime_utr      <NA> ENSOCUG00000022430            2     ensembl protein_coding ENSOCUT00000008112
                transcript_version transcript_source transcript_biotype exon_number             exon_id exon_version         protein_id protein_version   gene_name transcript_name
                       <character>       <character>        <character> <character>         <character>  <character>        <character>     <character> <character>     <character>
        annot.1               <NA>              <NA>               <NA>        <NA>                <NA>         <NA>               <NA>            <NA>        <NA>            <NA>
        annot.2                  1           ensembl             lncRNA        <NA>                <NA>         <NA>               <NA>            <NA>        <NA>            <NA>
        annot.3                  1           ensembl             lncRNA           1 ENSOCUEE00000387222            1               <NA>            <NA>        <NA>            <NA>
        annot.4                  1           ensembl             lncRNA           2 ENSOCUEE00000365548            1               <NA>            <NA>        <NA>            <NA>
        annot.5               <NA>              <NA>               <NA>        <NA>                <NA>         <NA>               <NA>            <NA>        <NA>            <NA>
            ...                ...               ...                ...         ...                 ...          ...                ...             ...         ...             ...
  annot.1141076                  3           ensembl     protein_coding           2  ENSOCUE00000259531            2               <NA>            <NA>        <NA>            <NA>
  annot.1141077                  3           ensembl     protein_coding           2                <NA>         <NA> ENSOCUP00000035303               1        <NA>            <NA>
  annot.1141078                  3           ensembl     protein_coding           2                <NA>         <NA>               <NA>            <NA>        <NA>            <NA>
  annot.1141079                  3           ensembl     protein_coding        <NA>                <NA>         <NA>               <NA>            <NA>        <NA>            <NA>
  annot.1141080                  3           ensembl     protein_coding        <NA>                <NA>         <NA>               <NA>            <NA>        <NA>            <NA>
                projection_parent_transcript         tag
                                 <character> <character>
        annot.1                         <NA>        <NA>
        annot.2                         <NA>        <NA>
        annot.3                         <NA>        <NA>
        annot.4                         <NA>        <NA>
        annot.5                         <NA>        <NA>
            ...                          ...         ...
  annot.1141076                         <NA>        <NA>
  annot.1141077                         <NA>        <NA>
  annot.1141078                         <NA>        <NA>
  annot.1141079                         <NA>        <NA>
  annot.1141080                         <NA>        <NA>
  -------
  seqinfo: 1402 sequences from an unspecified genome; no seqlengths

Should I filter out some reads in my bam files to avoid such errors? If so, what reads should I remove? Or, should I change something when running the STAR software?

mzytnicki commented 2 years ago

@wong-ziyi Thanks for the issue!

I think I might understand the error.

The warning NAs introduced by coercion to integer range sometimes means that some number reached the maximum number size (2,147,483,648).

I will try to debug the code, but other issues may also be triggered later on. The easiest for me would be for you to share the bam files. Would that be possible?

wong-ziyi commented 2 years ago

@mzytnicki Thanks so much for your quick reply.

I attached two bam files below.

https://www.dropbox.com/sh/pc9ceapgg0kr9gc/AADgsHYnUZciiiNkqbTe55rKa?dl=0

Hope two bam files are enough for you because my dropbox space was already full.

P.S. the index file was generated using samtools index -b.

mzytnicki commented 2 years ago

@wong-ziyi Thanks! I will to debug it ASAP. I will keep you updated.

wong-ziyi commented 2 years ago

@mzytnicki I now just finished reading the article related to this software. I am sorry that I misunderstood the purpose of srnadiff. I was planned to use srnadiff for alternative splicing analysis. In my original thinking, srnadiff can output the differential expressed region, and then I can classify those regions into different types of alternative splicing according to their genome coordinate and the genome annotation. I think this error is due to my larger mRNA sequencing files than normal sRNA sequencing results, maybe?

Do you think the above workflow is suitable? If not, I am so sorry to bother you.

mzytnicki commented 2 years ago

@wong-ziyi Well, I did not try. However, I do not expect it to work, since the package was created with small RNA analysis in mind. I have never tried, but I would be glad if that worked!

That said, another package derfinder, was made for messenger RNA analysis. I do not know, however, if it works for alternative splicing...

mzytnicki commented 2 years ago

@wong-ziyi Anyway, if you want to give srnadiff a try, you can.

library(devtools)
install_github("mzytnicki/srnadiff")

Does that work?

mzytnicki / srnadiff

Error in rcpp_normalization #1