hitaandrea / MGcount

MGcount github repository
GNU General Public License v3.0
14 stars 9 forks source link

“Error in IRanges:::find_overlaps: 'maxgap' must be a single integer” when running integrate_gtf_annotations.R #4

Closed samuelruizperez closed 1 year ago

samuelruizperez commented 1 year ago

Hello!

Thank you for developing this tool. 😀

I have been trying to use the integrate_gtf_annotations.R script to unify the most recent annotations for the mouse genome:

https://ftp.ensembl.org/pub/current_gtf/mus_musculus/Mus_musculus.GRCm39.108.gtf.gz https://mirbase.org/ftp/CURRENT/genomes/mmu.gff3 https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/genome_coordinates/gff3/mus_musculus.GRCm39.gff3.gz

But I got this error:

Error in IRanges:::find_overlaps_in_groups_NCList(ranges(query), q_space,  : 
  'maxgap' must be a single integer
Calls: <Anonymous> ... <Anonymous> -> .local -> findOverlaps_GNCList -> <Anonymous>
Execution halted

I believe the issue comes from the following lines of integrate_gtf_annotations.R, because transcripts@ranges@width is empty and round(median(transcripts@ranges@width)/2) is therefore not a valid integer:

## Find overlapping piRNA transcripts
transcripts <- gff.pi[gff.pi@elementMetadata$type == "transcript",]
ov <- IRanges::findOverlaps(transcripts, transcripts, type = "any", round(median(transcripts@ranges@width)/2))

The reason transcripts@ranges@width (which comes from gff.pi) ends up empty is because gff.rcent does not have a column named "type.1" when it is subset: gff.pi <- gff.rcent[gff.rcent@elementMetadata$type.1 == "piRNA",].

gff.rcent instead has two columns with the duplicated name "type".

print(gff.rcent) output:

GRanges object with 2252551 ranges and 12 metadata columns:
            seqnames            ranges strand |     source           type
               <Rle>         <IRanges>  <Rle> |   <factor>       <factor>
        [1]        1   3056358-3056384      - | RNAcentral transcript    
        [2]        1   3056358-3056384      - | RNAcentral noncoding_exon
        [3]        1   3056360-3056384      - | RNAcentral transcript    
        [4]        1   3056360-3056384      - | RNAcentral noncoding_exon
        [5]        1   3056361-3056384      - | RNAcentral transcript    
        ...      ...               ...    ... .        ...            ...
                score     phase                Name        type       databases
            <numeric> <integer>         <character> <character> <CharacterList>
        [1]        NA      <NA> URS0000253E70_10090       piRNA     ENA,PirBase
        [2]        NA      <NA> URS0000253E70_10090       piRNA     ENA,PirBase
        [3]        NA      <NA> URS000042B783_10090       piRNA     ENA,PirBase
        [4]        NA      <NA> URS000042B783_10090       piRNA     ENA,PirBase
        [5]        NA      <NA> URS00004DC026_10090       piRNA     ENA,PirBase
        ...       ...       ...                 ...         ...             ...
                                ID      source                 Parent
                       <character> <character>        <CharacterList>
        [1]  URS0000253E70_10090.0   alignment                       
        [2] URS0000253E70_10090...        <NA>  URS0000253E70_10090.0
        [3]  URS000042B783_10090.1   alignment                       
        [4] URS000042B783_10090...        <NA>  URS000042B783_10090.1
        [5]  URS00004DC026_10090.2   alignment                       
        ...                    ...         ...                    ...
                identity providing_databases
            <character>         <character>
        [1]        <NA>                <NA>
        [2]        <NA>                <NA>
        [3]        <NA>                <NA>
        [4]        <NA>                <NA>
        [5]        <NA>                <NA>
        ...         ...                 ...
  -------
  seqinfo: 59 sequences from an unspecified genome; no seqlengths

So I guess the root of the problem lies in the way the RNAcentral database is imported:

gff.rcent <- rtracklayer::import(paste0(db.dir,"rnacentral_mus_musculus.GRCm39.gff3.gz"))

It should be making the column names of the IRanges object unique by appending .1 to the second instance, but it is not doing so. I don't know much about how these functions work, so I fixed it for my purposes by adding this line after the import:

gff.rcent <- rtracklayer::import(paste0(db.dir,"rnacentral_mus_musculus.GRCm39.gff3.gz"))
colnames(gff.rcent@elementMetadata)[colnames(gff.rcent@elementMetadata)=="type"] <- c("type","type.1")

There is probably a better or more elegant solution, though. It could also be an issue with my R or R packages versions, but I am not sure.

Finally, after adding that line to the script, I encountered another error:

Error in loadNamespace(x) : there is no package called ‘igraph’
Calls: loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart
Execution halted

So, maybe igraph should be added to the requirements at the beginning of the script.

## Import libraries (these packages are required for the gtf integration)
library(Hmisc)
library(parallel)
library(plyr)
library(dplyr)
library(rtracklayer)
library(igraph)

Hope this helps.

Thank you, Samuel


Session details:

integrate_gtf_annotations.R from MGcount v1.0.2

sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-conda-linux-gnu (64-bit) - mamba 1.1.0 - conda 22.9.0
Running under: Ubuntu 18.04.6 LTS

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] igraph_1.3.1         rtracklayer_1.58.0   GenomicRanges_1.50.0
 [4] GenomeInfoDb_1.34.1  IRanges_2.32.0       S4Vectors_0.36.0    
 [7] BiocGenerics_0.44.0  dplyr_1.0.9          plyr_1.8.7          
[10] Hmisc_4.7-0          ggplot2_3.3.6        Formula_1.2-4       
[13] survival_3.3-1       lattice_0.20-45     

loaded via a namespace (and not attached):
 [1] Biobase_2.58.0              MatrixGenerics_1.10.0      
 [3] splines_4.2.0               latticeExtra_0.6-29        
 [5] GenomeInfoDbData_1.2.9      Rsamtools_2.14.0           
 [7] yaml_2.3.5                  pillar_1.7.0               
 [9] backports_1.4.1             glue_1.6.2                 
[11] digest_0.6.29               RColorBrewer_1.1-3         
[13] XVector_0.38.0              checkmate_2.1.0            
[15] colorspace_2.0-3            htmltools_0.5.2            
[17] Matrix_1.4-1                XML_3.99-0.9               
[19] pkgconfig_2.0.3             zlibbioc_1.44.0            
[21] purrr_0.3.4                 scales_1.2.0               
[23] jpeg_0.1-9                  BiocParallel_1.32.5        
[25] htmlTable_2.4.1             tibble_3.1.7               
[27] generics_0.1.2              ellipsis_0.3.2             
[29] withr_2.5.0                 SummarizedExperiment_1.28.0
[31] nnet_7.3-17                 cli_3.3.0                  
[33] magrittr_2.0.3              crayon_1.5.1               
[35] fansi_1.0.3                 foreign_0.8-82             
[37] tools_4.2.0                 data.table_1.14.2          
[39] matrixStats_0.62.0          BiocIO_1.8.0               
[41] lifecycle_1.0.1             stringr_1.4.0              
[43] munsell_0.5.0               DelayedArray_0.24.0        
[45] cluster_2.1.3               Biostrings_2.66.0          
[47] compiler_4.2.0              rlang_1.0.2                
[49] grid_4.2.0                  RCurl_1.98-1.6             
[51] rstudioapi_0.13             rjson_0.2.21               
[53] htmlwidgets_1.5.4           bitops_1.0-7               
[55] base64enc_0.1-3             restfulr_0.0.15            
[57] gtable_0.3.0                codetools_0.2-18           
[59] R6_2.5.1                    GenomicAlignments_1.34.0   
[61] gridExtra_2.3               knitr_1.39                 
[63] fastmap_1.1.0               utf8_1.2.2                 
[65] stringi_1.7.6               Rcpp_1.0.8.3               
[67] vctrs_0.4.1                 rpart_4.1.16               
[69] png_0.1-7                   tidyselect_1.1.2           
[71] xfun_0.31
hitaandrea commented 1 year ago

Hello Samuel,

Thanks for sharing your experience with this. I am not sure why the rtracklayer import function does not append a ".1" in your case. I just imported RNAcentral mus_musculus.GRCm39.gff3.gz annotations and I get the .1. In any case, I believe the way you solved the problem by renaming the columns is totally fine.

Indeed, the script for integrating annotations depends on "igraph". The library was not loaded at the beginning since the required function is called with the "::" namespace syntax on the script. However, igraph needs to be installed. I will add the library import in the next update as this will be more clear. Thanks for pointing it out.