knausb / vcfR

Tools to work with variant call format files

Segfault - memory not mapped #147

Closed tirohia closed 4 years ago

tirohia commented 5 years ago

A different variation on the segfault errors suggested here and here maybe?

I have an annotated VCF file that I am attempting to convert to a tidy format using vcfR2tidy, so that I can filter easily on a number of different fields, both the standard VCF fields and the annotation fields.

When I attempt to convert it, I get:

vcf = read.vcfR( "minimal.vcf" , verbose = FALSE )
vcfT = vcfR2tidy(vcf, format_fields = c("GT", "DP"))
*** caught segfault *** address 0x50, cause 'memory not mapped'

This sometimes (but not always) dumps me completely out of R. I've updated my installation of vcfR from the GitHub master branch using devtools::install_github(repo="knausb/vcfR").

I'll paste my sessionInfo in at the bottom.

The VCF has been pre-filtered (using the VariantAnnotation package) on one of the annotation fields, to reduce it to ~71,000 variants. When I try to use it unfiltered (with or without annotations) it hangs, probably because at that stage of the proceedings there are ~400,000 variants.

In attempting to figure out if it was a size issue, I made a minimal example with ~100 variants in it. It worked, so I'm guessing it's not a formatting thing, which made me suspect a memory problem. However, I'm doing this on the landing node of a cluster, so -

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           503G        122G        349G        4.2G         32G        374G
Swap:           15G          0B         15G

As I increase the size of the minimal example, it falls over on anything over ~2500 variants. On a different machine (my laptop) it falls over with anything over ~4500 variants though, and that has much less memory -

$ free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        8.5G        1.5G        1.5G        5.5G        5.2G
Swap:           15G        3.5G         12G

So I have no idea what it might be. It doesn't appear to be formatting (works with the minimal example), doesn't appear to be memory (works better on a smaller machine) and doesn't appear to be specific to a particular location within the file (can do ~2500 or ~4500 variants on different machines). I even checked the page on [memory usage](https://knausb.github.io/vcfR_documentation) and I should easily have sufficient memory for roughly 10^7 variants.
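The subsetting described above (growing the minimal example until the conversion falls over) could be scripted roughly as follows. This is only a sketch in base R: `write_first_variants` is a hypothetical helper, not part of vcfR, and it assumes an uncompressed or gzipped VCF small enough to read into memory with `readLines`.

```r
# Hypothetical helper (not part of vcfR): copy all header lines plus the
# first `n` variant lines of a VCF into a new file.
write_first_variants <- function(path, out, n) {
  lines <- readLines(path)
  is_header <- startsWith(lines, "#")   # both "##" meta lines and "#CHROM"
  body <- lines[!is_header]
  writeLines(c(lines[is_header], head(body, n)), out)
  invisible(out)
}

# Example use (file names are placeholders):
# write_first_variants("filtered.vcf", "first_2500.vcf", 2500)
# vcf  <- vcfR::read.vcfR("first_2500.vcf", verbose = FALSE)
# vcfT <- vcfR::vcfR2tidy(vcf, format_fields = c("GT", "DP"))
```

Because a segfault takes down the R process instead of raising an R error, `tryCatch` will not catch it, so each subset size is best tried in a fresh R session.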

Any suggestions would be much appreciated - getting VCFs into a tidy format is a bit of a holy grail at the moment.

Ben.


> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /scale_wlg_persistent/filesets/opt_nesi/CS400_centos7_bdw/imkl/2018.4.274-gimpi-2018b/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_NZ.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_NZ.UTF-8        LC_COLLATE=en_NZ.UTF-8    
 [5] LC_MONETARY=en_NZ.UTF-8    LC_MESSAGES=en_NZ.UTF-8   
 [7] LC_PAPER=en_NZ.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] vcfR_1.8.0.9000             VariantAnnotation_1.28.13  
 [3] Rsamtools_1.34.1            Biostrings_2.50.2          
 [5] XVector_0.22.0              SummarizedExperiment_1.12.0
 [7] DelayedArray_0.8.0          BiocParallel_1.16.6        
 [9] matrixStats_0.54.0          Biobase_2.42.0             
[11] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[13] IRanges_2.16.0              S4Vectors_0.20.1           
[15] BiocGenerics_0.28.0         forcats_0.4.0              
[17] stringr_1.4.0               dplyr_0.8.3                
[19] purrr_0.3.3                 readr_1.3.1                
[21] tidyr_0.8.3                 tibble_2.1.3               
[23] ggplot2_3.1.1               tidyverse_1.2.1

knausb commented 5 years ago

Hi @tirohia, thanks for the detailed report! Segfaults typically happen on the compiled code side of things. And as you've experienced, they are usually pretty tragic. Could you share with me a file (as small as possible) so that I can try to reproduce this on my machine? That's really the best option we have for me to try to address this, with the caveat that if you work with human data you should not share it. Thanks! Brian

tirohia commented 5 years ago

Given the issue in #96 I initially thought it might have been formatting, but yeah.

Happy to send you a file. I'll flick a link to it to the email address I've found via your website, if that's alright, rather than posting it to git.

Ben.

knausb commented 4 years ago

In a private conversation it was determined that this was due to a malformed VCF file.
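The thread does not say what exactly was malformed. As one example of a structural check that can be run before `read.vcfR`, the base R sketch below flags variant lines whose number of tab-separated columns differs from the `#CHROM` header line. `check_vcf_columns` is a hypothetical helper, not part of vcfR, and a mismatched column count is only one of several ways a VCF can be malformed; it is not confirmed to be the cause here.

```r
# Hypothetical helper (not part of vcfR): report variant lines whose column
# count does not match the #CHROM header line.
check_vcf_columns <- function(path) {
  lines <- readLines(path)

  header_idx <- grep("^#CHROM", lines)
  if (length(header_idx) != 1) {
    stop("expected exactly one #CHROM header line, found ", length(header_idx))
  }

  # Count fields as (number of tabs + 1) so trailing empty fields are counted.
  count_fields <- function(x) {
    nchar(gsub("[^\t]", "", x, useBytes = TRUE), type = "bytes") + 1L
  }

  n_expected <- count_fields(lines[header_idx])
  body_idx <- which(!startsWith(lines, "#") & nzchar(lines))
  n_found <- count_fields(lines[body_idx])

  mismatch <- n_found != n_expected
  # One row per offending line: its line number in the file and its field count.
  data.frame(line = body_idx[mismatch], n_fields = n_found[mismatch])
}

# Example use (file name is a placeholder):
# check_vcf_columns("minimal.vcf")
```

An empty result means every variant line has as many columns as the header; it does not rule out other problems, such as malformed INFO or FORMAT entries.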