Closed tirohia closed 4 years ago
Hi @tirohia, thanks for the detailed report! Segfaults typically happen on the compiled code side of things. And as you've experienced, they are usually pretty tragic. Could you share a file with me (as small as possible) so that I can try to reproduce this on my machine? That's really the best option we have for me to try to address this. With the caveat that if you work with human data, you should not share it. Thanks! Brian
Given the issue in #96 I initially thought it might have been formatting, but yeah.
Happy to send you a file. I'll flick a link to it to the email address I've found via your website, if that's alright, rather than posting it to git.
Ben.
In a private conversation it was determined that this was due to a malformed VCF file.
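The exact malformation isn't recorded in this thread. For anyone hitting a similar segfault, here is a rough sanity check in base R, assuming (and this is only an assumption) that the problem is structural, such as records whose number of tab-separated fields differs from the `#CHROM` header line. `check_vcf_fields` is a hypothetical helper, not part of vcfR, and it expects a plain-text VCF.

```r
# Hypothetical helper, not part of vcfR: returns the line numbers of records
# whose tab-separated field count differs from the #CHROM header line.
check_vcf_fields <- function(infile) {
  lines <- readLines(infile)

  # Count fields as (number of tabs + 1), so empty trailing fields still count.
  count_fields <- function(x) {
    lengths(regmatches(x, gregexpr("\t", x, fixed = TRUE))) + 1L
  }

  header_idx <- which(startsWith(lines, "#CHROM"))
  if (length(header_idx) != 1) {
    stop("expected exactly one #CHROM header line")
  }
  n_expected <- count_fields(lines[header_idx])

  # Records are all non-empty lines that do not start with '#'.
  record_idx <- which(!startsWith(lines, "#") & nzchar(lines))
  record_idx[count_fields(lines[record_idx]) != n_expected]
}
```

An empty result only means the column counts are consistent; it does not rule out other kinds of malformation.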
A different variation on the segfault errors suggested here and here maybe?
I have an annotated VCF file that I am attempting to convert to a tidy format using vcfR2tidy, so that I can filter easily on a number of different fields, both from the standard VCF columns and from the annotations.
When I attempt to convert it, I get:
Which then sometimes (but not always) dumps me completely out of R. I've updated my installation of vcfR from the github master branch using
devtools::install_github(repo="knausb/vcfR")
I'll paste my sessionInfo in at the bottom.
The VCF has been pre-filtered (using the VariantAnnotation package) on one of the annotation fields, to reduce it to ~71,000 variants. When I try to use it unfiltered (with or without annotations) it hangs, probably because at that stage of the proceedings there are ~400,000 variants.
In attempting to figure out if it was a size issue, I made a minimal example, with ~100 variants in it. It worked, so I'm guessing it's not a formatting thing, which made me suspect a memory problem. However, I'm doing this on the landing node of a cluster so -
As I increase the size of the minimal example, it falls over on anything over ~2500 variants. On a different machine (my laptop) it falls over with anything over ~4500 variants though, and that has much less memory -
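For anyone wanting to repeat this size test, here is a minimal sketch of one way to cut a VCF down to its first n variants in base R. This is not necessarily how the examples above were made; `subset_vcf` is a hypothetical helper and it assumes a plain-text VCF.

```r
# Hypothetical helper: write the header lines plus the first n variant
# records of a VCF to a new file, and return how many records were kept.
subset_vcf <- function(infile, outfile, n) {
  lines <- readLines(infile)
  is_header <- startsWith(lines, "#")   # '##' meta lines and the '#CHROM' line
  records <- lines[!is_header]
  writeLines(c(lines[is_header], head(records, n)), outfile)
  invisible(min(n, length(records)))
}
```

The resulting file can then be read with `read.vcfR()` and passed to `vcfR2tidy()`, stepping n up or down to find where the conversion starts to fail.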
So I have no idea what it might be. It doesn't appear to be formatting (works with the minimal example), doesn't appear to be memory (works better on a smaller machine) and doesn't appear to be specific to a particular location within the file (can do ~2500 or ~4500 on different machines). I even checked the page on [memory usage](https://knausb.github.io/vcfR_documentation) and I should easily have sufficient memory for 10^7ish variants.
Any suggestions would be much appreciated - getting VCFs into a tidy format is a bit of a holy grail at the moment.
Ben.