melisaolave / SNPs2CF

An R function to compute Concordance Factors from SNP datasets
GNU General Public License v3.0

vcf2phylip - segfault & output #6

Closed plhm closed 1 year ago

plhm commented 2 years ago

Hello there Dr. Olave,

Thanks much for SNPs2CF. I'm looking forward to being able to use SNP data from reduced representation sequencing in a network approach.

Dr. Olave, I'm having what appear to be issues both in vcf2phylip and potentially in SNPs2CF, so I'll open two issues.

First, with vcf2phylip. I'm running the function under R 4.0.3 (I also tried v4.1 with the same problem) on a Linux system.

I'd like to point out what appears to be an issue with pegas' VCF reader function. I'd be interested to know whether you've seen the same, and whether you've considered other packages for importing VCF files. On my OS it very often threw a segmentation-fault error, and I had to run the function multiple times before it eventually worked; it was very much a trial-and-error approach. Here is one example of a failed run:

```r
workPath <- "~/work/Programs/"
source(paste(workPath, "SNPs2CF/functions_v1.5.R", sep = "/"))
phylipFile <- gsub(".vcf", ".phy", vcfFile)
vcf2phylip(vcf.name = vcfFile, output.name = phylipFile, total.SNPs = nLoci)
```

```
File apparently not yet accessed:
Scanning file ddRAD_concat_distichus_minQ20minDP10maxDP284mac3geno95ind25_SNPs.ldpruned.vcf
1.304754 / 1.304754 Mb
Done.
Reading 500 / 2425 loci

caught segfault
address 0x4fbf000, cause 'memory not mapped'

Traceback:
 1: read.vcf(vcf.name, to = total.SNPs)
 2: vcf2phylip(vcf.name = vcfFile, output.name = phylipFile, total.SNPs = nLoci)
```

The same commands in a separate R session worked just fine.
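In case it is useful to others, a possible workaround I have been considering is to read the VCF in smaller batches via `read.vcf()`'s `from`/`to` arguments, so no single call has to hold the whole file at once. This is only a sketch, assuming the `pegas` package is available; the file name and chunk size below are placeholders, not values from my actual run:

```r
## Sketch: read a VCF in batches with pegas::read.vcf(from, to), then
## combine the pieces, instead of one large read.vcf() call.
## "example.vcf" and the chunk size are placeholders.
library(pegas)

vcfFile <- "example.vcf"
info  <- VCFloci(vcfFile)        # one row of metadata per SNP
nLoci <- nrow(info)
chunk <- 500

starts <- seq(1, nLoci, by = chunk)
pieces <- lapply(starts, function(s) {
  read.vcf(vcfFile, from = s, to = min(s + chunk - 1, nLoci))
})
geno <- do.call(cbind, pieces)   # pegas provides cbind() for 'loci' objects
```

I don't know whether this avoids the segfault, but it at least caps the size of each individual read.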

The second issue is related to the output of vcf2phylip. As input I'm using VCF files produced by plink after linkage pruning. For half of my input VCF files (I'm using different filtering schemes), vcf2phylip appears to run just fine. However, for the other filtering schemes, vcf2phylip produces an odd string of characters, which understandably causes SNPs2CF to crash.

Here is a piece of that string:

```
tail -n2 ddRAD_concat_distichus_minQ20minDP10maxDP284mac3geno95ind25_SNPs.ldpruned.phy | cut -c700-1000
▒▒NA▒▒?▒w▒}Z6:▒NA▒▒M▒8NAd▒O▒▒NAZs▒▒▒!▒▒▒C▒▒{q▒y▒fu▒bi▒▒͓Ԇల▒^▒▒!▒▒o▒▒▒TTCAGTTTTTGAGTCCGCCAGTTCCAACCCCGGCGAGGCCAGAAATGCTCGTACAAGCCCTCTCGCGGCCGTTAAACCTCTGGCGATCTATGGTCCCGAGGGCACGCCCACCGCTCCTACGTCGCACCGCCAGAGAGGCGCACGGGATAACTATTGGTGTGCCACT?GTTAGCCGACCGTAGGGGCGATTGCCATGGAACCCCGCACCCGGACCAGACCACCT NANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANANATTCAGTTTTTGAGTCCGCCAGTTCCAACCCCGGCGAGGCCAGAAATGCTCGTACAAGCCCTCTCGCGGCCGTTAAACCTCTGGCGATCTATGGTCCCGAGGGC
```

Here is that same region of the line for the vcf column with the specimen:

```
awk '{print $NF}' ddRAD_concat_distichus_minQ20minDP10maxDP284mac3geno95ind25_SNPs.ldpruned.vcf | head -n 1000 | tail -n 300 | sed -z "s/\n/ /g"
0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0
```
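To help narrow down which specimens are affected, a quick base-R sanity check can report which rows of the .phy file contain characters outside the usual nucleotide/IUPAC alphabet. This is only a sketch; the function name and file are placeholders, not part of vcf2phylip:

```r
## Sketch (base R only, names are placeholders): flag taxon rows of a
## sequential phylip file whose sequence contains characters outside the
## nucleotide/IUPAC alphabet, like the garbled output above.
check_phylip <- function(phy.file) {
  lines <- readLines(phy.file, warn = FALSE)[-1]  # drop "ntaxa nchar" header
  seqs  <- sub("^\\S+\\s+", "", lines)            # strip the taxon name
  bad   <- grepl("[^ACGTRYSWKMBDHVNacgtryswkmbdhvn?[:space:]-]", seqs)
  which(bad)                                      # indices of corrupted rows
}
```

Running it on each filtering scheme's .phy file should point directly at the offending specimen(s).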

I'm happy to share the files if you'd want to try to replicate the problem.

Thanks much.

P

melisaolave commented 2 years ago

Hi Pietro,

1) It looks like you are running out of RAM. It is possible that some of your filters generate a very large dataset that needs more RAM than the others. First, make sure not to stress your computer too much (close all other apps and run a single R session with this program alone). If that still doesn't work, you might need a machine with more resources, or to request extra RAM if you are using a cluster.

2) I need to take a look at your plink file, otherwise I cannot help. Feel free to send me the file that returns an error.

Good luck!

Melisa

plhm commented 2 years ago

Hi Melisa!

  1. I did some Googling, and I agree that this issue typically appears to be associated with RAM, but I assigned up to 60 GB of RAM to the machine (I'm running in a cluster environment) and it still crashed... I eventually got it to work, but as I said, there was no apparent pattern or reason; I just reran the exact same command under the same conditions until it worked.
  2. Awesome! To which e-mail should I send the file?

I got SNPs2CF running without much of a problem once I deleted the specimen that was giving me the weird characters. That's not ideal, but it has allowed me to move forward for now.
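For anyone who needs the same stop-gap, removing a specimen can be scripted rather than done by hand. A minimal base-R sketch (the function, file, and taxon names are hypothetical) that drops one taxon from a sequential phylip file and fixes the taxon count in the header:

```r
## Sketch (names are placeholders): remove one taxon from a sequential
## phylip file and update the taxon count in the header line.
drop_taxon <- function(phy.in, phy.out, taxon) {
  lines  <- readLines(phy.in, warn = FALSE)
  header <- strsplit(trimws(lines[1]), "\\s+")[[1]]      # c(ntaxa, nchar)
  body   <- lines[-1]
  keep   <- !grepl(paste0("^", taxon, "(\\s|$)"), body)  # drop matching row
  writeLines(c(paste(sum(keep), header[2]), body[keep]), phy.out)
}
```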

Best,

P

melisaolave commented 2 years ago

1) 60 GB is way too much; a single core can't use that much, so even when you specify 60 GB, it doesn't mean all of it will actually be used. I cannot help much here, but you might want to talk to your IT staff about the best settings for your specific cluster (e.g. selecting specific newer-generation cores, etc).

2) Please send it to molave@mendoza-conicet.gob.ar

All the best,

Melisa