Closed lurens closed 1 year ago
Hi @lurens, that should be more than enough RAM. Could you share the commands you used? It would be quite helpful. Also, roughly how many variants are reported per sample in the first run, and in the second re-genotyping run?
This is the command:
dysgu merge -v 2 *_re_geno.vcf.gz > merged_re_geno.vcf
The total number of variants in the second run is about 44,000.
Would you mind sharing the earlier commands in the pipeline? It would help me rule out some potential issues. Thanks
for i in $(cat $lista); do
  dysgu run -p 40 --mq 20 -v 2 --max-cov auto --exclude $CH0 $GENOME \
    $WD/tmp_"$i" $BAM/"$i"_V4.1_dp.bam > $WD/"$i".vcf
done
Then we merged:
dysgu merge -v 2 $WD/*.vcf > $MERGED/merged.vcf
Then we re-genotyped:
for i in $(cat $lista); do
  dysgu run -p 40 -v 2 --max-cov auto \
    --sites $MERGED/merged.vcf \
    --exclude $CH0 $GENOME $RE_GENO/tmp_"$i" $BAM/"$i"_V4.1_dp.bam \
    > $RE_GENO/"$i"_re_geno.vcf
done
And finally we bgzipped the VCFs:
for i in $(cat $lista); do
  bgzip $RE_GENO/"$i"_re_geno.vcf
done
Maybe this is the issue? I mean, working with bgzipped VCFs? Thanks
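As an aside, the per-chromosome splitting approach discussed in this thread needs the compressed VCFs to be indexed as well. A hedged sketch of a compress-and-index variant of that loop (it assumes the `$lista` and `$RE_GENO` variables from the commands above, and that `tabix` from htslib is installed):

```shell
# Sketch: compress AND index each re-genotyped VCF.
# $lista (file listing sample names) and $RE_GENO are assumed from the thread;
# ${lista:-/dev/null} is a guard so the sketch is harmless if $lista is unset.
for i in $(cat "${lista:-/dev/null}"); do
  bgzip -f "$RE_GENO/${i}_re_geno.vcf"         # compress in place -> .vcf.gz
  tabix -p vcf "$RE_GENO/${i}_re_geno.vcf.gz"  # create a .tbi index for bcftools
done
```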
Thanks for the info. I think bgzipped data should be fine; it's probably not that, although I guess you could test it? Let me have a look over the source code and see if I can spot any issues.
Do you get any output in the log file for the final merge?
Not at all.
You don't have a line that says "Merge distance: " ... ?
Yes, but then I got: Killed
I mean, it ran for 2 days and then it was killed.
I think I will have to run some tests on my end to see if I can reproduce the high memory usage on the second merge. I'm not sure what is causing it; the first merge should be about the same amount of work as the second merge. Do you have an idea how much memory the first merge used? For example, `sacct` can help if you ran using Slurm.
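For reference, a sketch of such a `sacct` query (the job id 12345 is a placeholder; MaxRSS reports each step's peak resident memory):

```shell
# Report elapsed time and peak memory (MaxRSS) for a finished Slurm job.
# 12345 is a placeholder job id; replace it with the real one.
if command -v sacct >/dev/null 2>&1; then
  sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS
else
  echo "sacct not available (not a Slurm system)"
fi
```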
Hi @lurens, I'm still unsure what's causing the memory issue. I haven't been able to reproduce the very high memory usage yet, although I will keep testing. In the meantime, you can split the VCFs by chromosome and try merging them individually; this should lower memory usage, and it might help identify any problem chromosomes that are causing the high memory usage. Here is a sample bash script to help (the VCFs will need to be bgzipped and indexed):
mkdir -p split
for item in ./*.vcf.gz; do
  echo "$item"
  name=$(basename "${item}" .vcf.gz)
  bcftools index -s "${item}" | cut -f1 | while read -r C; do
    bcftools view -O z -o split/${name}.split.${C}.vcf.gz "${item}" "${C}"
  done
done
mkdir -p merged
concat_str=""
for chrom in $(cut -f1 ~/Documents/Data/db/hg19/ucsc.hg19.fasta.fai); do
  if ls split/*.split.${chrom}.*vcf.gz 1> /dev/null 2>&1; then
    dysgu merge split/*.split.${chrom}.*vcf.gz > merged/${chrom}.vcf
    concat_str+="merged/${chrom}.vcf "
  fi
done
bcftools concat ${concat_str} > all_merged.vcf
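After the concat step it can be worth a quick sanity check that every chromosome made it into all_merged.vcf. A coreutils-only way is to count records per chromosome; the snippet below demonstrates it on a tiny mock VCF (demo.vcf stands in for the real file):

```shell
# Count VCF records per chromosome using only coreutils.
# demo.vcf is a mock stand-in for all_merged.vcf.
printf '##fileformat=VCFv4.2\n' > demo.vcf
printf 'chr1\t100\tid1\tA\t<DEL>\t.\tPASS\t.\n' >> demo.vcf
printf 'chr1\t200\tid2\tT\t<DUP>\t.\tPASS\t.\n' >> demo.vcf
printf 'chr2\t300\tid3\tG\t<INV>\t.\tPASS\t.\n' >> demo.vcf
# skip header lines, take the CHROM column, tally occurrences
grep -v '^#' demo.vcf | cut -f1 | sort | uniq -c
rm demo.vcf
```

Empty or missing chromosomes in the tally would point at a failed per-chromosome merge.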
Hi, splitting by chromosome worked fine. Many thanks
Hello lurens, would you mind sharing your merging experience? I am also using plant 30x data, but I did not get the results I wanted, either by splitting chromosomes or by splitting by variant type, and in my case the merge did not take up a lot of memory.
Hi, I am working with plant WGS 20x data from more than 300 accessions. First I call SVs per sample with dysgu run, then combine them with dysgu merge, and then re-genotype each sample against the merged sites. When I merge the re-genotyped samples I run into a RAM issue (basically 500 GB of RAM is not enough). What I find really weird is that the first merge ran fine with no problem. Any idea how I can manage this issue? Cheers