kcleal / dysgu

Toolkit for calling structural variants using short or long reads
MIT License

RAM Issue - merging large cohorts #50

Closed lurens closed 1 year ago

lurens commented 1 year ago

Hi, I am working with plant WGS 20X data from more than 300 accessions. First I call SVs with dysgu run, then I combine the per-sample calls with dysgu merge and re-genotype at the sample level. When I try to merge the re-genotyped samples I run into a RAM issue (basically 500 GB of RAM is not enough). What I find really weird is that the first merge finished with no problem. Any idea how I can manage this issue? Cheers

kcleal commented 1 year ago

Hi @lurens, That should be more than enough RAM. Could you share the commands you used - it would be quite helpful. Also, roughly how many variants are reported per sample in the first run, and in the second re-genotype run?

lurens commented 1 year ago

This is the command:

dysgu merge -v 2 *_re_geno.vcf.gz > merged_re_geno.vcf

The total number of variants in the second run is about 44,000.
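For reference, per-sample counts from the bgzipped VCFs can be checked with something like the following - just a sketch using standard command-line tools, not a dysgu command:

    # count non-header records in each re-genotyped VCF
    for f in *_re_geno.vcf.gz; do
        printf '%s\t%s\n' "$f" "$(zcat "$f" | grep -vc '^#')"
    done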

kcleal commented 1 year ago

Would you mind sharing the earlier commands in the pipeline? It would help me rule out some potential issues. Thanks

lurens commented 1 year ago

Sure. First, the per-sample calling:

    for i in $(cat $lista); \
        do dysgu run -p 40 --mq 20 -v 2 --max-cov auto --exclude $CH0 $GENOME \
            $WD/tmp_"$i" $BAM/"$i"_V4.1_dp.bam > $WD/"$i".vcf; done

Then merge

dysgu merge -v 2 $WD/*.vcf > $MERGED/merged.vcf

Then re-genotyped

    for i in $(cat $lista); 
       do   dysgu run -p 40 -v 2 --max-cov auto \
                --sites $MERGED/merged.vcf \
                --exclude $CH0 $GENOME $RE_GENO/tmp_"$i" $BAM/"$i"_V4.1_dp.bam \
               > $RE_GENO/"$i"_re_geno.vcf; done

And finally we bgzipped the VCFs:

    for i in $(cat $lista); \
        do bgzip $RE_GENO/"$i"_re_geno.vcf; done

Maybe this is the issue, i.e. working with bgzipped VCFs? Thanks

kcleal commented 1 year ago

Thanks for the info. I think bgzipped data should be fine, it's probably not that - although I guess you could test it? Let me have a look over the source code and see if I can spot any issues.

kcleal commented 1 year ago

Do you get any output in the log file for the final merge?

lurens commented 1 year ago

Not at all.

kcleal commented 1 year ago

You don't have a line that says "Merge distance: " ... ?

lurens commented 1 year ago

Yes but then I got: Killed

lurens commented 1 year ago

I mean it ran for 2 days and then it was killed.

kcleal commented 1 year ago

I think I will have to run some tests on my end to see if I can reproduce the high memory usage on the second merge. I'm not sure what is causing it - the first merge should be about the same amount of work as the second merge. Do you have an idea how much memory the first merge used? For example, `sacct` can help if you ran the jobs using Slurm.
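For instance, something like this would report the peak memory of a finished job (the job ID is a placeholder for your actual Slurm job):

    # MaxRSS is the peak resident memory recorded by Slurm accounting
    sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State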

kcleal commented 1 year ago

Hi @lurens, I'm still unsure what's causing the memory issue - I haven't been able to reproduce the very high memory usage yet, although I will keep testing. In the meantime, you can split the VCFs by chromosome and merge them individually; this should lower memory usage, and it might help identify any problem chromosomes that are causing the high memory use. Here is a sample bash script to help (the VCFs will need to be bgzipped and indexed):

mkdir -p split
for item in ./*.vcf.gz; do
  echo $item
  name=$(basename ${item} .vcf.gz)
  bcftools index -s ${item} | \
  cut -f 1 | \
  while read C;
    do bcftools view -O z -o split/${name}.split.${C}.vcf.gz $item "${C}" ;
    done
  done

mkdir -p merged
concat_str=""
for chrom in $(cat ~/Documents/Data/db/hg19/ucsc.hg19.fasta.fai | cut -f1); do
  if ls split/*.split.${chrom}.*vcf.gz 1> /dev/null 2>&1; then
      dysgu merge split/*.split.${chrom}.*vcf.gz > merged/${chrom}.vcf;
      concat_str+="merged/${chrom}.vcf "
  else
      continue
  fi
  done;

bcftools concat ${concat_str} > all_merged.vcf 
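As a reminder, the split step above needs each input VCF to be bgzipped and indexed first; a minimal sketch of that prerequisite, assuming bcftools is available:

    # create a .tbi index next to each bgzipped per-sample VCF
    for f in ./*.vcf.gz; do
        bcftools index -t "$f"
    done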

lurens commented 1 year ago

Hi, splitting by chromosome worked fine. Many thanks

zhongleishi commented 7 months ago

Hello lurens, would you mind sharing your merging experience? I also used plant 30x data, but I did not get the results I wanted either by splitting chromosomes or by splitting by variant type, and it did not use a lot of memory either.