google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

READ TAG: n_elements is zero #490

Closed GuillaumeHolley closed 2 years ago

GuillaumeHolley commented 2 years ago

Hi,

I am trying to run DeepVariant 1.2.0 on a few human samples PacBio HiFi data (about 30x coverage per sample). I first ran my samples through the PEPPER-Margin pipeline r0.4 to get a haplotagged BAM file. Then I ran DeepVariant as follows:

singularity exec -B ${SOME_PATHS} deepvariant_1.2.0.sif bash /opt/deepvariant/bin/run_deepvariant --model_type PACBIO --ref ${PATH_TO_REF} --reads MARGIN_PHASED.PEPPER_SNP_MARGIN.happlotagged.bam  --output_vcf sample.vcf.gz --output_gvcf sample.g.vcf.gz --num_shards 24 --make_examples_extra_args="realign_reads=false,min_mapping_quality=5" --sample_name MYSAMPLE --use-hp-information;

I have two problems:

  1. Right from the beginning (CALL VARIANT MODULE SELECTED), for each interval processed, I get thousands of READ TAG: n_elements is zero messages in the console. What does this mean, and is it a problem or just a warning?
  2. I allocate 200GB of RAM per job and they all seem to systematically fail with out-of-memory errors. I do not recall DeepVariant using that much memory in the past, but I might be wrong. Is 200GB too little for a human genome PacBio HiFi 30x coverage dataset?

Thank you for your help, Guillaume

MariaNattestad commented 2 years ago
  1. I haven't seen the n_elements is zero message before, but googling it, it looks like a memory allocation problem.
  2. No, 200GB should absolutely be enough for DeepVariant with any setting, including PacBio HiFi.

I'm not sure what is going on here, but in case it helps unblock you for now, have you tried doing the haplotagging with whatshap? See the pacbio case study for exactly how we run this workflow: https://github.com/google/deepvariant/blob/r1.2/docs/deepvariant-pacbio-model-case-study.md. I would be curious if you are able to at least follow that case study on your compute setup and see whether that works. Our published PacBio model is based on that workflow with whatshap.
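For reference, the whatshap round in that case study looks roughly like the sketch below. File names are placeholders, and in the r1.2 workflow the input VCF comes from a first DeepVariant pass run without --use_hp_information; the commands are only echoed here so the sequence is visible without whatshap installed.

```shell
# Hedged sketch of the whatshap haplotagging round from the PacBio case
# study. REF/BAM/VCF are placeholder names, not outputs of this block.
REF=GRCh38.fasta
BAM=hifi_reads.aligned.bam
VCF=deepvariant_round1.vcf.gz   # from a first DeepVariant pass without --use_hp_information

# Echo rather than execute, so the block runs without whatshap/tabix.
cmds=$(cat <<EOF
whatshap phase --output phased.vcf.gz --reference ${REF} ${VCF} ${BAM}
tabix -p vcf phased.vcf.gz
whatshap haplotag --output haplotagged.bam --reference ${REF} phased.vcf.gz ${BAM}
EOF
)
echo "$cmds"
```

The haplotagged BAM then goes back into run_deepvariant with --use_hp_information for the final calling pass.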

@kishwarshafin do you have any ideas about what is going on here?

GuillaumeHolley commented 2 years ago

Hi @MariaNattestad,

Thank you for your answer. So both issues boil down to a single memory allocation problem. I haven't tried the WhatsHap workflow for a while now (did it on some data maybe a year ago), but I will do the PacBio case study just to confirm that it works. I have been using the PEPPER-Margin-DeepVariant workflow on corrected ONT data with great success so far (in both PacBio and ONT mode) without running into this problem. Ultimately, the phasing is just so much faster with PEPPER-Margin than with WhatsHap.

pichuan commented 2 years ago

Hi @GuillaumeHolley , I'm not sure if this is related, but one separate issue we've noticed before with memory is: If your PacBio HiFi BAM has a lot of extra auxiliary tags, current DeepVariant code will try to parse all of them, and can sometimes run OOM as a result.

Can you check your input BAM and see if that could be the case? If it's not that, I wonder if you can reproduce this on a public BAM (processed through PEPPER-Margin) and share that with us, so we can take a look?

Also adding @williamrowell in case you have seen this before.

GuillaumeHolley commented 2 years ago

Hi @pichuan,

You might actually be onto something. I have about 15 tags per read, which is not that much, but some of them seem to be very long. In particular, tags fi, fp, ri and rp have really long lists of comma-separated numbers (those tags are described here). Maybe my workflow from the SMRT cells is incorrect. I use a tool named extracthifi to get the reads and pbmm2 to map them.
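One way to eyeball how big those tag values get is to print each aux tag name with the length of its SAM field. A minimal sketch below runs on a fabricated single-read SAM line; on a real BAM you would pipe `samtools view hifi_reads.aligned.bam | head` into the same awk.

```shell
# Fabricated SAM record for illustration; in SAM, fields 12+ are aux tags
# of the form TAG:TYPE:VALUE. Print each tag name and its field length.
tag_lengths=$(printf 'read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tfi:B:C,10,11,12\tMM:Z:C+m,1\n' |
  awk -F'\t' '{for (i = 12; i <= NF; i++) {split($i, a, ":"); print a[1], length($i)}}')
echo "$tag_lengths"
# → fi 15
# → MM 10
```

On real HiFi data, a kinetics tag like fi carries one value per base, so its field length scales with the read length.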

williamrowell commented 2 years ago

You are definitely following the recommended path for someone starting with reads.bam: reads.bam -> extracthifi -> hifi_reads.bam -> pbmm2 -> hifi_reads.aligned.bam -> DeepVariant.
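Spelled out as commands, that path looks something like the sketch below; file names are placeholders and the commands are only echoed, so the block runs without the PacBio tools installed.

```shell
# Hedged sketch of the recommended preprocessing path. Exact option sets
# may differ by tool version; check each tool's --help before running.
cmds=$(cat <<'EOF'
extracthifi reads.bam hifi_reads.bam
pbmm2 align --preset CCS --sort ref.fasta hifi_reads.bam hifi_reads.aligned.bam
EOF
)
echo "$cmds"
```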

The consensus kinetics tags are relatively recent additions, but we have noticed that these seem to cause OOM errors with make_examples, and if we strip these tags these errors seem to go away.

A short-term fix would be to remove these tags to produce a BAM to be used as input for DeepVariant: samtools view -b -x fi -x fp -x ri -x rp hifi_reads.aligned.bam > hifi_reads.aligned.nokinetics.bam
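As a self-contained illustration of what those -x flags do (fabricated single-read SAM line, no samtools needed): drop the fi/fp/ri/rp aux fields and keep everything else.

```shell
# Emulate `samtools view -x fi -x fp -x ri -x rp` on one SAM record.
# Fields 1-11 are mandatory; fields 12+ are aux tags and may be dropped.
stripped=$(printf 'read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tfi:B:C,10,11\tfp:B:C,3,4\tNM:i:0\n' |
  awk -F'\t' 'BEGIN{OFS="\t"} {
    out = ""
    for (i = 1; i <= NF; i++) {
      split($i, a, ":")
      # Skip the consensus kinetics tags; keep all other fields.
      if (i > 11 && (a[1] == "fi" || a[1] == "fp" || a[1] == "ri" || a[1] == "rp")) continue
      out = out (out == "" ? "" : OFS) $i
    }
    print out
  }')
echo "$stripped"
```

The output record keeps NM:i:0 but no longer carries the fi/fp fields.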

kishwarshafin commented 2 years ago

@MariaNattestad thanks for the tag. Yes, we had seen this error before and @pichuan correctly identified it. If you have too many AUX tags then this error pops up. One way to test would be to skip PEPPER-Margin entirely and run DeepVariant directly on the unphased BAM; you'll see the same error. Unless WhatsHap is removing auxiliary tags, it should happen with that pipeline too.

pichuan commented 2 years ago

In the next release, we'll have a code update that only keeps the tags we use in memory, which will resolve this issue.

GuillaumeHolley commented 2 years ago

Wow, thank you @MariaNattestad @pichuan @williamrowell @kishwarshafin, this all went really quickly. I am currently in the process of generating {fi,fp,ri,rp}-tagless BAM files and will rerun DeepVariant on those. In the meantime, I will close this issue as I am fairly confident you found the solution to the problem. Thank you again!

GuillaumeHolley commented 2 years ago

Just to confirm: removing the tags worked. Thanks!