kishwarshafin / pepper

PEPPER-Margin-DeepVariant
MIT License

r0.6 seems stuck #112

Closed GuillaumeHolley closed 2 years ago

GuillaumeHolley commented 2 years ago

Hi there,

So I just tried r0.6 on an 80x aligned ONT dataset (Guppy 5.x.x SUP model). It's been running for 2 days and 16 hours now with 24 threads and 128 GB of RAM. From the console output, I can see it is still at the beginning of the pipeline, command 2/8, pepper_variant call_variant. The last line in the log file is INFO: [THREAD 00] 1290/1293 COMPLETE (99%) [ELAPSED TIME: 40 Min 15 Sec] for the CALL VARIANT MODULE (step 1/3). It doesn't seem able to complete this step; otherwise it would show THREAD 0 FINISHED SUCCESSFULLY, but it does not. I am running the Singularity version, which I pulled exactly the same way I did for r0.4 and r0.5. To make sure it is an issue with r0.6, I also ran r0.5 on the same dataset, and it got much further through the pipeline in less than 24 hours. I don't know if this is related, but I'm having a similar issue with hapdup 0.4, which also seems to get stuck and never complete.

I would be grateful if you could shed some light on this.

Guillaume

kishwarshafin commented 2 years ago

@GuillaumeHolley ,

Would it be possible for you to run PEPPER on one chromosome, -r chr1 for example? I want to see whether it's a system issue or a data issue. Also, can you please post the parameter details that PEPPER outputs before the start of the program?
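For reference, a single-chromosome run just adds -r chr1 to the usual invocation. Below is a minimal sketch with placeholder file names, using a Singularity image as mentioned above; the command is assigned to a variable and echoed as a dry run so it can be reviewed before launching, and the model preset flag varies by release (--ont in older versions, model-specific presets such as --ont_r9_guppy5_sup later), so confirm it with run_pepper_margin_deepvariant --help.

```shell
# Dry-run sketch with placeholder file names; paste the printed command
# to actually launch it. -r restricts calling to one region, which helps
# separate system issues from data issues. The model preset flag differs
# between releases; check run_pepper_margin_deepvariant --help.
CMD="singularity exec pepper_deepvariant_r0.6.sif \
run_pepper_margin_deepvariant call_variant \
-b HG002_ONT.corrected.bam \
-f GRCh38.fasta \
-o pepper_chr1_output/ \
-t 24 -s HG002 -r chr1 \
--ont"
echo "$CMD"
```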

GuillaumeHolley commented 2 years ago

Hi @kishwarshafin,

Here are the PEPPER parameters:

[11-30-2021 18:56:09] INFO: ONT VARIANT CALLING MODE SELECTED.
[11-30-2021 18:56:09] INFO: THRESHOLDS ARE SET TO: 
[11-30-2021 18:56:09] INFO: MIN MAPQ:                           5
[11-30-2021 18:56:09] INFO: MIN SNP BASEQ:                      5
[11-30-2021 18:56:09] INFO: MIN INDEL BASEQ:                    5
[11-30-2021 18:56:09] INFO: MIN SNP FREQUENCY:                  0.1
[11-30-2021 18:56:09] INFO: MIN INSERT FREQUENCY:               0.15
[11-30-2021 18:56:09] INFO: MIN DELETE FREQUENCY:               0.15
[11-30-2021 18:56:09] INFO: MIN COVERAGE THRESHOLD:             5
[11-30-2021 18:56:09] INFO: MIN CANDIDATE SUPPORT:              2
[11-30-2021 18:56:09] INFO: MIN SNP CANDIDATE FREQUENCY:        0.1
[11-30-2021 18:56:09] INFO: MIN INDEL CANDIDATE FREQUENCY:      0.12
[11-30-2021 18:56:09] INFO: SKIP INDEL CANDIDATES:              False
[11-30-2021 18:56:09] INFO: MAX ALLOWED CANDIDATE IN ONE SITE:  4
[11-30-2021 18:56:09] INFO: MIN SNP PREDICTIVE VALUE:           0.3
[11-30-2021 18:56:09] INFO: MIN INSERT PREDICTIVE VALUE:        0.4
[11-30-2021 18:56:09] INFO: MIN DELETE PREDICTIVE VALUE:        0.4
[11-30-2021 18:56:09] INFO: SNP QV CUTOFF FOR RE-GENOTYPING:    20
[11-30-2021 18:56:09] INFO: INDEL QV CUTOFF FOR RE-GENOTYPING:  15

I will cancel the current full-genome run now, since it is stuck anyway, and start the chr1-only run instead. I'll get back to you on Monday. Thanks!

Guillaume

kishwarshafin commented 2 years ago

@GuillaumeHolley ,

If the first step (1/8) takes more than 120 minutes, then something is wrong. You can kill the process; if possible, can you please share the BAM? It would be great if I could reproduce this locally.

GuillaumeHolley commented 2 years ago

Hi @kishwarshafin,

chr1 has completed without any issues, took 28 CPU hours and 217 GB of RAM.

kishwarshafin commented 2 years ago

217 GB of RAM! Can you please provide the full log you got from PEPPER? I think it's picking up too many candidates.

GuillaumeHolley commented 2 years ago

Please find attached the log files for chr1. I also forgot to mention that this is a corrected ONT dataset; I overlooked it because corrected reads were working great with r0.4 and r0.5.

1_pepper.log 2_margin_haplotag.log 4_DeepVariant.log 5_merge_vcf.log

kishwarshafin commented 2 years ago

@GuillaumeHolley , is there anything available where I can read about the corrected reads? How much improvement in read quality should one expect?

GuillaumeHolley commented 2 years ago

The software used is Ratatosk and you can read about it here. The version I am using is not yet on GitHub, but the results from the GitHub version should be fairly similar to what I have at the moment. The mean error rate of the corrected reads should be between 2 and 2.3%. The variant calling results are fairly similar to the uncorrected ones, I believe: 99.8% F1 for SNPs and about 80% F1 for indels using PEPPER-Margin r0.5.

kishwarshafin commented 2 years ago

Thank you, let me do a little bit of reading before I suggest parameter changes.

GuillaumeHolley commented 2 years ago

Hi @kishwarshafin,

Thanks for having a look into this. In the meantime, I tried another full-genome run of r0.6 with significantly more memory allocated for the job (requested 300 GB) and got much further. The 4th step, DeepVariant, failed almost immediately with the following error: ValueError: Not Found: Unknown reference name 'chrZ_001422_local'. I checked my reference file and it does contain this contig. Any ideas?

Thanks, Guillaume

GuillaumeHolley commented 2 years ago

Btw, I computed the error rate of the ONT dataset: it is around 1.4% after correction, with an N50 of 33.5 kb.

kishwarshafin commented 2 years ago

@GuillaumeHolley ,

I have read the manuscript and gone over a few things. Here are the differences between v0.5 and v0.6:

1) In version 0.5, we would only create examples of "one site" and classify between 15 classes (AA, AC, etc.). That had a lot of issues, and we saw it as a severe limitation to the future of the pipeline. So, in v0.6 we made it much more granular: we now create examples "per candidate". If a site has multiple possible candidates, we create multiple images and independently classify each as het/hom/hom-alt.

2) We also saw a major boost in read accuracy, which meant we could be a little more flexible with the parameters to increase sensitivity. I'm currently benchmarking, and the increase in sensitivity is tremendous.

3) In the newer version, we have exposed most of the parameters so you can tune them to your needs. If you run run_pepper_margin_deepvariant --help, you'll see a lot of exposed parameters that you can use to customize things.

With all this said, I think what is happening is that the reads are being polished non-uniformly, meaning some reads keep erroneous alleles that Ratatosk could not polish correctly. As PEPPER is trained with more granularity now, you are seeing these signals being picked up. The most ideal thing would be to train a model to fit your dataset; with 85x coverage, if it's a GIAB sample, you can do that very easily. In fact, I'm writing documentation now on how to train PEPPER. Would that be something you are interested in? Also, are there any public datasets of Guppy 507_sup polished reads that I can assess locally?

As for the ValueError: Not Found: Unknown reference name 'chrZ_001422_local' error, can you check whether PEPPER's VCF header has this contig annotated?
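One quick way to check is to grep the VCF header for ##contig lines, e.g. zcat PEPPER_VARIANT_FULL.vcf.gz | grep '^##contig' (or bcftools view -h if bcftools is available). Sketched below on a fabricated two-line header so the commands can be seen in action; the real check would of course run against the actual VCF.

```shell
# Toy stand-in for the real header of PEPPER_VARIANT_FULL.vcf.gz
# (on the real file: zcat PEPPER_VARIANT_FULL.vcf.gz | grep '^##contig').
printf '##fileformat=VCFv4.2\n##contig=<ID=chr1,length=248956422>\n' > toy_header.vcf
# Count contig lines in the header:
grep -c '^##contig' toy_header.vcf                              # prints 1
# Test for the contig DeepVariant complained about:
grep -q 'chrZ_001422_local' toy_header.vcf && echo present || echo absent   # prints absent
```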

GuillaumeHolley commented 2 years ago

Hi @kishwarshafin,

Thank you for the detailed explanation; a lot of it makes sense. When you say some reads have erroneous alleles that Ratatosk could not polish correctly, do you mean that the correction makes the allele incorrect (because the algorithm is wrong for this region, or the read is just bad and makes the polishing incorrect), and this is why all these signals are being picked up? In general, it is true that the polishing is not uniform, because we try to be rather "conservative" and not correct subreads as soon as there is doubt. You can see that some regions are always problematic, and when you look at the mapping, there are clusters of uncorrected subreads left.

I would indeed be very interested in training PEPPER on a Ratatosk-corrected dataset, if that would help improve the results with the newest versions of PEPPER. If you have an early version of your training tutorial, I can give you some feedback on it as an "early user", if that is also something you would be interested in.

The current dataset I am talking about is HG002 corrected with 60x Illumina. I would be happy to share it with you if it helps; I just need an email address, and our IT service will make it available to you.

GuillaumeHolley commented 2 years ago

I forgot to write about the DeepVariant issue. chrZ_001422_local is not a contig used in the PEPPER_VARIANT_FULL.vcf.gz file. In this file, only chromosomes 1 to 22, X, Y and M are used.

kishwarshafin commented 2 years ago

> Thank you for the detailed explanation, a lot of it makes sense. When you say some reads have erroneous alleles that Ratatosk could not polish correctly and it's increasing the resolution, do you mean that the correction causes the allele to be incorrect (because the algorithm is incorrect for this region or the read is just bad and makes the polishing incorrect) and this is why all these signals are being picked up?

I would say it's not polishing uniformly across all reads, which is causing a consensus issue. So, when we go to independent candidates, we pick up too many candidates.

> In general, it is true the polishing is not uniform because we try to be rather "conservative" and not correct subreads as soon as there is doubt. You can see that some regions are always problematic and when you look at the mapping, there are clusters of uncorrected subreads left.

Yes, that's exactly what I'm saying, I think PEPPER is getting stuck in a repeat region where polishing isn't uniform.

> I would be very much interested in training PEPPER on Ratatosk-corrected dataset indeed, if that would help improve the results with the newest versions of PEPPER. If you have an early version of your training, I can give you some feedback on it too as an "early user" if that is also something you would be interested in. The current dataset I am talking about is HG002 corrected with 60x Illumina. I would be happy to share it with you if it helps. I just need an email address and our IT service will make it available to you.

That would be great. However, at this point, a faster iteration would be for you to provide me a BAM file, so I can send you models and the associated commands for running with those models. If you are at liberty to share data with our lab, please send an email to shafin@ucsc.edu; I just want to make sure that we on-board the data properly.

> I forgot to write about the DeepVariant issue. chrZ_001422_local is not a contig used in the PEPPER_VARIANT_FULL.vcf.gz file. In this file, only chromosomes 1 to 22, X, Y and M are used.

Ok, I believe that's the issue. The contigs need to be consistent between the candidate VCF and the reference file. Would it be possible to subset the reference to chr1-22, X, Y, M and run the DeepVariant step only?
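For reference, one way to subset the reference is samtools faidx with an explicit contig list. A sketch below, assuming the GRCh38 no-alt analysis set named later in the thread; the commands are echoed as a dry run, since the contig spellings must match the names in the fasta's own .fai index.

```shell
# Build the chr1-22, X, Y, M region list, then print the samtools faidx
# commands as a dry run; paste the printed lines to actually run them.
REGIONS=""
for i in $(seq 1 22); do REGIONS="$REGIONS chr$i"; done
REGIONS="$REGIONS chrX chrY chrM"
echo "samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta$REGIONS > GRCh38.main_chroms.fasta"
echo "samtools faidx GRCh38.main_chroms.fasta"   # re-index the subset
```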

GuillaumeHolley commented 2 years ago

I see, thank you for the explanation. To be honest, I was already surprised that r0.5 performed so well on the corrected reads given the lack of uniform correction; your description of r0.5 vs r0.6 clears it all up. I have arranged for my IT service to send you the BAM file (~200 GB) through our temporary storage service. You should receive an email within the next few days.

I'll try the reference subset for DeepVariant.

Thanks!

GuillaumeHolley commented 2 years ago

Hi @kishwarshafin,

Just checking whether you have received the link for the BAM file. Also, the whole dataset is 150x aligned, not 80x as I wrote in my first message.

Guillaume

kishwarshafin commented 2 years ago

@GuillaumeHolley ,

So far I haven't received any email.

GuillaumeHolley commented 2 years ago

@kishwarshafin Sorry about that, it took a little bit of time to upload the BAM file. You should have received a link now. The file will stay online until December 28th.

kishwarshafin commented 2 years ago

@GuillaumeHolley ,

The first thing that came to my mind after I got the file is the coverage. I do think 150x is a bit too much. Would it be possible for you to downsample it to 60x and run PEPPER again? You can also try test-r0.7; with 60x it should not have major issues.
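For reference, samtools view -s is one way to downsample; a sketch with placeholder file names follows. The -s argument is SEED.FRACTION, so going from ~150x to ~60x means keeping 60/150 = 0.40 of the reads. The samtools commands are echoed as a dry run.

```shell
# Keep fraction for ~150x -> ~60x: 60/150 = 0.40
FRAC=$(awk 'BEGIN { printf "%.2f", 60/150 }')
echo "$FRAC"   # prints 0.40
# Dry run (paste the printed lines to execute); -s 42.40 means
# seed 42, keep ~40% of reads. File names are examples.
echo "samtools view -s 42.40 -b -@24 HG002_ONT.corrected.bam -o HG002_ONT.corrected.60x.bam"
echo "samtools index -@24 HG002_ONT.corrected.60x.bam"
```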

I am also working on the ValueError that DeepVariant reports. I'm a tiny bit backlogged, but hopefully I'll have some answers for you next week.

kishwarshafin commented 2 years ago

@GuillaumeHolley ,

I finally had some time to look into this issue in detail. The first thing I wanted to do was to see why PEPPER is failing on your data. I did the following:

samtools view -b -@71 HG002_ONT.corrected.bam chr20 > HG002_ONT.corrected.chr20.bam
samtools index -@71 HG002_ONT.corrected.chr20.bam

And then I ran PMDV using:

time docker run -it -v /data:/data -v /data2:/data2 -v /scratch:/scratch \
-u `id -u`:`id -g` \
kishwars/pepper_deepvariant:test-r0.7 \
run_pepper_margin_deepvariant call_variant \
-b HG002_ONT.corrected.chr20.bam \
-f GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta \
-o HG002_chr20_ratatosk/ \
-t 71 \
-s HG002 \
-r chr20 \
--ont_r9_guppy5_sup

After running hap.py, I'm getting 0.848400 F1 for INDELs and 0.998423 for SNPs; can you confirm that this is also what you get? I'll do a whole-genome run today.
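For anyone reproducing this benchmark, a hap.py comparison against the GIAB truth set might look like the sketch below; the truth VCF/BED names are placeholders, the query assumes the pipeline's default final VCF name, and the command is echoed as a dry run.

```shell
# Dry-run hap.py comparison restricted to chr20 (paste the printed line
# to run). Truth VCF/BED names are placeholders for the GIAB HG002
# benchmark files; -f is the confident-region BED, -l limits the region.
CMD="hap.py HG002_GRCh38_benchmark.vcf.gz \
HG002_chr20_ratatosk/PEPPER_MARGIN_DEEPVARIANT_FINAL_OUTPUT.vcf.gz \
-f HG002_GRCh38_benchmark.bed \
-r GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta \
-o happy_chr20 --pass-only -l chr20"
echo "$CMD"
```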

One of the things I have changed this time is how we load data during inference. Previously we used to load data from the entire genome during inference, but some recent reports showed that's not ideal. So I changed it to load a chunk of the images into memory at a time; maybe that's why you ran out of memory before. Hopefully, this whole-genome run with the issue fixed will succeed.

kishwarshafin commented 2 years ago

@GuillaumeHolley , I also ran the test-r0.7 docker on the entire 150x HG002 data and it just finished. It took 333 minutes on 72 CPUs + 377 GB RAM; the peak memory usage I saw was 260 GB, during DeepVariant make_examples. The final F1 scores look great (SNP: 0.9979, INDEL: 0.854).

Here's my lscpu output:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              72
On-line CPU(s) list: 0-71
Thread(s) per core:  2
Core(s) per socket:  18
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz

If you could please give this new version a try, and if it fails again, I'll be happy to take another look.

GuillaumeHolley commented 2 years ago

Hi @kishwarshafin,

This is excellent news, thank you very much! Yes, about 99.8% F1 for SNPs and 85% for indels is about what I was expecting. Do you think we could get more out of the corrected data with re-training? In any case, thank you for looking into this. I am on holiday until the end of the week, so I will make sure to fire up r0.7 when I am back on Friday.

kishwarshafin commented 2 years ago

@GuillaumeHolley ,

Yes, I believe there's enough difference between Ratatosk-corrected and native Guppy reads that re-training will have some effect. I ran some internal analysis, and it looks like things should improve slightly with retraining.

For now, I'll prioritize releasing the next version; after the break, in the new year, I'll prioritize re-training PEPPER on your data. By then, the training tutorial will also be up, so you can give it a try too if you want.

kishwarshafin commented 2 years ago

@GuillaumeHolley ,

Closing this issue as it's resolved for now.