google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Compatibility of GATK4 gVCF files with DeepVariant for joint calling #778

Closed WeiYang-BAI closed 6 months ago

WeiYang-BAI commented 6 months ago

Dear developer,

Thank you for providing this wonderful tool! I am wondering about the compatibility of individual-level gVCF files obtained through GATK4 with DeepVariant for joint calling of multiple samples. Specifically, I have numerous individual gVCF files and would like to know if DeepVariant can effectively handle the joint calling process using these gVCFs.

Best regards,

AndrewCarroll commented 6 months ago

Hi @WeiYang-BAI

I would recommend using GLnexus to merge gVCFs from GATK and DeepVariant. GLnexus has been optimized for both GATK and DeepVariant outputs. GLnexus has several presets; to combine multiple methods, I would recommend using the unfiltered settings.

We observe that the GATK joint genotyper doesn't seem to handle DeepVariant gVCFs well, and the accuracy is much lower after using GATK on those.

WeiYang-BAI commented 6 months ago

Got it, thanks @AndrewCarroll !

Modernism-01 commented 5 months ago

Hi @AndrewCarroll ,

I followed the instructions and merged the gVCF files into a final VCF with GLnexus, using the default parameters, like this:

singularity exec glnexus.sif glnexus_cli --config DeepVariantWGS $gvcf_path/*.gvcf.gz > ${output_bcf}

But the final VCF file contains only 62409 SNPs (pepper.merged.glnexus.vcf.gz, 6.8M). There are 5 input gVCF files, each about 11GB, and the samples are pig whole genomes.

Below is the log from GLnexus.

INFO: Convert SIF file to sandbox...
WARNING: underlay of /etc/localtime required more than 50 (77) bind mounts
[71420] [2024-04-03 09:10:42.182] [GLnexus] [info] glnexus_cli release v1.4.1-0-g68e25e5 Aug 13 2021
[71420] [2024-04-03 09:10:42.182] [GLnexus] [info] detected jemalloc 5.2.1-0-gea6b3e973b477b8061e0076bb257dbd7f3faa756
[71420] [2024-04-03 09:10:42.183] [GLnexus] [info] Loading config preset DeepVariantWGS
[71420] [2024-04-03 09:10:42.190] [GLnexus] [info] config:
unifier_config:
  drop_filtered: false
  min_allele_copy_number: 1
  min_AQ1: 10
  min_AQ2: 10
  min_GQ: 0
  max_alleles_per_site: 32
  monoallelic_sites_for_lost_alleles: true
  preference: common
genotyper_config:
  revise_genotypes: true
  min_assumed_allele_frequency: 9.99999975e-05
  snv_prior_calibration: 0.600000024
  indel_prior_calibration: 0.449999988
  required_dp: 0
  allow_partial_data: true
  allele_dp_format: AD
  ref_dp_format: MIN_DP
  output_residuals: false
  more_PL: true
  squeeze: false
  trim_uncalled_alleles: true
  top_two_half_calls: false
  output_format: BCF
  liftover_fields:

  • {orig_names: [MIN_DP, DP], name: DP, description: "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read depth (reads with MQ=255 or with bad mates are filtered)\">", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}
  • {orig_names: [AD], name: AD, description: "##FORMAT=<ID=AD,Number=R,Type=Integer,Description=\"Allelic depths for the ref and alt alleles in the order listed\">", type: int, number: alleles, default_type: zero, count: 0, combi_method: min, ignore_non_variants: false}
  • {orig_names: [GQ], name: GQ, description: "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}
  • {orig_names: [PL], name: PL, description: "##FORMAT=<ID=PL,Number=G,Type=Integer,Description=\"Phred-scaled genotype Likelihoods\">", type: int, number: genotype, default_type: missing, count: 0, combi_method: missing, ignore_non_variants: true}
[71420] [2024-04-03 09:10:42.191] [GLnexus] [info] config CRC32C = 2932316105
[71420] [2024-04-03 09:10:42.191] [GLnexus] [info] init database, exemplar_vcf=/public/home/zenglingsen/01.data/01.ONT_data/01.ONT_20X_fastq_SNP_calling/03.pepper/01.gvcf/gvcf_file/AW.new.excluded.mnps.gvcf.gz
[71420] [2024-04-03 09:10:42.611] [GLnexus] [info] Initialized GLnexus database in GLnexus.DB
[71420] [2024-04-03 09:10:42.611] [GLnexus] [info] bucket size: 30000
[71420] [2024-04-03 09:10:42.612] [GLnexus] [info] contigs: NC_010443.5 NC_010444.4 NC_010445.4 NC_010446.5 NC_010447.5 NC_010448.4 NC_010449.5 NC_010450.4 NC_010451.4 NC_010452.4 NC_010453.5 NC_010454.4 NC_010455.5 NC_010456.5 NC_010457.5 NC_010458.4 NC_010459.5 NC_010460.4 NC_010461.5 NC_010462.3 NW_018084777.1 NW_018084778.1 NW_018084779.1 NW_018084780.1 NW_018084781.1 NW_018084782.1 NW_018084783.1 NW_018084784.1 NW_018084785.1 NW_018084786.1 NW_018084787.1 NW_018084788.1 NW_018084789.1 NW_018084790.1 NW_018084791.1 NW_018084792.1 NW_018084793.1 NW_018084794.1 NW_018084795.1 NW_018084796.1 NW_018084797.1 NW_018084798.1 NW_018084799.1 NW_018084800.1 NW_018084801.1 NW_018084802.1 NW_018084803.1 NW_018084804.1 NW_018084805.1 NW_018084806.1 NW_018084807.1 NW_018084808.1 NW_018084809.1 NW_018084810.1 NW_018084811.1 NW_018084812.1 NW_018084813.1 NW_018084814.1 NW_018084815.1 NW_018084816.1 NW_018084817.1 NW_018084818.1 NW_018084819.1 NW_018084820.1 NW_018084821.1 NW_018084822.1 NW_018084823.1 NW_018084824.1 NW_018084825.1 NW_018084826.1 NW_018084827.1 NW_018084828.1 NW_018084829.1 NW_018084830.1 NW_018084831.1 NW_018084832.1 NW_018084833.1 NW_018084834.1 NW_018084835.1 NW_018084836.1 NW_018084837.1 NW_018084838.1 NW_018084839.1 NW_018084840.1 NW_018084841.1 NW_018084842.1 
NW_018084843.1 NW_018084844.1 NW_018084845.1 NW_018084846.1 NW_018084847.1 NW_018084848.1 NW_018084849.1 NW_018084850.1 NW_018084851.1 NW_018084852.1 NW_018084853.1 NW_018084854.1 NW_018084855.1 NW_018084856.1 NW_018084857.1 NW_018084858.1 NW_018084859.1 NW_018084860.1 NW_018084861.1 NW_018084862.1 NW_018084863.1 NW_018084864.1 NW_018084865.1 NW_018084866.1 NW_018084867.1 NW_018084868.1 NW_018084869.1 NW_018084870.1 NW_018084871.1 NW_018084872.1 NW_018084873.1 NW_018084874.1 NW_018084875.1 NW_018084876.1 NW_018084877.1 NW_018084878.1 NW_018084879.1 NW_018084880.1 NW_018084881.1 NW_018084882.1 NW_018084883.1 NW_018084884.1 NW_018084885.1 NW_018084886.1 NW_018084887.1 NW_018084888.1 NW_018084889.1 NW_018084890.1 NW_018084891.1 NW_018084892.1 NW_018084893.1 NW_018084894.1 NW_018084895.1 NW_018084896.1 NW_018084897.1 NW_018084898.1 NW_018084899.1 NW_018084900.1 NW_018084901.1 NW_018084902.1 NW_018084903.1 NW_018084904.1 NW_018084905.1 NW_018084906.1 NW_018084907.1 NW_018084908.1 NW_018084909.1 NW_018084910.1 NW_018084911.1 NW_018084912.1 NW_018084913.1 NW_018084914.1 NW_018084915.1 NW_018084916.1 NW_018084917.1 NW_018084918.1 NW_018084919.1 NW_018084920.1 NW_018084921.1 NW_018084922.1 NW_018084923.1 NW_018084924.1 NW_018084925.1 NW_018084926.1 NW_018084927.1 NW_018084928.1 NW_018084929.1 NW_018084930.1 NW_018084931.1 NW_018084932.1 NW_018084933.1 NW_018084934.1 NW_018084935.1 NW_018084936.1 NW_018084937.1 NW_018084938.1 NW_018084939.1 NW_018084940.1 NW_018084941.1 NW_018084942.1 NW_018084943.1 NW_018084944.1 NW_018084945.1 NW_018084946.1 NW_018084947.1 NW_018084948.1 NW_018084949.1 NW_018084950.1 NW_018084951.1 NW_018084952.1 NW_018084953.1 NW_018084954.1 NW_018084955.1 NW_018084956.1 NW_018084957.1 NW_018084958.1 NW_018084959.1 NW_018084960.1 NW_018084961.1 NW_018084962.1 NW_018084963.1 NW_018084964.1 NW_018084965.1 NW_018084966.1 NW_018084967.1 NW_018084968.1 NW_018084969.1 NW_018084970.1 NW_018084971.1 NW_018084972.1 NW_018084973.1 NW_018084974.1 NW_018084975.1 
NW_018084976.1 NW_018084977.1 NW_018084978.1 NW_018084979.1 NW_018084980.1 NW_018084981.1 NW_018084982.1 NW_018084983.1 NW_018084984.1 NW_018084985.1 NW_018084986.1 NW_018084987.1 NW_018084988.1 NW_018084989.1 NW_018084990.1 NW_018084991.1 NW_018084992.1 NW_018084993.1 NW_018084994.1 NW_018084995.1 NW_018084996.1 NW_018084997.1 NW_018084998.1 NW_018084999.1 NW_018085000.1 NW_018085001.1 NW_018085002.1 NW_018085003.1 NW_018085004.1 NW_018085005.1 NW_018085006.1 NW_018085007.1 NW_018085008.1 NW_018085009.1 NW_018085010.1 NW_018085011.1 NW_018085012.1 NW_018085013.1 NW_018085014.1 NW_018085015.1 NW_018085016.1 NW_018085017.1 NW_018085018.1 NW_018085019.1 NW_018085020.1 NW_018085021.1 NW_018085022.1 NW_018085023.1 NW_018085024.1 NW_018085025.1 NW_018085026.1 NW_018085027.1 NW_018085028.1 NW_018085029.1 NW_018085030.1 NW_018085031.1 NW_018085032.1 NW_018085033.1 NW_018085034.1 NW_018085035.1 NW_018085036.1 NW_018085037.1 NW_018085038.1 NW_018085039.1 NW_018085040.1 NW_018085041.1 NW_018085042.1 NW_018085043.1 NW_018085044.1 NW_018085045.1 NW_018085046.1 NW_018085047.1 NW_018085048.1 NW_018085049.1 NW_018085050.1 NW_018085051.1 NW_018085052.1 NW_018085053.1 NW_018085054.1 NW_018085055.1 NW_018085056.1 NW_018085057.1 NW_018085058.1 NW_018085059.1 NW_018085060.1 NW_018085061.1 NW_018085062.1 NW_018085063.1 NW_018085064.1 NW_018085065.1 NW_018085066.1 NW_018085067.1 NW_018085068.1 NW_018085069.1 NW_018085070.1 NW_018085071.1 NW_018085072.1 NW_018085073.1 NW_018085074.1 NW_018085075.1 NW_018085076.1 NW_018085077.1 NW_018085078.1 NW_018085079.1 NW_018085080.1 NW_018085081.1 NW_018085082.1 NW_018085083.1 NW_018085084.1 NW_018085085.1 NW_018085086.1 NW_018085087.1 NW_018085088.1 NW_018085089.1 NW_018085090.1 NW_018085091.1 NW_018085092.1 NW_018085093.1 NW_018085094.1 NW_018085095.1 NW_018085096.1 NW_018085097.1 NW_018085098.1 NW_018085099.1 NW_018085100.1 NW_018085101.1 NW_018085102.1 NW_018085103.1 NW_018085104.1 NW_018085105.1 NW_018085106.1 NW_018085107.1 NW_018085108.1 
NW_018085109.1 NW_018085110.1 NW_018085111.1 NW_018085112.1 NW_018085113.1 NW_018085114.1 NW_018085115.1 NW_018085116.1 NW_018085117.1 NW_018085118.1 NW_018085119.1 NW_018085120.1 NW_018085121.1 NW_018085122.1 NW_018085123.1 NW_018085124.1 NW_018085125.1 NW_018085126.1 NW_018085127.1 NW_018085128.1 NW_018085129.1 NW_018085130.1 NW_018085131.1 NW_018085132.1 NW_018085133.1 NW_018085134.1 NW_018085135.1 NW_018085136.1 NW_018085137.1 NW_018085138.1 NW_018085139.1 NW_018085140.1 NW_018085141.1 NW_018085142.1 NW_018085143.1 NW_018085144.1 NW_018085145.1 NW_018085146.1 NW_018085147.1 NW_018085148.1 NW_018085149.1 NW_018085150.1 NW_018085151.1 NW_018085152.1 NW_018085153.1 NW_018085154.1 NW_018085155.1 NW_018085156.1 NW_018085157.1 NW_018085158.1 NW_018085159.1 NW_018085160.1 NW_018085161.1 NW_018085162.1 NW_018085163.1 NW_018085164.1 NW_018085165.1 NW_018085166.1 NW_018085167.1 NW_018085168.1 NW_018085169.1 NW_018085170.1 NW_018085171.1 NW_018085172.1 NW_018085173.1 NW_018085174.1 NW_018085175.1 NW_018085176.1 NW_018085177.1 NW_018085178.1 NW_018085179.1 NW_018085180.1 NW_018085181.1 NW_018085182.1 NW_018085183.1 NW_018085184.1 NW_018085185.1 NW_018085186.1 NW_018085187.1 NW_018085188.1 NW_018085189.1 NW_018085190.1 NW_018085191.1 NW_018085192.1 NW_018085193.1 NW_018085194.1 NW_018085195.1 NW_018085196.1 NW_018085197.1 NW_018085198.1 NW_018085199.1 NW_018085200.1 NW_018085201.1 NW_018085202.1 NW_018085203.1 NW_018085204.1 NW_018085205.1 NW_018085206.1 NW_018085207.1 NW_018085208.1 NW_018085209.1 NW_018085210.1 NW_018085211.1 NW_018085212.1 NW_018085213.1 NW_018085214.1 NW_018085215.1 NW_018085216.1 NW_018085217.1 NW_018085218.1 NW_018085219.1 NW_018085220.1 NW_018085221.1 NW_018085222.1 NW_018085223.1 NW_018085224.1 NW_018085225.1 NW_018085226.1 NW_018085227.1 NW_018085228.1 NW_018085229.1 NW_018085230.1 NW_018085231.1 NW_018085232.1 NW_018085233.1 NW_018085234.1 NW_018085235.1 NW_018085236.1 NW_018085237.1 NW_018085238.1 NW_018085239.1 NW_018085240.1 NW_018085241.1 
NW_018085242.1 NW_018085243.1 NW_018085244.1 NW_018085245.1 NW_018085246.1 NW_018085247.1 NW_018085248.1 NW_018085249.1 NW_018085250.1 NW_018085251.1 NW_018085252.1 NW_018085253.1 NW_018085254.1 NW_018085255.1 NW_018085256.1 NW_018085257.1 NW_018085258.1 NW_018085259.1 NW_018085260.1 NW_018085261.1 NW_018085262.1 NW_018085263.1 NW_018085264.1 NW_018085265.1 NW_018085266.1 NW_018085267.1 NW_018085268.1 NW_018085269.1 NW_018085270.1 NW_018085271.1 NW_018085272.1 NW_018085273.1 NW_018085274.1 NW_018085275.1 NW_018085276.1 NW_018085277.1 NW_018085278.1 NW_018085279.1 NW_018085280.1 NW_018085281.1 NW_018085282.1 NW_018085283.1 NW_018085284.1 NW_018085285.1 NW_018085286.1 NW_018085287.1 NW_018085288.1 NW_018085289.1 NW_018085290.1 NW_018085291.1 NW_018085292.1 NW_018085293.1 NW_018085294.1 NW_018085295.1 NW_018085296.1 NW_018085297.1 NW_018085298.1 NW_018085299.1 NW_018085300.1 NW_018085301.1 NW_018085302.1 NW_018085303.1 NW_018085304.1 NW_018085305.1 NW_018085306.1 NW_018085307.1 NW_018085308.1 NW_018085309.1 NW_018085310.1 NW_018085311.1 NW_018085312.1 NW_018085313.1 NW_018085314.1 NW_018085315.1 NW_018085316.1 NW_018085317.1 NW_018085318.1 NW_018085319.1 NW_018085320.1 NW_018085321.1 NW_018085322.1 NW_018085323.1 NW_018085324.1 NW_018085325.1 NW_018085326.1 NW_018085327.1 NW_018085328.1 NW_018085329.1 NW_018085330.1 NW_018085331.1 NW_018085332.1 NW_018085333.1 NW_018085334.1 NW_018085335.1 NW_018085336.1 NW_018085337.1 NW_018085338.1 NW_018085339.1 NW_018085340.1 NW_018085341.1 NW_018085342.1 NW_018085343.1 NW_018085344.1 NW_018085345.1 NW_018085346.1 NW_018085347.1 NW_018085348.1 NW_018085349.1 NW_018085350.1 NW_018085351.1 NW_018085352.1 NW_018085353.1 NW_018085354.1 NW_018085355.1 NW_018085356.1 NW_018085357.1 NW_018085358.1 NW_018085359.1 NW_018085360.1 NW_018085361.1 NW_018085362.1 NW_018085363.1 NW_018085364.1 NW_018085365.1 NW_018085366.1 NW_018085367.1 NW_018085368.1 NC_000845.1 [71420] [2024-04-03 09:10:42.642] [GLnexus] [info] db_get_contigs GLnexus.DB 
[71420] [2024-04-03 09:10:42.789] [GLnexus] [info] Beginning bulk load with no range filter.
[71420] [2024-04-03 10:09:37.111] [GLnexus] [info] Loaded 5 datasets with 5 samples; 846365851832 bytes in 8659464665 BCF records (882 duplicate) in 414215 buckets. Bucket max 2856376 bytes, 28997 records. 0 BCF records skipped due to caller-specific exceptions
[71420] [2024-04-03 10:09:37.141] [GLnexus] [info] Created sample set @5
[71420] [2024-04-03 10:09:37.142] [GLnexus] [info] Flushing database...
[71420] [2024-04-03 10:11:17.432] [GLnexus] [info] Bulk load complete!
[71420] [2024-04-03 10:11:17.482] [GLnexus] [warning] Processing full length of 613 contigs, as no --bed was provided. Providing a BED file with regions of interest, if applicable, can speed this up.
[71420] [2024-04-03 10:11:17.509] [GLnexus] [info] found sample set @5
[71420] [2024-04-03 10:11:17.509] [GLnexus] [info] discovering alleles in 613 range(s) on 126 threads
[71420] [2024-04-03 10:17:12.989] [GLnexus] [info] discovered 3689057 alleles
[71420] [2024-04-03 10:17:15.093] [GLnexus] [info] unified to 159191 sites cleanly with 159684 ALT alleles. 1 ALT alleles were additionally included in monoallelic sites and 1704795 were filtered out on quality thresholds.
[71420] [2024-04-03 10:17:15.093] [GLnexus] [info] Finishing database compaction...
[71420] [2024-04-03 10:17:17.832] [GLnexus] [info] genotyping 159191 sites; sample set = *@5 mem_budget = 0 threads = 128
[71420] [2024-04-03 10:19:30.901] [GLnexus] [info] genotyping complete!
[71420] [2024-04-03 10:19:30.917] [GLnexus] [info] worker threads were cumulatively stalled for 169810ms
[71420] [2024-04-03 10:19:30.917] [GLnexus] [info] Num BCF records read 5557229 query hits 1178413
INFO: Cleaning up image...

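
The unification summary in the log above already hints at where the variants went: many more ALT alleles were filtered on quality thresholds than were kept. A minimal sketch for pulling those counts out of a GLnexus log follows. The helper name `summarize_unification` is hypothetical (it is not part of GLnexus), and the regular expressions assume the exact message wording shown in this log, which may differ in other GLnexus versions.

```python
import re

# Two summary lines copied from the GLnexus log in this thread.
LOG = """\
[71420] [2024-04-03 10:17:12.989] [GLnexus] [info] discovered 3689057 alleles
[71420] [2024-04-03 10:17:15.093] [GLnexus] [info] unified to 159191 sites cleanly with 159684 ALT alleles. 1 ALT alleles were additionally included in monoallelic sites and 1704795 were filtered out on quality thresholds.
"""


def summarize_unification(log_text):
    """Extract allele counts from a GLnexus log (hypothetical helper)."""
    # Patterns assume the message wording seen in this particular log.
    patterns = {
        "discovered": r"discovered (\d+) alleles",
        "sites": r"unified to (\d+) sites",
        "alt_kept": r"cleanly with (\d+) ALT alleles",
        "alt_filtered": r"(\d+) were filtered out on quality thresholds",
    }
    counts = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log_text)
        if match is None:
            raise ValueError(f"log line for {key!r} not found")
        counts[key] = int(match.group(1))
    return counts


counts = summarize_unification(LOG)
# How many ALT alleles were dropped for each one that was kept.
ratio = counts["alt_filtered"] / counts["alt_kept"]
print(counts)
print(f"ALT alleles filtered per ALT allele kept: {ratio:.1f}")
```

For this run the ratio is about 10.7, so the quality thresholds of the chosen preset are a reasonable first thing to check.
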
AndrewCarroll commented 5 months ago

Hi @Modernism-01

For ONT data, can you try the DeepVariant_unfiltered preset? The DeepVariantWGS preset thresholds were determined based on Illumina WGS data. I hope that will help recover ONT variants that are being filtered too aggressively. If this is not the case, could you please report back here?

Thank you, Andrew
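
The suggested change amounts to swapping the `--config` value in the command posted earlier in the thread. A minimal sketch of assembling that command follows, assuming the same `glnexus.sif` Singularity image; the helper `build_glnexus_command` and the placeholder file names are hypothetical, and the function only builds the argument list without running anything.

```python
import shlex


def build_glnexus_command(gvcf_paths, preset="DeepVariant_unfiltered",
                          image="glnexus.sif"):
    """Return the argv list for a glnexus_cli run inside a Singularity image.

    Hypothetical helper: it reassembles the command from this thread with a
    different --config preset and does not execute it.
    """
    if not gvcf_paths:
        raise ValueError("no gVCF files given")
    # Sort the inputs so the command is reproducible across runs.
    return ["singularity", "exec", image, "glnexus_cli",
            "--config", preset, *sorted(gvcf_paths)]


# Placeholder file names; in practice these would be the *.gvcf.gz inputs.
argv = build_glnexus_command(["b.gvcf.gz", "a.gvcf.gz"])
print(shlex.join(argv))
```

As in the original command, glnexus_cli writes its output to standard output, so an actual run would redirect that stream to the output BCF file.
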