immunogenomics / HLA-TAPAS

HLA-TAPAS pipeline for HLA association and fine-mapping studies
47 stars 21 forks source link

Inconsistent marker IDs - Self-made 1kg subset #25

Open erichards52 opened 1 year ago

erichards52 commented 1 year ago

Hello,

First of all thank you for your efforts in making this software :)

I am currently having an issue when I am attempting to use MakeReference to make my own reference using a subset of the 1kg samples (ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.vcf.gz) available here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/

I am removing 48 samples that correlate between the GET-RM studies and 1000 genomes in order to create a reference that will enable me to evaluate HLA-TAPAS' ability to perform HLA calling via the SNP2HLA component against other software(s).

In order to remove these 48 samples and create the essential PLINK files required by MakeReference, I perform the following:

vcftools --remove get_rm_1kg_shared.txt --gzvcf ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.vcf.gz --recode --out FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.vcf.gz

vcftools --vcf FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.vcf.gz.recode.vcf --out FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes --plink

plink --vcf FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.vcf.gz.recode.vcf --out FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes

This creates the bed, bim, fam, map and ped files.

However, to replace the PED file this creates, I am using the PED file found here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HLA_types/20181129_HLA_types_full_1000_Genomes_Project_panel.txt and replacing the necessary rows in order to run NomenCleaner. My PED file I feed to NomenCleaner looks like this:

HG01879 HG01879 0       0       0       0       A*23:01 A*68:02 B*13:02 B*42:01 C*08:04 C*17:01 0       0       0       0       0   DQB1*02:02       DQB1*04:02      DRB1*03:02      DRB1*09:01

I filter any lines starting with the 48 samples present within the PED file before I run NomenCleaner:

while read p; do
  sed -i '/^'${p}'/d' ped_file.ped
done <get_rm_1kg_shared.txt

NomenCleaner runs perfectly fine and generates a CHPED, but when I attempt to run MakeReference like this:

python -m MakeReference --variants /data/1kg_files_downloaded/get_rm_1kg_shared/omni_gvcf/FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes  \
--chped /data/1kg_files_downloaded/get_rm_1kg_shared/omni_gvcf /FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.chped \ 
--hg 19 \
--out /data/1kg_files_downloaded/get_rm_1kg_shared/omni_gvcf/makeref_test/FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.bglv4 \
 --dict-AA MakeReference/data/hg19/HLA_DICTIONARY_AA.hg19.imgt3320 \
--dict-SNPS MakeReference/data/hg19/HLA_DICTIONARY_SNPS.hg19.imgt3320 \
--phasing

I get the following error:

[MakeReference_v2.py]: Making Reference Panel for "/data/1kg_files_downloaded/get_rm_1kg_shared/omni_gvcf/makeref_test/FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.bglv4"
[1] Generating Amino acid(AA)sequences from HLA types.
[2] Encoding Amino acids positions.
[3] Encoding HLA alleles.
[4] Generating DNA(SNPS) sequences from HLA types.
[5] Encoding SNP positions.
[6] Extracting founders.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
[7] Merging SNP, HLA, and amino acid datasets.
Warning: Variants 'HLA_A*01:01' and 'HLA_A*01' have the same position.
Warning: Variants 'HLA_A*01:02' and 'HLA_A*01:01' have the same position.
Warning: Variants 'HLA_A*01:89' and 'HLA_A*01:02' have the same position.
730 more same-position warnings: see log file.
[8] Performing quality control.
[9] Preparing files for Beagle.
[10] Converting PLINK to BEAGLE format.
[11] Converting BEAGLE to VCF format.
Exception in thread "main" java.lang.IllegalArgumentException: Inconsistent marker IDs: [markers file]=MitoT217C [BEAGLE file]=SNP1-713781
        at beagleutil.Beagle2Vcf.checkConsistency(Beagle2Vcf.java:153)
        at beagleutil.Beagle2Vcf.main(Beagle2Vcf.java:67)
[12] Phasing reference using Beagle4.1.

[MakeReference_v2.py::ERROR]: Phasing failed. See log file('/data/1kg_files_downloaded/get_rm_1kg_shared/omni_gvcf/makeref_test/FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.bglv4.bgl.phased.vcf.log').

With the log file /data/1kg_files_downloaded/get_rm_1kg_shared/omni_gvcf/makeref_test/FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.bglv4.bgl.phased.vcf.log reading:

Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jun16.b16.jar" for a summary of command line arguments.
Start time: 12:28 PM UTC on 10 Sep 2023

Command line: java -Xmx20480m -jar beagle.jar
  gt=/data/1kg_files_downloaded/get_rm_1kg_shared/omni_gvcf/makeref_test/FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.bglv4.bgl.vcf
  impute=false
  nthreads=1
  niterations=5
  lowmem=true
  out=/data/1kg_files_downloaded/get_rm_1kg_shared/omni_gvcf/makeref_test/FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.bglv4.bgl.phased

No genetic map is specified: using 1 cM = 1 Mb
Exception in thread "main" java.lang.IllegalArgumentException: Missing line (#CHROM ...) after meta-information lines
File source: /data/1kg_files_downloaded/get_rm_1kg_shared/omni_gvcf/makeref_test/FILTERED_ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.bglv4.bgl.vcf
null
        at vcf.VcfHeader.checkHeaderLine(VcfHeader.java:135)
        at vcf.VcfHeader.<init>(VcfHeader.java:119)
        at vcf.VcfIt.<init>(VcfIt.java:190)
        at vcf.VcfIt.create(VcfIt.java:175)
        at vcf.VcfIt.create(VcfIt.java:150)
        at main.Main.nonRefData(Main.java:264)
        at main.Main.main(Main.java:111)

Which I assume is caused by the preceding error, as there is a #CHROM line present within the VCF file being used.

If you could shine some light one what the issue is here regarding the inconsistent markers, that would be very much appreciated!

Thank you!