rwdavies / QUILT

GNU General Public License v3.0
45 stars 10 forks source link

Error in generating my own HLA reference panel:subscript out of bounds #18

Closed beifanglingyang closed 2 years ago

beifanglingyang commented 2 years ago

Hi Robbie, I followed the instructions(https://github.com/rwdavies/QUILT/blob/master/example/QUILT_hla_reference_panel_construction.Md) and produced one successful reference package.However, it gave a subscript out of bounds error when I replaced the reference haplotype data. Attached is the error message : `[2022-05-27 10:04:03] Running QUILT_HLA_prepare_reference(outputdir = quilt_hla_reference_panel_10KG_files, nGen = 100, hla_types_panel = 20181129_HLA_types_full_1000_Genomes_Project_panel.txt, ipd_igmt_alignments_zip_file = Alignments_Rel_3480.zip, ref_fasta = newGRch38/GRCh38_full_analysis_set_plus_decoy_hla.fa, refseq_table_file = hla_ancillary_files/refseq.hg38.chr6.26000000.34000000.txt.gz, full_regionStart = 25587319, full_regionEnd = 33629686, buffer = 500000, region_exclude_file = hla_ancillary_files/hlagenes.txt, genetic_map_file = CHS_chr6.txt.gz, reference_haplotype_file = /BIGDATA/USER/wei/referencepanel/quiltrefren/new10KG/10KG.hap.gz, reference_legend_file = /BIGDATA/USER/wei/referencepanel/quiltrefren/new10KG/10KG.legend.gz, reference_sample_file = /BIGDATA/USER/wei/referencepanel/quiltrefren/new10KG/10KG.samples, reference_exclude_samplelist_file = exclude_ref_samples_for_testing.txt, reference_exclude_samples_for_initial_phasing = FALSE, all_hla_regions = c("A", "B", "C", "DMA", "DMB", "DOA", "DOB", "DPA1", "DPA2", "DPB1", "DPB2", "DQA1", "DQA2", "DQB1", "DRA", "DRB1", "DRB3", "DRB4", "DRB5", "E", "F", "G", "HFE", "H", "J", "K", "L", "MICA", "MICB", "N", "P", "S", "TAP1", "TAP2", "T", "U", "V", "W", "Y"), hla_regions_to_prepare = c("A", "B", "C", "DQB1", "DRB1"), chr = chr6, minRate = 0.1, nCores = 6) sending incremental file list Alignments_Rel_3480.zip

sent 8868604 bytes received 31 bytes 5912423.33 bytes/sec total size is 8867405 speedup is 1.00 Archive: Alignments_Rel_3480.zip creating: alignments/ inflating: MACOSX/._alignments
inflating: alignments/K_nuc.txt
inflating:
MACOSX/alignments/._K_nuc.txt
inflating: alignments/DPA1_gen.txt
inflating: MACOSX/alignments/._DPA1_gen.txt
inflating: alignments/DQB1_prot.txt
inflating: __MACOSX/alignments/._DQB1_prot.txt
inflating: alignments/A_prot.txt
inflating:
MACOSX/alignments/._A_prot.txt
inflating: alignments/DRB4_nuc.txt
inflating: MACOSX/alignments/._DRB4_nuc.txt
inflating: alignments/DOA_nuc.txt
inflating: __MACOSX/alignments/._DOA_nuc.txt
inflating: alignments/W_nuc.txt
inflating:
MACOSX/alignments/._W_nuc.txt
inflating: alignments/B_nuc.txt
inflating: MACOSX/alignments/._B_nuc.txt
inflating: alignments/S_gen.txt
inflating: __MACOSX/alignments/._S_gen.txt
inflating: alignments/DQA2_gen.txt
inflating:
MACOSX/alignments/._DQA2_gen.txt
inflating: alignments/F_gen.txt
inflating: MACOSX/alignments/._F_gen.txt
inflating: alignments/V_nuc.txt
inflating: __MACOSX/alignments/._V_nuc.txt
inflating: alignments/C_nuc.txt
inflating:
MACOSX/alignments/._C_nuc.txt
inflating: alignments/DPB2_nuc.txt
inflating: MACOSX/alignments/._DPB2_nuc.txt
inflating: alignments/HFE_nuc.txt
inflating: __MACOSX/alignments/._HFE_nuc.txt
inflating: alignments/G_gen.txt
inflating:
MACOSX/alignments/._G_gen.txt
inflating: alignments/DRB5_prot.txt
inflating: MACOSX/alignments/._DRB5_prot.txt
inflating: alignments/DRB4_prot.txt
inflating: __MACOSX/alignments/._DRB4_prot.txt
inflating: alignments/J_nuc.txt
inflating:
MACOSX/alignments/._J_nuc.txt
inflating: alignments/DRB1_gen.txt
inflating: MACOSX/alignments/._DRB1_gen.txt
inflating: alignments/DMB_gen.txt
inflating: __MACOSX/alignments/._DMB_gen.txt
inflating: alignments/N_gen.txt
inflating:
MACOSX/alignments/._N_gen.txt
inflating: alignments/TAP2_nuc.txt
inflating: MACOSX/alignments/._TAP2_nuc.txt
inflating: alignments/DQB1_nuc.txt
inflating: __MACOSX/alignments/._DQB1_nuc.txt
inflating: alignments/DRB5_nuc.txt
inflating:
MACOSX/alignments/._DRB5_nuc.txt
inflating: alignments/DRB_prot.txt
inflating: MACOSX/alignments/._DRB_prot.txt
inflating: alignments/MICB_gen.txt
inflating: __MACOSX/alignments/._MICB_gen.txt
inflating: alignments/A_nuc.txt
inflating:
MACOSX/alignments/._A_nuc.txt
inflating: alignments/DOB_nuc.txt
inflating: MACOSX/alignments/._DOB_nuc.txt
inflating: alignments/T_nuc.txt
inflating: __MACOSX/alignments/._T_nuc.txt
inflating: alignments/DQA1_gen.txt
inflating:
MACOSX/alignments/._DQA1_gen.txt
inflating: alignments/E_gen.txt
inflating: MACOSX/alignments/._E_gen.txt
inflating: alignments/DMB_prot.txt
inflating: __MACOSX/alignments/._DMB_prot.txt
inflating: alignments/P_gen.txt
inflating:
MACOSX/alignments/._P_gen.txt
inflating: alignments/DRB3_gen.txt
inflating: MACOSX/alignments/._DRB3_gen.txt
inflating: alignments/DRB3_prot.txt
inflating: __MACOSX/alignments/._DRB3_prot.txt
inflating: alignments/H_nuc.txt
inflating:
MACOSX/alignments/._H_nuc.txt
inflating: alignments/MICA_prot.txt
inflating: MACOSX/alignments/._MICA_prot.txt
inflating: alignments/Y_gen.txt
inflating: __MACOSX/alignments/._Y_gen.txt
inflating: alignments/DRA_gen.txt
inflating:
MACOSX/alignments/._DRA_gen.txt
inflating: alignments/DPA2_gen.txt
inflating: MACOSX/alignments/._DPA2_gen.txt
inflating: alignments/L_gen.txt
inflating: __MACOSX/alignments/._L_gen.txt
inflating: alignments/DMA_gen.txt
inflating:
MACOSX/alignments/._DMA_gen.txt
inflating: alignments/DPA1_prot.txt
inflating: MACOSX/alignments/._DPA1_prot.txt
inflating: alignments/MICA_gen.txt
inflating: __MACOSX/alignments/._MICA_gen.txt
inflating: alignments/G_prot.txt
inflating:
MACOSX/alignments/._G_prot.txt
inflating: alignments/F_prot.txt
inflating: MACOSX/alignments/._F_prot.txt
inflating: alignments/TAP1_nuc.txt
inflating: __MACOSX/alignments/._TAP1_nuc.txt
inflating: alignments/DOA_prot.txt
inflating:
MACOSX/alignments/._DOA_prot.txt
inflating: alignments/DQA2_prot.txt
inflating: MACOSX/alignments/._DQA2_prot.txt
inflating: alignments/U_nuc.txt
inflating: __MACOSX/alignments/._U_nuc.txt
inflating: alignments/TAP2_prot.txt
inflating:
MACOSX/alignments/._TAP2_prot.txt
inflating: alignments/DPB1_nuc.txt
inflating: MACOSX/alignments/._DPB1_nuc.txt
inflating: alignments/A_gen.txt
inflating: __MACOSX/alignments/._A_gen.txt
inflating: alignments/DOB_gen.txt
inflating:
MACOSX/alignments/._DOB_gen.txt
inflating: alignments/T_gen.txt
inflating: MACOSX/alignments/._T_gen.txt
inflating: alignments/DQA1_nuc.txt
inflating: __MACOSX/alignments/._DQA1_nuc.txt
inflating: alignments/E_nuc.txt
inflating:
MACOSX/alignments/._E_nuc.txt
inflating: alignments/DRA_prot.txt
inflating: MACOSX/alignments/._DRA_prot.txt
inflating: alignments/P_nuc.txt
inflating: __MACOSX/alignments/._P_nuc.txt
inflating: alignments/DRB3_nuc.txt
inflating:
MACOSX/alignments/._DRB3_nuc.txt
inflating: alignments/CLASSI_nuc.txt
inflating: MACOSX/alignments/._CLASSI_nuc.txt
inflating: alignments/H_gen.txt
inflating: __MACOSX/alignments/._H_gen.txt
inflating: alignments/DPA2_nuc.txt
inflating:
MACOSX/alignments/._DPA2_nuc.txt
inflating: alignments/CLASSI_prot.txt
inflating: MACOSX/alignments/._CLASSI_prot.txt
inflating: alignments/Y_nuc.txt
inflating: __MACOSX/alignments/._Y_nuc.txt
inflating: alignments/DRA_nuc.txt
inflating:
MACOSX/alignments/._DRA_nuc.txt
inflating: alignments/L_nuc.txt
inflating: MACOSX/alignments/._L_nuc.txt
inflating: alignments/HFE_prot.txt
inflating: __MACOSX/alignments/._HFE_prot.txt
inflating: alignments/DMA_nuc.txt
inflating:
MACOSX/alignments/._DMA_nuc.txt
inflating: alignments/MICA_nuc.txt
inflating: MACOSX/alignments/._MICA_nuc.txt
inflating: alignments/DPB1_prot.txt
inflating: __MACOSX/alignments/._DPB1_prot.txt
inflating: alignments/TAP1_gen.txt
inflating:
MACOSX/alignments/._TAP1_gen.txt
inflating: alignments/U_gen.txt
inflating: MACOSX/alignments/._U_gen.txt
inflating: alignments/DPB1_gen.txt
inflating: __MACOSX/alignments/._DPB1_gen.txt
inflating: alignments/C_prot.txt
inflating:
MACOSX/alignments/._C_prot.txt
inflating: alignments/B_prot.txt
inflating: MACOSX/alignments/._B_prot.txt
inflating: alignments/DOB_prot.txt
inflating: __MACOSX/alignments/._DOB_prot.txt
inflating: alignments/K_gen.txt
inflating:
MACOSX/alignments/._K_gen.txt
inflating: alignments/DQA1_prot.txt
inflating: MACOSX/alignments/._DQA1_prot.txt
inflating: alignments/TAP1_prot.txt
inflating: __MACOSX/alignments/._TAP1_prot.txt
inflating: alignments/DRB4_gen.txt
inflating:
MACOSX/alignments/._DRB4_gen.txt
inflating: alignments/DRB_nuc.txt
inflating: MACOSX/alignments/._DRB_nuc.txt
inflating: alignments/DPA1_nuc.txt
inflating: __MACOSX/alignments/._DPA1_nuc.txt
inflating: alignments/DOA_gen.txt
inflating:
MACOSX/alignments/._DOA_gen.txt
inflating: alignments/W_gen.txt
inflating: MACOSX/alignments/._W_gen.txt
inflating: alignments/B_gen.txt
inflating: __MACOSX/alignments/._B_gen.txt
inflating: alignments/S_nuc.txt
inflating:
MACOSX/alignments/._S_nuc.txt
inflating: alignments/DQA2_nuc.txt
inflating: MACOSX/alignments/._DQA2_nuc.txt
inflating: alignments/F_nuc.txt
inflating: __MACOSX/alignments/._F_nuc.txt
inflating: alignments/E_prot.txt
inflating:
MACOSX/alignments/._E_prot.txt
inflating: alignments/DRB1_prot.txt
inflating: MACOSX/alignments/._DRB1_prot.txt
inflating: alignments/V_gen.txt
inflating: __MACOSX/alignments/._V_gen.txt
inflating: alignments/MICB_prot.txt
inflating:
MACOSX/alignments/._MICB_prot.txt
inflating: alignments/C_gen.txt
inflating: MACOSX/alignments/._C_gen.txt
inflating: alignments/HFE_gen.txt
inflating: __MACOSX/alignments/._HFE_gen.txt
inflating: alignments/DPB2_gen.txt
inflating:
MACOSX/alignments/._DPB2_gen.txt
inflating: alignments/G_nuc.txt
inflating: MACOSX/alignments/._G_nuc.txt
inflating: alignments/J_gen.txt
inflating: __MACOSX/alignments/._J_gen.txt
inflating: alignments/DMB_nuc.txt
inflating:
MACOSX/alignments/._DMB_nuc.txt
inflating: alignments/DRB1_nuc.txt
inflating: MACOSX/alignments/._DRB1_nuc.txt
inflating: alignments/DQB1_gen.txt
inflating: __MACOSX/alignments/._DQB1_gen.txt
inflating: alignments/TAP2_gen.txt
inflating:
MACOSX/alignments/._TAP2_gen.txt
inflating: alignments/DMA_prot.txt
inflating: MACOSX/alignments/._DMA_prot.txt
inflating: alignments/N_nuc.txt
inflating: __MACOSX/alignments/._N_nuc.txt
inflating: alignments/DRB5_gen.txt
inflating:
MACOSX/alignments/._DRB5_gen.txt
inflating: alignments/MICB_nuc.txt
inflating: __MACOSX/alignments/._MICB_nuc.txt
[2022-05-27 10:04:08] Begin making HLA all alleles kmers file [2022-05-27 10:04:08] Processing file 1 out of 39 [2022-05-27 10:05:11] Processing file 4 out of 39 [2022-05-27 10:05:15] Processing file 8 out of 39 [2022-05-27 10:06:11] Processing file 12 out of 39 [2022-05-27 10:06:41] Processing file 16 out of 39 [2022-05-27 10:10:43] Processing file 20 out of 39 [2022-05-27 10:10:49] Processing file 24 out of 39 [2022-05-27 10:10:53] Processing file 28 out of 39 [2022-05-27 10:13:15] Processing file 32 out of 39 [2022-05-27 10:13:23] Processing file 36 out of 39 [2022-05-27 10:13:47] Done making HLA all alleles kmers file [2022-05-27 10:13:47] Begin making HLA snp format alleles file [2022-05-27 10:13:47] Working on region:A [2022-05-27 10:13:47] Working on region:B [2022-05-27 10:13:47] Working on region:C [2022-05-27 10:13:47] Working on region:DQB1 [2022-05-27 10:13:47] Working on region:DRB1 [2022-05-27 10:13:52] Determined automatically that the reference genome contains allele:DQB106:02:01:01 [2022-05-27 10:13:52] Make SNP format alleles file for:DQB1 [2022-05-27 10:13:54] Determined automatically that the reference genome contains allele:DRB115:01:01:01 [2022-05-27 10:13:54] Make SNP format alleles file for:DRB1 [2022-05-27 10:13:56] Determined automatically that the reference genome contains allele:A03:01:01:01 [2022-05-27 10:13:56] Make SNP format alleles file for:A [2022-05-27 10:14:02] Determined automatically that the reference genome contains allele:C07:02:01:03 [2022-05-27 10:14:02] Make SNP format alleles file for:C [2022-05-27 10:14:02] Determined automatically that the reference genome contains allele:B*07:02:01:01 [2022-05-27 10:14:02] Make SNP format alleles file for:B [2022-05-27 10:14:36] Make full alleles filled in file for:DQB1 [2022-05-27 10:14:41] Make final files:DQB1 [2022-05-27 10:14:52] Done working on region:DQB1 [2022-05-27 10:16:11] Make full alleles filled in file for:A [2022-05-27 10:17:12] Make full alleles filled in file for:DRB1 [2022-05-27 10:17:21] Make final files:DRB1 [2022-05-27 10:17:53] Done working on region:DRB1 [2022-05-27 10:20:22] Make full alleles filled in file for:C [2022-05-27 10:21:04] Make final files:C [2022-05-27 10:21:20] Make full alleles filled in file for:B [2022-05-27 10:21:37] Done working on region:C [2022-05-27 10:24:09] Make final files:A [2022-05-27 10:24:40] Done working on region:A [2022-05-27 10:33:04] Make final files:B [2022-05-27 10:33:34] Done working on region:B [2022-05-27 10:33:34] Done making HLA snp format alleles file [2022-05-27 10:33:34] Determine who to remove from full panel [2022-05-27 10:33:34] Done determining who to remove from full panel [2022-05-27 10:33:34] Running QUILT_prepare_reference(outputdir = quilt_hla_reference_panel_10KG_files, chr = chr6, nGen = 100, regionStart = 25587319, regionEnd = 33629686, buffer = 500000, output_file = quilt_hla_reference_panel_10KG_files/quilt.hrc.hla.all.haplotypes.RData, reference_haplotype_file = /BIGDATA/USER/wei/referencepanel/quiltrefren/new10KG/10KG.hap.gz, reference_legend_file = /BIGDATA/USER/wei/referencepanel/quiltrefren/new10KG/10KG.legend.gz, reference_sample_file = /BIGDATA/USER/wei/referencepanel/quiltrefren/new10KG/10KG.samples, reference_populations = NA, reference_phred = 30, reference_exclude_samplelist_file = quilt_hla_reference_panel_10KG_files/hlauntyped.exclude.txt, region_exclude_file = , genetic_map_file = CHS_chr6.txt.gz, nMaxDH = NA, tempdir = NA, make_fake_vcf_with_sites_list = FALSE, output_sites_filename = NA, expRate = 1, maxRate = 100, minRate = 0.1) [2022-05-27 10:33:34] Program start [2022-05-27 10:33:34] Begin converting reference haplotypes [2022-05-27 10:33:35] Begin get haplotypes from reference [2022-05-27 10:33:35] Load and validate reference legend header [2022-05-27 10:33:35] Load and validate reference legend [2022-05-27 10:33:36] Extract reference haplotypes [2022-05-27 10:33:36] In the region to be imputed plus buffer there are the following number of SNPs: [2022-05-27 10:33:36] 81159 from the posfile [2022-05-27 10:33:36] 82139 from the reference haplotypes [2022-05-27 10:33:36] 81159 in the intersection of the posfile and the reference haplotypes [2022-05-27 10:34:59] Succesfully extracted 19928 haplotypes from reference data [2022-05-27 10:34:59] End get haplotypes from reference [2022-05-27 10:34:59] Getting and validating genetic map [2022-05-27 10:35:21] Match genetic map to desired imputation positions [2022-05-27 10:35:21] Done with getting genetic map [2022-05-27 10:35:46] Using nMaxDH = 15 [2022-05-27 10:35:46] Save converted reference haplotypes [2022-05-27 10:35:46] Done converting reference haplotypes [2022-05-27 10:35:46] Begin assigning HLA types to haplotypes [2022-05-27 10:35:46] Work on phasing in region:A Error in cbind(ourtypes1, ourtypes2, dist11, dist12, dist21, dist22, phase1, : subscript out of bounds Calls: QUILT_HLA_prepare_reference -> phase_hla_haplotypes -> hla_perform_step_1_phasing Execution halted processs will sleep 30s process end at : Fri May 27 10:36:25 CST 2022`

The difference between my operation and the correct example was only that I use my own reference haplotype data and re-format it to --haplegendsample: (1)The reference haplotype data I used(chr6.CGP.beagle52.filter_MAF.0.001.vcf.gz): image

From the results, it seems that I generated the hap/legend/sample file correctly. (2)The code I used: module load tabix/0.2.6 module load bcftools/1.9-gcc-4.8.5 bcftools view --output-file 10KG.temp.vcf.gz --output-type z --min-alleles 2 --max-alleles 2 --types snps chr6.CGP.beagle52.filter_MAF.0.001.vcf.gz -r chr6:25000000-34000000 tabix 10KG.temp.vcf.gz bcftools convert --haplegendsample 10KG 10KG.temp.vcf.gz (3)The result of it : process will start at : Fri May 27 09:46:47 CST 2022 ++++++++++++++++++++++++++++++++++++++++ Hap file: 10KG.hap.gz Legend file: 10KG.legend.gz Sample file: 10KG.samples 82695 records written, 0 skipped: 0/0/0 no-ALT/non-biallelic/filtered tbx_index_build failed: oneKG.temp.vcf.gz processs will sleep 30s process end at : Fri May 27 09:53:50 CST 2022

I can't figure out where the problem is, it seems that the file(Reference haplotype data including HLA alleles database) needs to meet certain conditions? Thanks very much!

rwdavies commented 2 years ago

Hi,

Thanks for the message. Sorry for the error. In general the code to build HLA reference sets isn't great as it lacks test coverage, and time commitments prevent me from comprehensively going over the code to minimize bugs, so unfortunately errors like this can happen that need to be sorted out.

Now, it looks like this might be potentially an easy bug The line mentionned above with cbind(ourtypes1, ourtypes2, dist11, dist12, dist21, dist22, phase1 only appears twice uncommented with the first of these basically useless as the product (temp) is never referenced https://github.com/rwdavies/QUILT/blob/master/QUILT/R/hla_prepare_phase_functions.R#L514 and the second also doesn't seem to be used https://github.com/rwdavies/QUILT/blob/master/QUILT/R/hla_prepare_phase_functions.R#L629

I've gone ahead and commented those two lines out, and committed the change https://github.com/rwdavies/QUILT/commit/0eb3c7c9337c3c99cd1e6168fe6b98c7d1fd06d9

So if you'd like you could try installing from github through

cd ~/proj/QUILT/ ## or similar
./scripts/build-and-install.R

and then seeing if that fixes the problem

Thanks Robbie

beifanglingyang commented 2 years ago

Robbie, Thank you very much for your prompt reply ! I'm happy to tell you that, your revised new version solves the problem perfectly after testing. And I'm sorry that the server was unavailable for several days, so I didn't get back to you in time.