rwdavies / QUILT

GNU General Public License v3.0

BAM causing error with QUILT-HLA #10

Open danielanach opened 2 years ago

danielanach commented 2 years ago

Hello! One more problem I am running into.

QUILT-HLA is working great on the test BAM files but I have been having some troubles with my BAM files.

Here I took the same command that worked for the test BAM files downloaded and provided in QUILT_hla_reference_panel_construction.Md, and used some lpWGS BAM files of my own that had been aligned to hg38.

[2021-12-20 11:34:18] Running QUILT_HLA(bamlist = /cellar/users/dnachman/programs/QUILT2/input_dir/bamlist.txt, region = A, hla_gene_region_file = hla_ancillary_files/hlagenes.txt, dict_file = hla_ancillary_files/GRCh38_full_analysis_set_plus_decoy_hla.dict, outputdir = /cellar/users/dnachman/programs/QUILT2/input_dir/, summary_output_file_prefix = quilt.hla.output, nCores = 1, prepared_hla_reference_dir = /cellar/users/dnachman/programs/QUILT2/quilt_hla_reference_panel_files_2021_12_19/, quilt_hla_haplotype_panelfile = /cellar/users/dnachman/programs/QUILT2/quilt_hla_reference_panel_files_2021_12_19/quilt.hrc.hla.A.haplotypes.RData, final_output_RData_file = NA, write_summary_text_files = TRUE, overrideoutput = FALSE, nGibbsSamples = 15, n_seek_iterations = 3, quilt_seed = NA, chr = chr6, quilt_buffer = 500000, quilt_bqFilter = 10, summary_best_alleles_threshold = 0.99)
[2021-12-20 11:34:18] Begin QUILT-HLA
[2021-12-20 11:34:18] Load input files
[2021-12-20 11:34:27] The number of HLA reference haplotypes is:5032
[2021-12-20 11:34:30] Running QUILT(outputdir = /tmp/RtmpMDvKE4, chr = chr6, regionStart = 29942554, regionEnd = 29945741, buffer = 500000, bamlist = /cellar/users/dnachman/programs/QUILT2/input_dir/bamlist.txt, cramlist = , sampleNames_file = , reference = , nCores = 1, nGibbsSamples = 15, n_seek_its = 3, Ksubset = 400, Knew = 100, K_top_matches = 5, output_gt_phased_genotypes = TRUE, heuristic_match_thin = 0.1, output_filename = NULL, RData_objects_to_save = c("sampleNames", "final_set_of_results"), output_RData_filename = /tmp/RtmpMDvKE4/file10d9341d09d54d, prepared_reference_filename = /cellar/users/dnachman/programs/QUILT2/quilt_hla_reference_panel_files_2021_12_19/quilt.hrc.hla.A.haplotypes.RData, save_prepared_reference = FALSE, tempdir = NA, bqFilter = 10, panel_size = NA, posfile = , genfile = , phasefile = , maxDifferenceBetweenReads = 1e+10, make_plots = FALSE, verbose = FALSE, shuffle_bin_radius = 5000, iSizeUpperLimit = 1e+06, record_read_label_usage = FALSE, record_interim_dosages = FALSE, use_bx_tag = TRUE, bxTagUpperLimit = 50000, addOptimalHapsToVCF = FALSE, estimate_bq_using_truth_read_labels = FALSE, override_default_params_for_small_ref_panel = TRUE, gamma_physically_closest_to = 29944147.5, seed = NA, hla_run = TRUE, downsampleToCov = 30, minGLValue = 1e-10, minimum_number_of_sample_reads = 2, nGen = NA, reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 30, reference_exclude_samplelist_file = , region_exclude_file = , genetic_map_file = , nMaxDH = NA, make_fake_vcf_with_sites_list = FALSE, output_sites_filename = NA, expRate = 1, maxRate = 100, minRate = 0.1, print_extra_timing_information = FALSE, block_gibbs_iterations = c(3, 6, 9), n_gibbs_burn_in_its = 20, plot_per_sample_likelihoods = FALSE, use_small_eHapsCurrent_tc = FALSE)
[2021-12-20 11:34:31] Get BAM sample names
[2021-12-20 11:34:31] Done getting BAM sample names
[2021-12-20 11:34:31] There are 28730 SNPs in this region
[2021-12-20 11:34:31] Imputing sample: 1
[2021-12-20 11:34:36] The average depth of this sample is:0.829529978162397
[2021-12-20 11:34:36] There are 3963 reads under consideration
[2021-12-20 11:35:17] Phasing it
[2021-12-20 11:35:20] Imputing sample: 2
[2021-12-20 11:35:25] The average depth of this sample is:1.00663503699322
[2021-12-20 11:35:25] There are 4830 reads under consideration
[2021-12-20 11:36:10] Phasing it
[2021-12-20 11:36:13] Imputing sample: 3
[2021-12-20 11:36:18] The average depth of this sample is:0.74432989298443
[2021-12-20 11:36:18] There are 3728 reads under consideration
[2021-12-20 11:36:58] Phasing it
[2021-12-20 11:37:01] Begin making and writing output file
[2021-12-20 11:37:01] bgzip output file
[2021-12-20 11:37:01] Done making and writing output file
[2021-12-20 11:37:01] Begin saving extra RData objects to disk
[2021-12-20 11:37:01] Done saving extra RData objects to disk
[2021-12-20 11:37:01] Done QUILT
[2021-12-20 11:37:01] Processing file 1 out of 3
Error in qualmat2[i, j] <- utf8ToInt(qualmat2[i, j]) :
  replacement has length zero
Calls: QUILT_HLA ... lapply -> FUN -> do_simon_read_stuff_with_that_and_that2
Execution halted

Based on the error, I thought the problem might have to do with quality scores, so I looked at a couple of reads from NA12878.mhc.2.0X.bam:

A00217:77:HFJWFDSXX:1:1449:20745:25708  83  chr6    25999944    60  150M    =   25999772    -322    TGACCTGCTGACCTTCTCTCCACTATTATCCTTTGACCCTGCCACATCCCCCTCTCCGAGAAACACCCAATAATGATCAATAAATACTAAGGGAACTCAGAGGCCGGTGGGATCCTCCGTATGCTGAACACCGGTCCCCTGGACCCCTTT ?5????5??????????+??+???????????????????+??????????????+??5+??????+????+?????????????????5???????????????????????????????????????????????????????????????? MC:Z:150M   PG:Z:MarkDuplicates MQ:i:60 AS:i:148    XS:i:117    MD:Z:1T148  NM:i:1  RG:Z:NA12878_TTGCCTAG-ACCACTTA_HFJWFDSXX_L001
A00217:76:HFLT3DSXX:3:1334:30608:21151  83  chr6    25999962    60  150M    =   25999774    -338    TCCACTATTATCCTTTGACCCTGCCACATCCCCCTCTCCGAGAAACACCCAATAATGATCAATAAATACTAAGGGAACTCAGAGGCCGGTGGGATCCTCCGTATGCTGAACACCGGTCCCCTGGACCCCTTTTTTTCTTTCTCTATACTT ???????????????????????????????????????????????????????????+?????????????????????????????????????????????????????????????????????????????????????????????? MC:Z:150M   PG:Z:MarkDuplicates MQ:i:60 AS:i:150    XS:i:112    MD:Z:150    NM:i:0  RG:

and compared them to a couple of reads in my BAM file:

A00953:441:H25C5DSX3:3:1369:18168:21245 163 chr6    26000915    60  101M    =   26001335    521 CTAAATTCTTAGGATCAACCCCTCTTGCATATGCACATCCGCAGTGCCTGTGGGATACGGTGTGGTGTGGGGAGGAAACTCTCTTCTGCCTTAATATTTTC  FFFFFFF:FFFFFFFFF,FFFFFFFF,FFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFF,FFFF:FFFFF:FFFFFFFF:FFFFFFF:FFFFFFFFFFF   NM:i:0  MD:Z:101    AS:i:101   XS:i:0  RG:Z:OXPC014_B_S12_L003 MQ:i:60 ms:i:3737   mc:i:26001435   MC:Z:101M
A00953:441:H25C5DSX3:3:2651:27950:35681 163 chr6    26001323    60  101M    =   26001405    183 CCAAAGTACTGGGATTACAGGCATGAGCCAATGTGCCCGTCCTGAGGAGTTTTCTTCTGGAATTCCTGCTGGGTTTTTGTAGTCAGTCCTCTCCCCATTTC  FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:101    AS:i:101   XS:i:37 RG:Z:OXPC014_B_S12_L003 MQ:i:60 ms:i:3701   mc:i:26001505   MC:Z:101M

It looks like the majority of the bases in NA12878.mhc.2.0X.bam are Q30 (corresponding to '?'), while the majority in my BAM file are Q37 (corresponding to 'F'). Could this be causing the error? I will continue debugging this one, and see whether re-encoding the base quality scores in my BAM file makes a difference.
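As a sanity check on the encoding (an illustrative R snippet of my own, not QUILT code): with Phred+33, each QUAL character maps to a score via its ASCII code minus 33, so '?' is Q30 and 'F' is Q37.

# Illustrative only: decode Phred+33 base qualities in R
qual_to_phred <- function(qual_string) utf8ToInt(qual_string) - 33L
qual_to_phred("?")    # 30  ('?' is ASCII 63)
qual_to_phred("F")    # 37  ('F' is ASCII 70)
qual_to_phred("?F?")  # 30 37 30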

Thanks!

danielanach commented 2 years ago

Okay, I modified the base qualities in my BAM files to a maximum of Q30, which seemed to fix this specific error, but then I ran into another problem that looked like it was related to read length. I modified several lines in hla_functions.py that I understood to be specific to the 150bp read length, so that they reflect my 101bp read length.

An example of a place that appeared to be specific to the 150bp read length (there were a few other lines like this as well):

readpos=cbind(newkmers[match(substring(seqs,11,20),newkmers[,1]),2],newkmers[match(substring(seqs,21,30),newkmers[,1]),2],newkmers[match(substring(seqs,121,130),newkmers[,1]),2],newkmers[match(substring(seqs,131,140),newkmers[,1]),2])

I was then able to run QUILT-HLA without error, but I'm not 100% sure my modifications were kosher. Is there a way to specify the BAM read length from the script call that I am missing?
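For illustration, an analogue of that line for 101bp reads might look something like the following, assuming the intent of the original is to take two 10-mers near the left end and two 10-mers mirrored from the right end (I'm not certain this is exactly what was intended):

# Hypothetical 101bp analogue of the 150bp line above:
# keep windows 11-20 and 21-30 on the left, and mirror the 121-130/131-140
# windows from the right end, which for 101bp reads becomes 72-81 and 82-91
readpos = cbind(
    newkmers[match(substring(seqs, 11, 20), newkmers[, 1]), 2],
    newkmers[match(substring(seqs, 21, 30), newkmers[, 1]), 2],
    newkmers[match(substring(seqs, 72, 81), newkmers[, 1]), 2],
    newkmers[match(substring(seqs, 82, 91), newkmers[, 1]), 2]
)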

Thanks!

rwdavies commented 2 years ago

Hi, thanks for the comments. The fixed 150bp requirement is a suboptimal implementation, the result of rushing to finish the project before submission. I've been meaning to go back and clean those parts up but haven't had the chance. From some other open issues, there are a few other things I'd like to fix in the code base. If and when I get better test coverage, I'll change the code and then relax the requirements around this. The base quality thing I hope is easier to fix; that seems weird.

Glad you could get a hacky version working!

Suuuuuuuus commented 1 month ago

Hey,

Just curious, is there a plan to fix the read length issue in the near future? I'm experiencing a similar problem (utf8ToInt), and it does not seem to be the base quality string issue Daniela described -- I had the same problem after capping my Q37 strings at Q30...

Weirdly, I could get QUILT-HLA running when I aligned to a version of the FASTA that does not have HLA contigs, but I keep getting the same utf8ToInt issue when I align to the reference FASTA downloaded from the link in the instructions... Is this likely to be a genuine issue caused by the 150bp assumption (my samples are 151bp paired-end), or might it be something else?

Many thanks for your help!

rwdavies commented 1 month ago

Hi,

I'm unlikely to fix the read length issue in the near future. Simon wrote that part of the code, and his schedule is notoriously full, so I'm not sure I'll be able to get him to come back and fix it. I'd welcome a merge request if someone were to fix it (though I realize that might be a big ask).

I imagine the problem is much more likely to lie with the read length not being exactly 150bp, rather than with the specific quality scores or their distribution (I'd be surprised, though I wouldn't rule it out, if any QUAL scores from 1 to 60 caused problems).

Best, Robbie

Suuuuuuuus commented 1 month ago

Hi Robbie,

Thanks a lot for getting back! We will also try to find a hacky version :)

One specific problem when reading this bit of code:

readpos=cbind(newkmers[match(substring(seqs,11,20),newkmers[,1]),2],newkmers[match(substring(seqs,21,30),newkmers[,1]),2],newkmers[match(substring(seqs,121,130),newkmers[,1]),2],newkmers[match(substring(seqs,131,140),newkmers[,1]),2])

It seems to compare 10-mers against newkmers, but only at four positions (11-20, 21-30, 121-130, 131-140), without considering the rest. My questions are:

Any information is much appreciated!

Many thanks, Sus

rwdavies commented 1 month ago

Yes exactly. Hopefully Simon described that in his methods writeup in the paper?

Yes, that seems reasonable for your case. More generally, if you're able to determine the read length and then use the same positions (i.e. exclude 10bp from the right end, then take the next two 10bp segments), and submit that code as a merge request, I'd be likely to accept it. It's harder for heterogeneous read lengths, but ideally, if you could also add a stop (or a filter) for non-majority read lengths, that would be great.
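Roughly, I'd imagine something along these lines (an untested sketch; the function names are just placeholders, and it assumes 10bp windows as in the current code):

# Untested sketch: compute the four 10bp window starts for an arbitrary read length,
# keeping the 150bp layout (11-20, 21-30, then two windows mirrored from the right
# end, excluding the last 10bp)
kmer_window_starts <- function(read_length) {
    c(11, 21, read_length - 29, read_length - 19)
}
get_readpos <- function(seqs, newkmers, read_length) {
    starts <- kmer_window_starts(read_length)
    do.call(cbind, lapply(starts, function(s) {
        newkmers[match(substring(seqs, s, s + 9), newkmers[, 1]), 2]
    }))
}
# read_length = 150 reproduces windows 11-20, 21-30, 121-130, 131-140
# read_length = 101 gives 11-20, 21-30, 72-81, 82-91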

Thanks Robbie

Suuuuuuuus commented 1 month ago

Hi Robbie,

Again, many thanks for getting back! After I fixed the read-length issue per our discussion (the code should now be able to cope with any read length), I kept getting the same problem. By running the code on a test dataset line by line, I realised my issue was due to a separate problem -- some reads in my BAM file are soft-clipped, which makes them shorter. When utf8ToInt is invoked to modify the QUAL string, it was by chance operating on an empty string, which yields a zero-length integer vector rather than a value, and thus causes the "replacement has length zero" error I was getting.
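A minimal reproduction of what I think is going on (just run interactively in R):

# utf8ToInt on an empty string returns a zero-length integer vector,
# and assigning that into a matrix element fails
utf8ToInt("")                     # integer(0)
qualmat2 <- matrix("?", 2, 2)
qualmat2[1, 1] <- utf8ToInt("?")  # fine: '?' decodes to 63
qualmat2[1, 2] <- utf8ToInt("")   # Error: replacement has length zero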

To fix this, I modified filter_that to filter out all reads that are not at the desired read length, which solved the issue on my test dataset. I'm now fairly confident this will work, and will test it on my main dataset this week to check that things are working properly. Upon successful execution I will submit the modified code as a merge request.
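Conceptually the filter amounts to something like this (a simplified sketch rather than the exact code; seqs, quals and read_length are placeholders for whatever the surrounding code actually uses):

# Drop reads whose sequence or QUAL string is not exactly the expected
# read length (e.g. soft-clipped reads), so utf8ToInt never sees an
# empty or truncated QUAL string
keep <- nchar(seqs) == read_length & nchar(quals) == read_length
seqs <- seqs[keep]
quals <- quals[keep]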

Cheers! Sus

rwdavies commented 1 month ago

Fantastic! Excellent sleuthing! My apologies that the QUILT-HLA code base isn't better tested. I look forward to a merge request.

Thanks, Robbie