danielanach opened 2 years ago
Okay, I modified the base qualities in my BAM files to be a max of Q30, which seemed to fix this specific error, but then I ran into another problem that looked like it was related to read length... I modified several lines in hla_functions.py
that I understood to be specific to the 150 bp read length so that they reflect my 101 bp read length.
Here is an example of a line that appeared to be specific to the 150 bp read length (there were a few other lines like it):
readpos=cbind(newkmers[match(substring(seqs,11,20),newkmers[,1]),2],newkmers[match(substring(seqs,21,30),newkmers[,1]),2],newkmers[match(substring(seqs,121,130),newkmers[,1]),2],newkmers[match(substring(seqs,131,140),newkmers[,1]),2])
I was then able to run QUILT-HLA without error, but I'm not 100% sure my modifications were kosher. Is there a way to set the BAM read length from the script call that I am missing?
Thanks!
Hi, thanks for the comments. The fixed 150 bp requirement is a suboptimal coding implementation, a result of being in a rush to finish the project before submitting. I've been meaning to go back and clean those lines up but haven't had the chance. From some other open issues, there are several things I'd like to fix in the code base. If and when I get better test coverage, I'll change the code and then relax the requirements around this. I hope the base quality issue is easier to fix; that one seems weird.
Glad you could get a hacky version working!
Hey,
Just curious, is there a plan to fix the read length issue in the near future? I'm experiencing a similar problem (utf8ToInt) and it does not seem to be the base quality string issue as Daniela described -- I had the same problem after capping my Q37 strings at Q30...
Weirdly, I could get QUILT-HLA running when I aligned to a version of the fasta that does not have HLA contigs, but I keep getting the same utf8ToInt issue when I align to the reference fasta downloaded from the link in the instructions... Is this likely to be a genuine issue caused by the 150 bp requirement (my samples are 151 bp paired-end reads), or might it be something else?
Many thanks for your help!
Hi,
I'm unlikely to fix the read length issue in the near future. Simon wrote that part of the code, and his schedule is notoriously full, so I'm not sure I'll be able to get him to come back and fix it. I'd welcome a merge request if someone were to fix it (though I realize that might be a big ask).
I imagine the problem is much more likely to lie with the read length not being exactly 150 bp, rather than with the specific quality scores or their distribution (I'd be surprised by, though wouldn't rule out, any QUAL scores from 1 to 60 causing problems).
Best, Robbie
Hi Robbie,
Thanks a lot for getting back! We will also try to find a hacky version :)
One specific problem when reading this bit of code:
readpos=cbind(newkmers[match(substring(seqs,11,20),newkmers[,1]),2],newkmers[match(substring(seqs,21,30),newkmers[,1]),2],newkmers[match(substring(seqs,121,130),newkmers[,1]),2],newkmers[match(substring(seqs,131,140),newkmers[,1]),2])
Am I reading this right that it compares only four 10-mers (read positions 11-20, 21-30, 121-130, 131-140) against newkmers and does not consider the rest of the read?
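For context, here is a toy illustration of what the substring/match idiom in that line does (the newkmers table and reads below are made up for demonstration, not the real data):

```r
# Toy stand-in for the real objects: newkmers is a two-column character
# matrix mapping each 10-mer to a reference position.
newkmers <- rbind(c("AAAAACCCCC", "101"),
                  c("GGGGGTTTTT", "202"))

# Two toy 20 bp reads; the 10-mer of interest sits at read positions 11-20.
seqs <- c("TTTTTTTTTTAAAAACCCCC",
          "TTTTTTTTTTGGGGGTTTTT")

# substring() extracts the 10-mer at positions 11-20 of each read,
# match() finds its row in column 1 of newkmers,
# and indexing column 2 returns the corresponding reference position.
hits <- newkmers[match(substring(seqs, 11, 20), newkmers[, 1]), 2]
print(hits)  # "101" "202"
```

The real line simply does this four times, once per 10-mer window, and binds the four position columns together with cbind.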
Any information is much appreciated!
Many thanks, Sus
Yes exactly. Hopefully Simon described that in his methods writeup in the paper?
Yes, that seems reasonable for your case. More generally, if you're able to determine the read length and then use the same relative positions (i.e. exclude 10 bp from the right end, then take the next two 10 bp segments), and submit that code as a merge request, I'd be likely to accept it. It's harder for heterogeneous read lengths, but ideally you could also write a stop command (or filter) for non-majority read lengths, which would be great.
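A minimal sketch of that suggestion (the window arithmetic is inferred from the 150 bp positions quoted above; the function names here are made up, not QUILT-HLA's):

```r
# Derive the four 10-mer windows from the read length instead of hard-coding
# the 150 bp positions: the left pair stays fixed at 11-20 and 21-30, and the
# right pair is anchored 10 bp in from the right end of the read.
kmer_windows <- function(read_len) {
  if (read_len < 40) stop("read too short for the four-window k-mer scheme")
  rbind(c(11, 20),
        c(21, 30),
        c(read_len - 29, read_len - 20),
        c(read_len - 19, read_len - 10))
}

# Refuse to proceed on heterogeneous read lengths (the "stop command" option).
majority_read_length <- function(seqs) {
  lens <- unique(nchar(seqs))
  if (length(lens) > 1)
    stop("heterogeneous read lengths: ", paste(lens, collapse = ", "))
  lens
}

kmer_windows(150)  # reproduces the hard-coded 11-20, 21-30, 121-130, 131-140
```

For a 101 bp read this gives right-hand windows 72-81 and 82-91, keeping the same 10 bp exclusion at the read end.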
Thanks Robbie
Hi Robbie,
Again, many thanks for getting back! After I fixed the read-length issue per our discussion (the code should now cope with any read length), I kept getting the same problem. By running the code on a test dataset line by line, I realised that my issue was due to a separate problem -- some reads in my BAM file are soft-clipped, which makes them shorter. When utf8ToInt
is invoked to modify the QUAL string, it occasionally operates on an empty string, which yields a zero-length integer vector (integer(0)) instead of the expected values (and thus causes the problem I was seeing).
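This is easy to reproduce in a plain R session; utf8ToInt on an empty string silently returns a zero-length result rather than raising an error:

```r
# On a normal QUAL string, utf8ToInt returns one integer per character:
utf8ToInt("FFFF")        # 70 70 70 70

# On the empty QUAL of a fully soft-clipped read, it returns a zero-length
# integer vector, which downstream code may not expect:
utf8ToInt("")            # integer(0)
length(utf8ToInt(""))    # 0
```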
To fix this I modified filter_that
to filter out all reads that are not at the desired read length, which solved the issue on my test dataset. I'm now fairly confident this will work, and will test it on my main dataset this week to see if things run properly. Once it executes successfully I will submit the modified code as a merge request.
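Something along these lines (a hedged sketch of the filter described; `filter_by_read_length` is my own name for it, not the actual filter_that code):

```r
# Keep only reads whose sequence and QUAL string both have the expected
# length, dropping soft-clipped (shortened) reads before utf8ToInt is applied.
filter_by_read_length <- function(seqs, quals, expected_len) {
  keep <- nchar(seqs) == expected_len & nchar(quals) == expected_len
  list(seqs = seqs[keep], quals = quals[keep])
}

# A soft-clipped 3 bp read is dropped; the full-length 4 bp read survives.
filter_by_read_length(c("ACGT", "ACG"), c("FFFF", "FFF"), expected_len = 4)
```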
Cheers! Sus
Fantastic! Excellent sleuthing! My apologies that the QUILT-HLA code base isn't better tested. I look forward to a merge request.
Thanks, Robbie
Hello! One more problem I am running into.
QUILT-HLA is working great on the test BAM files, but I have been having some trouble with my own BAM files.
Here I am using the same command that worked for the test BAM files downloaded and provided in QUILT_hla_reference_panel_construction.Md, but with some lpWGS BAM files that had been aligned to hg38. Based on the error, I thought the problem might have to do with quality scores, so I looked at a couple of reads from NA12878.mhc.2.0X.bam and compared them to a couple of reads in my BAM file. It looks like the majority of the bases in NA12878.mhc.2.0X.bam are Q30 (corresponding to `?`), while the majority in my BAM file are Q37 (corresponding to `F`). Could this be causing the error? I will continue debugging this one, and see if re-encoding the base quality scores in my BAM file makes a difference. Thanks!
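For reference, the Phred+33 arithmetic behind those characters, using the same utf8ToInt call the pipeline relies on:

```r
# Phred+33 encoding: quality value = ASCII code - 33.
utf8ToInt("?") - 33   # 30 -> '?' encodes Q30 (as seen in NA12878.mhc.2.0X.bam)
utf8ToInt("F") - 33   # 37 -> 'F' encodes Q37

# And the reverse mapping, quality value -> QUAL character:
intToUtf8(30 + 33)    # "?"
intToUtf8(37 + 33)    # "F"
```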