secastel / phaser

phasing and Allele Specific Expression from RNA-seq
GNU General Public License v3.0
107 stars 37 forks source link

no alignment score value found in reads #58

Open molecularsensei opened 4 years ago

molecularsensei commented 4 years ago

Hi,

I am trying to run phaser using HiC data on a ~3 Mb region and ran into the "no alignment score value found" error.

The command I am using:

python phaser.py --vcf ~/GT_myc_output.recalibrated.filtered.vcf.gz --bam ~/SRR6251266_chr8pairs_only_bwa_sorted.bam --paired_end 1 --o ~/phaser_case --sample 20 --mapq 60 --baseq 20

(I have tried restricting the interval --chr chr8 but get the same error)

The output:

STARTED "Read backed phasing and ASE/haplotype analyses" ... DATE, TIME : 2019-11-07, 10:22:17

1. Loading heterozygous variants into intervals...

Processing sample named 20 using all the chromosomes ... processing VCF...

Memory efficient mode is deactivated...
If RAM is limited, activate memory efficient mode using the flag "--process_slow = 1"...

 creating variant mapping table...
      1059 heterozygous sites being used for phasing (1243 filtered, 0 indels excluded, 988 unphased)

2. Retrieving reads that overlap heterozygous sites...

 file: ~/SRR6251266_chr8pairs_only_bwa_sorted.bam
      minimum mapq: 10
      mapping reads to variants...
           completed chromosome chr8...
      processing mapped reads...
      no alignment score value found in reads, cannot use cutoff
      retrieved 0 reads

3. Identifying connected variants...

 calculating sequencing noise level...
 FATAL ERROR: No reads could be matched to variants. Please double check your settings and input files. Common reasons for this occurring include: 1) MAPQ or BASEQ set too conservatively 2) BAM and VCF have different chromosome names (IE 'chr1' vs '1').

After inspecting the BAM I can see that column 5 has the correct MAPQ scores. Example:

SRR6251266.247070450 81 chr8 10025 54 42M = 10860 795 CAGTGCAGACTGATATATAAATCAAAACAAATGTCCTTTACA AEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEAAAAA NM:i:0 MD:Z:42 MC:Z:36M6S AS:i:42 XS:i:33 SRR6251266.167696392 97 chr8 10051 60 42M = 149413 139404 ACAAATGTCCTTTACATGTTTTCTGTTACAGTAGTAACAATA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:29 SRR6251266.100942691 97 chr8 10052 60 42M = 18805 8795 CAAATGTCCTTTACATGTTTTCTGTTACAGTAGTAACAATAT AAAAAEE/AEEEEEE/EEEEEEAEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:28 SRR6251266.173849305 97 chr8 10056 60 42M = 77718 67697 TGTCCTTTACATGTTTTCTGTTACAGTAGTAACAATATGTGT /AAAAEEEEEEEEEEEEEAEEEAEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:42 MC:Z:7S35M AS:i:42 XS:i:28 SRR6251266.217291158 177 chr8 10057 60 42M = 162736 152680 GTCCTTTACATGTTTTCTGTTACAGTAGTAACAATATGTGTA EAEEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEA6AAA NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:29 SRR6251266.119745571 161 chr8 10063 39 42M = 11197 1176 TACATGTTTTCTGTTACAGTAGTAACAATATGTGTAAACTTA AAAAAEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:35 XA:Z:chr4,+21281,27M1D15M,1;

I would appreciate any help :)

secastel commented 4 years ago

You can fix this error by including "--as_q_cutoff 0" as an argument when running phaser.

Note, phaser performs much better when an alignment score cutoff is used. Note that this is distinct from a mapping score or MAPQ. If possible I would suggest using an aligner that outputs an alignment score, which is included as an "AS" tag, such as STAR.

molecularsensei commented 4 years ago

Thanks for getting back. I used bwa mem for alignment and the AS tag is present in the bam file.

SRR6251266.167696392 97 chr8 10051 60 42M = 149413 139404 ACAAATGTCCTTTACATGTTTTCTGTTACAGTAGTAACAATA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:29

Is this not the AS tag I should look for?

I also tried using the --as_q_cutoff 0 argument but get an error: .... minimum mapq: 60 mapping reads to variants... completed chromosome chr8... processing mapped reads... retrieved 0 reads

3. Identifying connected variants...

 calculating sequencing noise level...
 FATAL ERROR: No reads could be matched to variants. Please double check your settings and input files. Common reasons for this occurring include: 1) MAPQ or BASEQ set too conservatively 2) BAM and VCF have different chromosome names (IE 'chr1' vs '1')

....

I will try mapping with STAR and see if things change.

secastel commented 4 years ago

You're right that the AS tag is present in the BAM, so I'm confused as to the why the AS score wasn't getting picked up. As for the fact that 0 reads are getting processed, can you confirm that your VCF naming matches the BAM naming, e.g. "chr1" is listed in the first column of the VCF. If the VCF simply has "1" you can run phaser with the option "--chr_prefix chr" which should fix that.

As for the AS problem, is there any way you could provide a sample BAM so that I could debug the issue you are having? Were you able to try with STAR aligned reads?

bmansfeld commented 3 years ago

For other people who might still be getting this using GATK for variant calling. The GATK4 RNASeq pipeline recommends a CIGAR split step and I think that is causing some issues with the bam files used in Phaser. If I used the STAR aligned, deduplicated but not CIGAR N split bam file with my GATK4 derived vcf I was able to use Phaser without getting this error. Ben

teng-gao commented 1 year ago

Getting this error as well, pretty mysteriously for only 1 out of my BAMs. Edit: actually, turns out my error is because the BAM is not paired end and I set paired_end = 1