Allele specific analysis - assignment failure

PhrenoVermouth commented 1 year ago

Hi, I'm doing an allele-specific analysis, the pipeline went well until the final ice-normalization step. Tracing backwards I found the raw matrics were empty, and markAllelicStatus aberrantly assigned the reads. See .allelstat file as follows:

## /usr/local/bin//HiC-Pro_3.1.0/scripts/markAllelicStatus.py
## ibam=bowtie_results/bwt2/E9-5-PLA-rep1/E9-5-PLA-rep1-H3K27ac_S15_L002_cut_mm10.bwt2pairs.bam
## snpFile=/home3/gyang/2.ProjectET_HiCUT_Preprocessing/build/snps_C57b6_PWK_PhJ.vcf
## tag=XA
## output=bowtie_results/bwt2/E9-5-PLA-rep1/E9-5-PLA-rep1-H3K27ac_S15_L002_cut_mm10.bwt2pairs_allspe.bam
## verbose=True
## =========================
Total number of snps loaded 17046154.0
## =========================
Total number of reads   91295716    100
Number of reads with at least one 'N'   85300346    93.433
Number of reads assigned to ref genome  24111100    26.41
Number of reads assigned to alt genome  0   0.0
Number of conflicting reads 0   0.0
Number of unassigned reads  67184616    73.59

The weird thing is that when I fed the bam (E9-5-PLA-rep1-H3K27ac_S15_L002_cut_mm10.bwt2pairs.bam) to SNPsplit to verify the data, it went pretty well:

Allele-tagging report
=====================
Processed 91295716 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
46949809 reads were unassignable (51.43%)
26001038 reads were specific for genome 1 (28.48%)
18286300 reads were specific for genome 2 (20.03%)
46452 reads did not contain one of the expected bases at known SNP positions (0.05%)
58569 contained conflicting allele-specific SNPs (0.06%)

Thus this problem was not caused by the data. For VCF file generation, I stringently followed the utils introduction:

$HICPRO_PATH/bin/utils/extract_snps.py -i mgp.v5.merged.snps_all.dbSNP142.vcf -a PWK_PhJ > snps_C57b6_PWK_PhJ.vcf

and I've checked the vcf file by hand, everything seems normal:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT REF-PWK_PhJ-F1
chr1    3000185 rs585444580     G       T                               999     PASS    DP=930;DP4=283,108,361,178;CSQ=T||||intergenic_variant||||||||  GT
      0|1

One clue is that I got a HUGE markAllelicStatus.log file (>2G), full of warnings (more lines omitted below):

/home/gyang/anaconda3/bin/python /usr/local/bin//HiC-Pro_3.1.0/scripts/markAllelicStatus.py -s /home3/gyang/2.ProjectET_HiCUT_Preprocessing/build/snps_C57b6_PW
K_PhJ.vcf -v -r -i bowtie_results/bwt2/E9-5-PLA-rep1/E9-5-PLA-rep1-H3K27ac_S15_L002_cut_mm10.bwt2pairs.bam -o bowtie_results/bwt2/E9-5-PLA-rep1/E9-5-PLA-rep1-H
3K27ac_S15_L002_cut_mm10.bwt2pairs_allspe.bam
[E::idx_find_and_load] Could not retrieve index file for 'bowtie_results/bwt2/E9-5-PLA-rep1/E9-5-PLA-rep1-H3K27ac_S15_L002_cut_mm10.bwt2pairs.bam'
Warning : no SNPs found at position chr8 : 11389358. N ignored
Warning : no SNPs found at position chr8 : 11389358. N ignored
Warning : no SNPs found at position chr14 : 49493363. N ignored
Warning : no SNPs found at position chr14 : 49493363. N ignored
Warning : no SNPs found at position chr5 : 73611472. N ignored
Warning : no SNPs found at position chr5 : 73611511. N ignored

Your assistance is highly appreciated! And currently, I have to use SNPsplit to tag the reads and sed command for tag substitutions (XX --> XA). Hope that works. LOL

harshadajadhav95 commented 4 days ago

I'm facing the same problem even after changing extract_snps.py as suggested above. Kindly help

PhrenoVermouth commented 4 days ago

I'm facing the same problem even after changing extract_snps.py as suggested above. Kindly help

Hey @harshadajadhav95, it has been quite a while, maybe you can refer to #606?

Best, Guang

harshadajadhav95 commented 4 days ago

Thanks for your response. I referred #606. However, I am still encountering the same issue.

Regards, Harshada

WardDeb commented 3 days ago

You can check out either devel branch, or clone from my fork (https://github.com/wardDeb/hic-pro). Been a while since I've been looking into this, though, so I'm not so sure what happened in the meantime.

harshadajadhav95 commented 2 days ago

Thank you. I'll do the needful

nservant / HiC-Pro

Allele specific analysis - assignment failure #572