nservant / HiC-Pro

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing
Other
382 stars 183 forks source link

Allele specific analysis - assignment failure #572

Open PhrenoVermouth opened 1 year ago

PhrenoVermouth commented 1 year ago

Hi, I'm doing an allele-specific analysis, the pipeline went well until the final ice-normalization step. Tracing backwards I found the raw matrics were empty, and markAllelicStatus aberrantly assigned the reads. See .allelstat file as follows:

## /usr/local/bin//HiC-Pro_3.1.0/scripts/markAllelicStatus.py
## ibam=bowtie_results/bwt2/E9-5-PLA-rep1/E9-5-PLA-rep1-H3K27ac_S15_L002_cut_mm10.bwt2pairs.bam
## snpFile=/home3/gyang/2.ProjectET_HiCUT_Preprocessing/build/snps_C57b6_PWK_PhJ.vcf
## tag=XA
## output=bowtie_results/bwt2/E9-5-PLA-rep1/E9-5-PLA-rep1-H3K27ac_S15_L002_cut_mm10.bwt2pairs_allspe.bam
## verbose=True
## =========================
Total number of snps loaded 17046154.0
## =========================
Total number of reads   91295716    100
Number of reads with at least one 'N'   85300346    93.433
Number of reads assigned to ref genome  24111100    26.41
Number of reads assigned to alt genome  0   0.0
Number of conflicting reads 0   0.0
Number of unassigned reads  67184616    73.59

The weird thing is that when I fed the bam (E9-5-PLA-rep1-H3K27ac_S15_L002_cut_mm10.bwt2pairs.bam) to SNPsplit to verify the data, it went pretty well:

Allele-tagging report
=====================
Processed 91295716 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
46949809 reads were unassignable (51.43%)
26001038 reads were specific for genome 1 (28.48%)
18286300 reads were specific for genome 2 (20.03%)
46452 reads did not contain one of the expected bases at known SNP positions (0.05%)
58569 contained conflicting allele-specific SNPs (0.06%)

Thus this problem was not caused by the data. For VCF file generation, I stringently followed the utils introduction:

$HICPRO_PATH/bin/utils/extract_snps.py -i mgp.v5.merged.snps_all.dbSNP142.vcf -a PWK_PhJ > snps_C57b6_PWK_PhJ.vcf

and I've checked the vcf file by hand, everything seems normal:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT REF-PWK_PhJ-F1
chr1    3000185 rs585444580     G       T                               999     PASS    DP=930;DP4=283,108,361,178;CSQ=T||||intergenic_variant||||||||  GT
      0|1

One clue is that I got a HUGE markAllelicStatus.log file (>2G), full of warnings (more lines omitted below):

/home/gyang/anaconda3/bin/python /usr/local/bin//HiC-Pro_3.1.0/scripts/markAllelicStatus.py -s /home3/gyang/2.ProjectET_HiCUT_Preprocessing/build/snps_C57b6_PW
K_PhJ.vcf -v -r -i bowtie_results/bwt2/E9-5-PLA-rep1/E9-5-PLA-rep1-H3K27ac_S15_L002_cut_mm10.bwt2pairs.bam -o bowtie_results/bwt2/E9-5-PLA-rep1/E9-5-PLA-rep1-H
3K27ac_S15_L002_cut_mm10.bwt2pairs_allspe.bam
[E::idx_find_and_load] Could not retrieve index file for 'bowtie_results/bwt2/E9-5-PLA-rep1/E9-5-PLA-rep1-H3K27ac_S15_L002_cut_mm10.bwt2pairs.bam'
Warning : no SNPs found at position chr8 : 11389358. N ignored
Warning : no SNPs found at position chr8 : 11389358. N ignored
Warning : no SNPs found at position chr14 : 49493363. N ignored
Warning : no SNPs found at position chr14 : 49493363. N ignored
Warning : no SNPs found at position chr5 : 73611472. N ignored
Warning : no SNPs found at position chr5 : 73611511. N ignored

Your assistance is highly appreciated! And currently, I have to use SNPsplit to tag the reads and sed command for tag substitutions (XX --> XA). Hope that works. LOL