Open luckysardar opened 1 month ago
Hi, Can I see the first few lines of your VCF ? As well as the chromosome name in your fasta masked file ? Thanks
Thank you for response, here is the masked genome chromosome name and vcf file
chr names
chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX chrY chrMT
vcf file(https://drive.google.com/file/d/1-18ej_EoZpTL4KEvkiukEv-exqz6Mz3G/view?usp=sharing)
chr1 3055820 . T A 5754.96 PASS DP=5080;AD=3968,1110;DP4=2007,1961,556,556;MQ=59;CSQ=A|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,G|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,C|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,G|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,C|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=48;AN=104 GT 0|1 chr1 3055997 . C A 5108.92 PASS DP=5097;AD=3980,1106;DP4=2151,1829,569,548;MQ=59;CSQ=A|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,G|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,T|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765614|CTCF_binding_site||||||||||||||SNV||||||||,A|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765615|open_chromatin_region||||||||||||||SNV||||||||,G|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765615|open_chromatin_region||||||||||||||SNV||||||||,T|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSMUSR00000765615|open_chromatin_region||||||||||||||SNV||||||||,A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,G|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,T|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=41;AN=104 GT 0|1 chr1 3056933 . G T 4964.34 PASS DP=4922;AD=3793,1126;DP4=2018,1775,579,550;MQ=59;CSQ=T|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,C|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=41;AN=102 GT 0|1 chr1 3057907 . G T 6104.57 PASS DP=5627;AD=4404,1214;DP4=2300,2104,556,667;MQ=59;CSQ=T|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,A|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||,C|intergenic_variant|MODIFIER|||||||||||||||||||SNV||||||||;AC=48;AN=104 GT 0|1
Thank you
Looking at your VCF file header, it seems that the VCF was built using mm39 genome, while you used mm10 to build your N-mask genome.
##bcftoolsCommand=mpileup -f Mus_musculus.GRCm39.dna.toplevel.fa.gz -b samples -g 10 -a FORMAT/DP,FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/SP,INFO/AD -E -Q 0 -pm 3 -F 0.25 -d 500
and it makes sense with the stats, because all your reads are unassigned
Number of unassigned reads 261883 74.01
It means that for a given SNP, HiC-Pro is exacting the nucleotide at that position in the BAM file, but that the latter is not corresponding to the expected allele ... which match the fact that you are not using the same genome version
Thanks, I think one mistake I have made is when I raised issue initially worked on mm10 later I switch to mm39 based on issue. downloaded latest vcf from mouse genome project and downloaded reference genome with same version from ensembl (Mus_musculus.GRCm39.dna.toplevel.fa (104 version)). Changing genome version and vcf did not resolve my issue
Total number of reads 56614 100 Number of reads with at least one 'N' 21965 38.798 Number of reads assigned to ref genome 8023 14.171 Number of reads assigned to alt genome 0 0.0 Number of conflicting reads 0 0.0 Number of unassigned reads 48591 85.829
Read pairs/singletons processed in total: 28307 thereof were read pairs: 28307 thereof were singletons: 0 Reads were unassignable (not overlapping SNPs): 14537 (51.35%) thereof were read pairs: 14537 thereof were singletons: 0 Reads were specific for genome 1: 6464 (22.84%) thereof were read pairs: 6464 thereof were singletons: 0 Reads were specific for genome 2: 6868 (24.26%) thereof were read pairs: 6868 thereof were singletons: 0 Reads contained conflicting SNP information: 438 (1.55%) thereof were read pairs: 438 thereof were singletons: 0
Dear I am trying to do allelic analysis Hic data getting same error G2 is mapping reads are zero and iced normalisation step getting error. I used bellow command to run analysis. non allelic analysis running completely good . I tried many publically available data with different strain cross like 129S vs CAST and B6 vs PWK got same error I am unable to find solution where its going wrong please help me
bowtie2 allespe.stat file looks like below HiC-Pro_3.1.0/scripts/markAllelicStatus.py bam=bowtie_results/bwt2/Sample1/subset_masked_mm10.bwt2pairs.bam snpFile=mm10_snps_C57b6_PWK_PhJ.vcf tag=XA output=bowtie_results/bwt2/Sample1/subset_masked_mm10.bwt2pairs_allspe.bam verbose=True
Total number of snps loaded 17046154.0
Total number of reads 353850 100 Number of reads with at least one 'N' 368870 104.245 Number of reads assigned to ref genome 91967 25.99 Number of reads assigned to alt genome 0 0.0 Number of conflicting reads 0 0.0 Number of unassigned reads 261883 74.01
code used generate VCF and masked genome
`extract_snps.py -i mgp.v5.merged.snps_all.dbSNP142.vcf -a PWK_PhJ> mm10_snps_C57b6_PWK_PhJ.vcf
bedtools maskfasta -fi mm10.fa -bed mm10_snps_C57b6_PWK_PhJ.vcf -fo masked_mm10.fa
bowtie2-build masked_mm10.fa masked_mm10 --threads 6
HiC-Pro -c config_test_as.txt -i 01_Fastq/ -o 03_output2`
please suggest me where I am going wrong
Thank you