1. Test data set:
Here are my final phase stats (with the small test, i.e. fastq downloaded with --minSpotId 0 --maxSpotId 1000000):
# cat phased.XA.dedup.stats | head -8
total 1429506
total_unmapped 203628
total_single_sided_mapped 278449
total_mapped 947429
total_dups 60249
total_nodups 887180
cis 152576
trans 734604
# cat phased.XA.phase0.stats | head -8
total 3704
total_unmapped 0
total_single_sided_mapped 0
total_mapped 3704
total_dups 257
total_nodups 3447
cis 3304
trans 143
# cat phased.XA.phase1.stats | head -8
total 3102
total_unmapped 0
total_single_sided_mapped 0
total_mapped 3102
total_dups 279
total_nodups 2823
cis 2389
trans 434
# cat phased.XA.unphased.stats | head -8
total 1203742
total_unmapped 0
total_single_sided_mapped 264503
total_mapped 939239
total_dups 59712
total_nodups 879527
cis 125709
trans 753818
# cat phased.XA.trans-phase.stats | head -8
total 1384
total_unmapped 0
total_single_sided_mapped 0
total_mapped 1384
total_dups 1
total_nodups 1383
cis 985
trans 398
I don't get why I don't have the same stats than in your tutorial, but the ratio of cis in phase0 / cis total is about the same range, I have here 2.5% (3304/(125709+985+2389+3304)) of cis in [hase0, in your example you have 1% (535/(52294+177+394+535)).
When I run the same example with the full fastq files (stll with SRR8811373), here are my final stats:
# cat phased.XA.dedup.stats | head -35
total 6923823
total_unmapped 953932
total_single_sided_mapped 1401139
total_mapped 4568752
total_dups 816275
total_nodups 3752477
cis 3045355
trans 707122
# cat phased.XA.phase0.stats | head -20
total 17866
total_unmapped 0
total_single_sided_mapped 0
total_mapped 17866
total_dups 4639
total_nodups 13227
cis 12685
trans 542
# cat phased.XA.phase1.stats | head -20
total 14771
total_unmapped 0
total_single_sided_mapped 0
total_mapped 14771
total_dups 4201
total_nodups 10570
cis 9202
trans 1368
# cat phased.XA.trans-phase.stats | head -20
total 6848
total_unmapped 0
total_single_sided_mapped 0
total_mapped 6848
total_dups 27
total_nodups 6821
cis 4867
trans 1954
# cat phased.XA.unphased.stats | head -20
total 5857878
total_unmapped 0
total_single_sided_mapped 1328611
total_mapped 4529267
total_dups 807408
total_nodups 3721859
cis 2529823
trans 1192036
Here the ratio of cis in phase0 / cis total is 0.4% (12685/(2529823+4867+12685+9202))
Do you know why the ratio of contact mapped on phase0 haplotype drops when I use the full dataset? do you have the same? I was expecting this ratio to be the same independently of using a full dataset or a sample.
Then here the ratio of 1-1 over total (= phase0 in pairtools phase) is 6% (82362/1209391)
I guess numbers of aligned reads, detected pairs and cis-contact will vary depending on quality threshold applied and algorithm used, but I don't get why it is so different between HiCPro output and pairtools phase on the full fastq file.
Additional questions:
Can I combine pairtools phase with pairtools dedub by restriction fragment?
Can I combine pairtools phase with filterByCov?
Something like
pairtools parse -> pairtools phase -> pairtools sort -> pairtools dedup amplification duplicates -> pairtools restrict -> pairtools dedup by restriction fragment -> pairtools filterbycov -> pairtools select by phase
Thanks you for your help!
David
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1198-dirty
pairtools, version 1.0.2
Hi Open2C team,
I'd like to use pairtools phase and I have some issues running it. I'd like to use it in a first time to reanalyse snHiC data from Collombet et al.,2020. following this https://pairtools.readthedocs.io/en/latest/examples/pairtools_phase_walkthrough.html
1. Test data set: Here are my final phase stats (with the small test, i.e. fastq downloaded with --minSpotId 0 --maxSpotId 1000000):
I don't get why I don't have the same stats than in your tutorial, but the ratio of cis in phase0 / cis total is about the same range, I have here 2.5% (3304/(125709+985+2389+3304)) of cis in [hase0, in your example you have 1% (535/(52294+177+394+535)).
When I run the same example with the full fastq files (stll with SRR8811373), here are my final stats:
Here the ratio of cis in phase0 / cis total is 0.4% (12685/(2529823+4867+12685+9202))
Do you know why the ratio of contact mapped on phase0 haplotype drops when I use the full dataset? do you have the same? I was expecting this ratio to be the same independently of using a full dataset or a sample.
2. Comparison with processed data:
So I compared with the number or ratio of pairs that they report in the original dataset (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129029)
Then here the ratio of 1-1 over total (= phase0 in pairtools phase) is 6% (82362/1209391) I guess numbers of aligned reads, detected pairs and cis-contact will vary depending on quality threshold applied and algorithm used, but I don't get why it is so different between HiCPro output and pairtools phase on the full fastq file.
Additional questions:
Can I combine pairtools phase with pairtools dedub by restriction fragment? Can I combine pairtools phase with filterByCov? Something like pairtools parse -> pairtools phase -> pairtools sort -> pairtools dedup amplification duplicates -> pairtools restrict -> pairtools dedup by restriction fragment -> pairtools filterbycov -> pairtools select by phase
Thanks you for your help!
David