Open xfengnefx opened 1 month ago
Let me know if you have any issue replicating the task.
Sorry I think I misunderstood what you mean by question 1. The notebook do not process the intermediate results, it compares the CSVs to get the value of your 2nd question. To run whatshap compare you need to run the meth_phaser_post_processing
script instead of the jupyter notebook. Sorry for the confusion of the file name.
Thanks for the reply! For question 1, I did run the meth_phaser_post_processing
script, like so:
meth_phaser_parallel -b haplotagged.bam \
-g phase.vcf.gtf \
-r ref.fasta \
-vc phased.vcf.gz \
-o work
meth_phaser_post_processing \
-ib haplotagged.bam \
-if work \
-ov output.vcf \
-ob output \
-vc phased.vcf.gz
where haplotagged.bam and phased.vcf.* is the outputs from whatshap or hapcut2.
Yet the output vcf has the same number of unique phase block IDs as the input vcf (from whatshap). This was HG002 chr6, and methphaser definitely joined many blocks in both the csv file and the bam output. Is there anything I should sanity check on? Am I understanding the phase block IDs wrong? I assume whatshap compare
would use them, so blocks that joined together need to share an ID.
Yes this is weird because the script does merge the blocks when it reads the csv files from the intermediate output. Phase block IDs are just the start locations of the phaseblocks, which should be reduced after merging. One quick way to check is to see if the haplotype assignment (1|0 or 0|1) is flipped in some blocks, which could help you make sure the script is modifying the vcf file.
edit: after a while I realized the csv's "same" and "not same" is relative to the whatshap vcf, not absolute. Some of the correct/wrong phasing said about the intermediate csv below might be incorrect, please disregard them.
The vcf the bam are being modified, but seems not in all places that need modification, in terms of consistency with the intermediate csv...? I wonder what am doing wrong. The following are from these files:
HG002 with hg38 as reference and 1-index. This joining is all correct:
chr6:23,954,211-24,015,472
But for the following gap:
chr6:11,092,382-11,147,866
Here the intermediate csv has the correct decision, but vcf and bam did not flip the haplotags of the reads...? The left and the right variants in whatshap vcf:
chr6 11092382 . G C 28.14 PASS F;ANN=C|upstream_gene_variant|MODIFIER|SMIM13|SMIM13|transcript|NM_001135575.2|protein_coding||c.-1932G>C|||||1452|,C|intergenic_re
gion|MODIFIER|ELOVL2-AS1-SMIM13|ELOVL2-AS1-SMIM13|intergenic_region|ELOVL2-AS1-SMIM13|||n.11092382G>C|||||| GT:GQ:DP:AD:AF:PS 1|0:28:101:51,48:0.4752:125000
chr6 11147866 . C T 12.56 PASS F;ANN=T|intergenic_region|MODIFIER|SMIM13-NEDD9|SMIM13-NEDD9|intergenic_region|SMIM13-NEDD9|||n.11147866C>T|||||| GT:GQ:DP:AD
:AF:PS 1|0:12:89:51,38:0.427:11147866
For chr6:42318581-42403518, the decision seems not right and phase group ID is not updated:
In IGV:
(^ top track is 1Mb subsequences of T2T HG002 aligned to ref with minimap2 -c -x asm5
)
Another maybe unrelated issue is seen at chr6:43,291,981-43,347,903 , which is correctly assigned & modified as a not-same joint. The intermediate csv's line is "(42403518, 43291981)","(43347903, 44366651)","(42319150, 43347903)","(43291981, 44420831)",not same,16,86,381831,1009123,412,1127,138569,1252385,130,1409
. However, the 1st SNP on the right, chr6:43,347,903
, does not have its haplotag flipped. The 2nd SNP (43369580) and afterwards are all correctly flipped.
I did not retain the stdout or stderr for this run, but I think the only warnings were:
/micromamba/envs/methphaser-env/bin/meth_phaser_parallel:241: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
phased_df_chr.get_group(chromosome).iterrows()
/micromamba/envs/methphaser-env/bin/methphasing:7: DeprecationWarning:pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
from pkg_resources import require
did you use the postprocessing script in your pull request version? I think that actually ignored the block merge I did.
I did a clean conda installation, but still got the same symptoms somehow. On a r10.4.1 e8.2 HG002 chr6, whatshap's phasing had 62 unique phase block IDs, post methphaser phasing it becomes 58. Here are the steps of this run:
Variant calling and SNP phasing files are reused from older runs of the pipeline via softlinks. It was epi2me-labs/wf-human-variation (calling clair3) and whatshap. Remove conda env of previous methphaser, then micromamba clean --all
followed by create new env and micromamba install -c conda-forge -c bioconda pysam=0.22.0 python methphaser line-profiler
(pinning pysam because my note says otherwise a version 0.8 would be grabbed).
Run the following bash script:
conda activate methphaser-clean
printf '=== conda list\n' > log
conda list >> log
printf '\n\n=== checkusms\n' >> log
md5sum /home/OXFORDNANOLABS/xfeng/micromamba/envs/methphaser-clean/bin/meth_phaser_parallel >> l
og
md5sum /home/OXFORDNANOLABS/xfeng/micromamba/envs/methphaser-clean/bin/methphasing >> log
md5sum /home/OXFORDNANOLABS/xfeng/micromamba/envs/methphaser-clean/bin/meth_phaser_post_processi
ng >> log
printf '\n\n=== file directory at the start\n' >> log
ls -lah >> log
printf '\n\n=== methphaser parallel\n' >> log
meth_phaser_parallel -b haplotagged.F4F256F2048.bam \
-g phased.vcf.gtf \
-r /media/groups/machine_learning/active/xfeng/ref/hg38.fa \
-vc phased.vcf.gz \
-o work 1>>log 2>&1
printf '\n\n=== methphaser post porcessing\n' >> log
meth_phaser_post_processing \
-ib haplotagged.F4F256F2048.bam \
-if work \
-ov output.vcf \
-ob output \
-vc phased.vcf.gz 1>>log 2>&1
printf '\n\n=== current directory\n' >> log
ls -lah >> log
Upon finish, the log file is: log.txt. Checksums from the log file are:
a318f6a578b8351a825e1b52c1b92a98 /home/OXFORDNANOLABS/xfeng/micromamba/envs/methphaser-clean/bi
n/meth_phaser_parallel
2e316e7dd0c2758c21c45d28b02268ae /home/OXFORDNANOLABS/xfeng/micromamba/envs/methphaser-clean/bi
n/methphasing
6f4fb3e0a1f61843cbedf9839ae0f53d /home/OXFORDNANOLABS/xfeng/micromamba/envs/methphaser-clean/bi
n/meth_phaser_post_processing
which should be identical with commit 0e44165, tag V0.0.3.
Outputs:
Parsing of phase block IDs:
def load_variants(fn, group_override={}):
groups = {}
with open(fn) as file:
for line in file:
if line[0]=='#': continue
line = line.strip().split('\t')
if len(line[3])!=1 or len(line[4])!=1: continue
if line[3] not in 'ATCG' or line[4] not in 'ATCG': continue
tags = line[-2].split(':')
values = line[-1].split(':')
if 'GT' not in tags: continue
if 'PS' not in tags: continue
genotype = values[tags.index('GT')]
group = values[tags.index('PS')]
if '|' not in genotype: continue
try:
group = int(group)
except:
continue
if group in group_override:
group = group_override[group]
if genotype[0]==genotype[2]: continue
pos = int(line[1])
if group not in groups:
groups[group] = {}
groups[group][pos] = genotype
return groups
tmp = load_variants('../mp/troubleshoot_methphaser_blockID_again/phased.vcf')
print(len(tmp)) # 62
tmp = load_variants('../mp/troubleshoot_methphaser_blockID_again/output.vcf')
print(len(tmp)) # 58
Do you by any chance still have the hapcut2 phased vcf files of the runs described in the paper? I'm using the intermediate csv files at the moment and would like to make sure my parsing is consistent with what would be obtained in the correct vcf.
https://zenodo.org/records/11195009 This one I think
I was aware of the zenodo link, but it did not contain the vcf produced by hapcut2 prior to methphase. I would like to see how many gaps were there for chr6 and run whatshap compare
on both vcf. Are they available anywhere?
Putting that aside, I try to understand the block relationships using R9_60X chr6 methphaser-tagged vcf and the intermediate csv files:
1,"(157644, 253570)","(382537, 4572100)","(157644, 382537)","(253570, 4615421)",same,same,247,143,3884744,4121821,4733,6306,4939375,3067190,4947,6092
which I think suggests that the two blocks are joined. 2,"(382537, 4572100)","(4615421, 6028721)","(253570, 4615421)","(4572100, 6097430)",same,same,21,11,256483,125785,195,116,295698,86570,233,78
joining the two. My understanding is here SNPs with ID 382537 and ID 4615421 should be reassigned to have ID 157644. But this seems to be not the case. What did I get wrong? Thanks!
There are some other filters to decide whether two blocks need to be connected. For example, these two blocks actually don't have a clear majority of read that supports the relationship (247 VS 143 and 21 VS 11). In this case, to make sure of MethPhaser's accuracy, we actually did not take the block assignment into account in our final vcf output.
See these 2 parameters:
-mc , --minimum_coverage Minimum read number to assign blocks' relationship. default: 0. Recommanded setting: mc = coverage/(autosome-wide block number/1000), for details please see the paper -vd , --voting_difference minimum voting difference for relationship assignment, default=0.5
I assume you mean the ratio between column "same_hap_num" and "diff_hap_num". Still using R9_60X chr6, pairs of phase blocks appear to be clear cuts also do not have all of their SNPs' group ID merged. Example of clear cut cases:
left=42473036, has 494 SNPs; right=43347903, has 6 SNPs; ratio is 0.00(99, 0)
35_43.csv:6,"(42473036, 43291981)","(43347903, 43405251)","(42318581, 43347903)","(43291981, 434
66484)",same,same,99,0,869687,837525,451,525,1707212,0,976,0
left=39323058, has 103 SNPs; right=39706505, has 1266 SNPs; ratio is 0.01(67, 1)
35_43.csv:2,"(39323058, 39658781)","(39706505, 41058383)","(39264926, 39706505)","(39658781, 411
15669)",same,same,67,1,457150,483654,313,388,939746,1058,698,3
left=157317226, has 1100 SNPs; right=158667958, has 3659 SNPs; ratio is 0.04(49, 2)
left=19106459, has 745 SNPs; right=20143505, has 262 SNPs; ratio is 0.06(129, 8)
left=37386219, has 627 SNPs; right=38322308, has 1079 SNPs; ratio is 0.06(78, 5)
left=4615421, has 1539 SNPs; right=6097430, has 3261 SNPs; ratio is 0.07(105, 7)
left=82746634, has 1 SNPs; right=83301314, has 12 SNPs; ratio is 0.07(9, 132)
...
Parsing code:
tmp = load_variants('../mp/run_methphaser_paper_chr6/hapcut2_v1.3.1/methphaser_zenodo/R9_60X/tmp_R9.60X.methtagged.chr6.vcf') # see above
d = '../mp/run_methphaser_paper_chr6/hapcut2_v1.3.1/methphaser_zenodo/R9_60X/chr6/'
ratios = []
for fn in os.listdir(d):
if not fn.endswith('csv'): continue
with open(d+fn) as file:
for line in file:
if line[0]==',': continue
if 'decide' in line: continue
line = line.replace('"(', '').replace(')"', '').replace(', ', '-').strip().split(',')
start = int(line[1].split('-')[0])
end = int(line[2].split('-')[0])
joint1, joint2 = line[5], line[6]
s1, s2 = int(line[7]), int(line[8])
ratio = min([s1, s2])/max([s1, s2])
if start in tmp and end in tmp:
ratios.append([ratio, start, end, s1, s2])
ratios.sort(key=lambda x:x[0])
for ratio, start, end, s1, s2 in ratios:
print(f'left={start}, has {len(tmp[start])} SNPs; right={end}, has {len(tmp[end])} SNPs; '+'ratio is %.2f'%ratio+f'({s1}, {s2})')
I vaguely recall that there might be some filter based on CpG numbers in each block. But I need to check further
Could you check if vcf is updated correctly, rather than what filtering criteria were used? I just would like to make sure this comment above is an expected result, and I'm free to take the vcf output as the final result & run whatshap compare on it for evaluation.
yes please
yes please
Just to clarify - do you mean given linked comment above, what I thought was vcf writing error is actually the expected behavior, i.e. when intermediate csv and vcf/bam disagree, I should disregard the csv?
yes because the the postprocessing script is basically applying filters to the csv files and alter the VCFs. if you don't see the change in the vcf it means the line in that csv file didn't pass some filter.
Got it, thank you so much! Closing this. I will test that PR later this week and update over there.
Sorry, reopening for an additional question:
In supplementary table 1, sample R9 60x, "SNV Phasing Unconnected Phaseblock Number" is 3179, "MethPhaser Connected Phaseblock Number" is 1457. I think this means in the vcf produced by methphaser, there should be 3178-1457=1721 phase blocks. In the zenodo data, R9.60X.methtagged.vcf
has 3183 unique phase block IDs, which seems to be almost the original amount. 708 of them has 1 heterozygous phased variant, 181 has 2, and 97 has3.
My local runs are like this, including the clean installation one mentioned above. Are you sure this is correct...?
edit: to be clear, I am not questioning methphaser's performance. By parsing the intermediate csv files & evaluated on them with custom scripts, methphaser was very nice on a local R9 HG002 chr6 run. I just wonder if phase group IDs are not updated either accidentally or intentionally in the final vcf, or I misinterpreted anything.
Sancheck, the vcf matches its md5 on zenodo (this file):
$ md5sum R9.60X.methtagged.vcf
b510c422168dcdffc5a02f97608cd8b8 R9.60X.methtagged.vcf
Here is the script for reproduce the stats:
def get_phased_variant_group_id(file):
for line in file:
if line[0]!='#':
line =line.strip().split('\t')
if 'PS' not in line[-2]: continue
tmp = line[-1].split(':')
tmp2 = tmp[0].split('|')
line[-2] = line[-2].split(':')
if '|' not in tmp[0] or tmp2[0]==tmp2[1]:
continue
ps = int(tmp[line[-2].index('PS')])
chrom = line[0]
pos = int(line[1])
yield chrom, pos, ps
with open('../mp/run_methphaser_paper_chr6/hapcut2_v1.3.1/methphaser_zenodo/R9_60X/R9.60X.methtagged.vcf') as file:
what = {}
for chrom, pos, ps in get_phased_variant_group_id(file):
if ps not in what: what[ps] = 0
what[ps]+=1
print(f'Total phase blocks: {len(what)}')
for i in range(1,4):
print(f'blocks with {i} variants', len([_ for _ in what.values() if _==i]))
@Fu-Yilei Would you offer any comments? (Pinging in case github did not send notification upon reopening.)
Sorry I didn't got the notification, the group id did looks weird I need to further check the vcf generation scripts.
Thanks! Appreciate it :)
Hi,
I'm trying to replicate supplementary data 1 on a single chromosome. I have a few questions regarding the post processing of methphaser results.
1) I believe the phase blocks in the output vcf are kept untouched, and in order to run
whatshap compare
, one needs to refer to the csv files in the work directory and create an intermediate vcf file. Is this correct? I did a quick search inMethPhaser_paper_scripts/post_processing_script.ipynb
but parsing seems to be not included there.2) Could you clarify the difference between "Methylation connected Phase Block Accuracy" and "Methylation connected Phase Block Success Rate" in supplementary data 1?
3) Is the HG002 R10 data used in the manuscript this one?
Thank you!