evaluation and clarification for supplementary data 1

xfengnefx commented 1 month ago

Hi,

I'm trying to replicate supplementary data 1 on a single chromosome. I have a few questions regarding the post processing of methphaser results.

1) I believe the phase blocks in the output vcf are kept untouched, and in order to run whatshap compare, one needs to refer to the csv files in the work directory and create an intermediate vcf file. Is this correct? I did a quick search in MethPhaser_paper_scripts/post_processing_script.ipynb but parsing seems to be not included there.

2) Could you clarify the difference between "Methylation connected Phase Block Accuracy" and "Methylation connected Phase Block Success Rate" in supplementary data 1?

3) Is the HG002 R10 data used in the manuscript this one?

Thank you!

Fu-Yilei commented 1 month ago

the intermediate files are just csv files include the relative relationships between the phaseblocks. No VCF file needed.
We mentioned it in the paper "To check the accuracy of our phase block connection, we need to determine the true relationship between two neighboring SNV phase blocks. Two neighboring SNV phase blocks are considered to have the same haplotype assignment if their haplotype assignments’ relationship against the truth VCF is the same. For example, if haplotype 1 in SNV phase block 1 is haplotype 2 in the truth VCF file while the SNV phase block 2 has the same relationship, i.e., the haplotype 1 in SNV phase block 2 is also haplotype 2 in the truth VCF, we consider the SNV phase block 1 and SNV phase block 2 are having the same haplotype assignments. Vice versa, two neighboring SNV phase blocks’ have switched relationships means their haplotype assignment relationships to the truth VCF are also switched. Given these truth relationship assignments, we can easily know the correctness of MethPhaser’s each SNV phase block connection with our relationship output and further get the correct N50 on the HG002 sample in Fig. 2. Success rate is just recall rate of phaseblock connection.
no it was in the GIAB website when we published the paper but they somehow removed it later.. I think right now this one is the closest. https://labs.epi2me.io/giab-2023.05/ or this one https://labs.epi2me.io/askenazi-kit14-2022-12/

Let me know if you have any issue replicating the task.

Fu-Yilei commented 1 month ago

Sorry I think I misunderstood what you mean by question 1. The notebook do not process the intermediate results, it compares the CSVs to get the value of your 2nd question. To run whatshap compare you need to run the meth_phaser_post_processing script instead of the jupyter notebook. Sorry for the confusion of the file name.

xfengnefx commented 1 month ago

Thanks for the reply! For question 1, I did run the meth_phaser_post_processing script, like so:

meth_phaser_parallel -b haplotagged.bam \
-g  phase.vcf.gtf \
-r ref.fasta \
-vc phased.vcf.gz  \
-o work

meth_phaser_post_processing \
-ib haplotagged.bam \
-if work  \
-ov output.vcf \
-ob output \
-vc phased.vcf.gz

where haplotagged.bam and phased.vcf.* is the outputs from whatshap or hapcut2.

Yet the output vcf has the same number of unique phase block IDs as the input vcf (from whatshap). This was HG002 chr6, and methphaser definitely joined many blocks in both the csv file and the bam output. Is there anything I should sanity check on? Am I understanding the phase block IDs wrong? I assume whatshap compare would use them, so blocks that joined together need to share an ID.

Fu-Yilei commented 1 month ago

Yes this is weird because the script does merge the blocks when it reads the csv files from the intermediate output. Phase block IDs are just the start locations of the phaseblocks, which should be reduced after merging. One quick way to check is to see if the haplotype assignment (1|0 or 0|1) is flipped in some blocks, which could help you make sure the script is modifying the vcf file.

xfengnefx commented 1 month ago

edit: after a while I realized the csv's "same" and "not same" is relative to the whatshap vcf, not absolute. Some of the correct/wrong phasing said about the intermediate csv below might be incorrect, please disregard them.

The vcf the bam are being modified, but seems not in all places that need modification, in terms of consistency with the intermediate csv...? I wonder what am doing wrong. The following are from these files:

workdir.tar.gz
methphaser.vcf.gz
scripts.tar.gz
whatshap vcf: in the next comment (github file size limit)

HG002 with hg38 as reference and 1-index. This joining is all correct:

chr6:23,954,211-24,015,472

whatshap: 1|0, 1|0
methphaser vcf: 1|0, 0|1, i.e. not same; phase group ID updated
methphaser bam: 1|0, 0|1
truth: not same
methphaser csv: "(20732283, 23954210)","(24014924, 26754886)","(20473626, 24014924)","(23954210, 26766920)",not same,7,99,895238,965577,639,711,32932,1827883,35,1315

But for the following gap:

chr6:11,092,382-11,147,866

whatshap: 1|0, 1|0
methphaser vcf and bam: 1|0, 1|0, phase group ID not updated
truth: not same
methphaser csv: "(125000, 11092382)","(11147866, 13206934)","(125000, 11147866)","(11092382, 13289571)",not same,37,86,1126452,842599,696,522,626943,1342108,362,856

Here the intermediate csv has the correct decision, but vcf and bam did not flip the haplotags of the reads...? The left and the right variants in whatshap vcf:

chr6    11092382        .       G       C       28.14   PASS    F;ANN=C|upstream_gene_variant|MODIFIER|SMIM13|SMIM13|transcript|NM_001135575.2|protein_coding||c.-1932G>C|||||1452|,C|intergenic_re
gion|MODIFIER|ELOVL2-AS1-SMIM13|ELOVL2-AS1-SMIM13|intergenic_region|ELOVL2-AS1-SMIM13|||n.11092382G>C||||||  GT:GQ:DP:AD:AF:PS       1|0:28:101:51,48:0.4752:125000
chr6    11147866        .       C       T       12.56   PASS    F;ANN=T|intergenic_region|MODIFIER|SMIM13-NEDD9|SMIM13-NEDD9|intergenic_region|SMIM13-NEDD9|||n.11147866C>T||||||       GT:GQ:DP:AD
:AF:PS       1|0:12:89:51,38:0.427:11147866

For chr6:42318581-42403518, the decision seems not right and phase group ID is not updated:

whatshap vcf: 1|0, 1|0
methphaser vcf: 1|0, 1|0, phase group ID not updated
truth: not same (screenshot below; giab vcf: left is 0|1 and right is 1|0)
methphaser csv: "(33798956, 42319150)","(42403518, 43291981)","(33677338, 42403518)","(42319150, 43347903)",same,111,84,1204060,2549426,1332,2945,2279035,1474451,2613,1664

In IGV:

(^ top track is 1Mb subsequences of T2T HG002 aligned to ref with minimap2 -c -x asm5)

Another maybe unrelated issue is seen at chr6:43,291,981-43,347,903 , which is correctly assigned & modified as a not-same joint. The intermediate csv's line is "(42403518, 43291981)","(43347903, 44366651)","(42319150, 43347903)","(43291981, 44420831)",not same,16,86,381831,1009123,412,1127,138569,1252385,130,1409. However, the 1st SNP on the right, chr6:43,347,903, does not have its haplotag flipped. The 2nd SNP (43369580) and afterwards are all correctly flipped.

I did not retain the stdout or stderr for this run, but I think the only warnings were:

/micromamba/envs/methphaser-env/bin/meth_phaser_parallel:241: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  phased_df_chr.get_group(chromosome).iterrows()
/micromamba/envs/methphaser-env/bin/methphasing:7: DeprecationWarning:pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  from pkg_resources import require

xfengnefx commented 1 month ago

whatshap.vcf.gz

Fu-Yilei commented 4 weeks ago

did you use the postprocessing script in your pull request version? I think that actually ignored the block merge I did.

xfengnefx commented 4 weeks ago

I did a clean conda installation, but still got the same symptoms somehow. On a r10.4.1 e8.2 HG002 chr6, whatshap's phasing had 62 unique phase block IDs, post methphaser phasing it becomes 58. Here are the steps of this run:

Variant calling and SNP phasing files are reused from older runs of the pipeline via softlinks. It was epi2me-labs/wf-human-variation (calling clair3) and whatshap. Remove conda env of previous methphaser, then micromamba clean --all followed by create new env and micromamba install -c conda-forge -c bioconda pysam=0.22.0 python methphaser line-profiler (pinning pysam because my note says otherwise a version 0.8 would be grabbed).

Run the following bash script:

conda activate methphaser-clean

printf '=== conda list\n' > log
conda list >> log

printf '\n\n=== checkusms\n' >> log
md5sum /home/OXFORDNANOLABS/xfeng/micromamba/envs/methphaser-clean/bin/meth_phaser_parallel >> l
og
md5sum /home/OXFORDNANOLABS/xfeng/micromamba/envs/methphaser-clean/bin/methphasing >> log
md5sum /home/OXFORDNANOLABS/xfeng/micromamba/envs/methphaser-clean/bin/meth_phaser_post_processi
ng >> log

printf '\n\n=== file directory at the start\n' >> log
ls -lah >> log

printf '\n\n=== methphaser parallel\n' >> log
meth_phaser_parallel -b haplotagged.F4F256F2048.bam \
-g  phased.vcf.gtf \
-r /media/groups/machine_learning/active/xfeng/ref/hg38.fa \
-vc phased.vcf.gz  \
-o work 1>>log 2>&1

printf '\n\n=== methphaser post porcessing\n' >> log
meth_phaser_post_processing \
-ib haplotagged.F4F256F2048.bam \
-if work  \
-ov output.vcf \
-ob output \
-vc phased.vcf.gz 1>>log 2>&1

printf '\n\n=== current directory\n' >> log
ls -lah >> log

Upon finish, the log file is: log.txt. Checksums from the log file are:

a318f6a578b8351a825e1b52c1b92a98  /home/OXFORDNANOLABS/xfeng/micromamba/envs/methphaser-clean/bi
n/meth_phaser_parallel
2e316e7dd0c2758c21c45d28b02268ae  /home/OXFORDNANOLABS/xfeng/micromamba/envs/methphaser-clean/bi
n/methphasing
6f4fb3e0a1f61843cbedf9839ae0f53d  /home/OXFORDNANOLABS/xfeng/micromamba/envs/methphaser-clean/bi
n/meth_phaser_post_processing

which should be identical with commit 0e44165, tag V0.0.3.

Outputs:

Parsing of phase block IDs:

def load_variants(fn, group_override={}):
    groups = {}
    with open(fn) as file:
        for line in file:
            if line[0]=='#': continue
            line = line.strip().split('\t')
            if len(line[3])!=1 or len(line[4])!=1: continue
            if line[3] not in 'ATCG' or line[4] not in 'ATCG': continue

            tags = line[-2].split(':')
            values = line[-1].split(':')

            if 'GT' not in tags: continue
            if 'PS' not in tags: continue

            genotype = values[tags.index('GT')]
            group = values[tags.index('PS')]
            if '|' not in genotype: continue
            try:
                group = int(group)
            except:
                continue
            if group in group_override:
                group = group_override[group]
            if genotype[0]==genotype[2]: continue

            pos = int(line[1])
            if group not in groups:
                groups[group] = {}
            groups[group][pos] = genotype
    return groups
tmp = load_variants('../mp/troubleshoot_methphaser_blockID_again/phased.vcf')
print(len(tmp))  # 62
tmp = load_variants('../mp/troubleshoot_methphaser_blockID_again/output.vcf')
print(len(tmp))  # 58

Do you by any chance still have the hapcut2 phased vcf files of the runs described in the paper? I'm using the intermediate csv files at the moment and would like to make sure my parsing is consistent with what would be obtained in the correct vcf.

Fu-Yilei commented 3 weeks ago

https://zenodo.org/records/11195009 This one I think

xfengnefx commented 3 weeks ago

I was aware of the zenodo link, but it did not contain the vcf produced by hapcut2 prior to methphase. I would like to see how many gaps were there for chr6 and run whatshap compare on both vcf. Are they available anywhere?

Putting that aside, I try to understand the block relationships using R9_60X chr6 methphaser-tagged vcf and the intermediate csv files:

There are 47 SNPs with block ID 157644, and 4088 SNPs with ID 382537. However, in the intermediate csv there is this line: 1,"(157644, 253570)","(382537, 4572100)","(157644, 382537)","(253570, 4615421)",same,same,247,143,3884744,4121821,4733,6306,4939375,3067190,4947,6092 which I think suggests that the two blocks are joined.
The next block ID is 4615421 and there are 1539 SNPs with it, but we have 2,"(382537, 4572100)","(4615421, 6028721)","(253570, 4615421)","(4572100, 6097430)",same,same,21,11,256483,125785,195,116,295698,86570,233,78 joining the two.

My understanding is here SNPs with ID 382537 and ID 4615421 should be reassigned to have ID 157644. But this seems to be not the case. What did I get wrong? Thanks!

Fu-Yilei commented 3 weeks ago

There are some other filters to decide whether two blocks need to be connected. For example, these two blocks actually don't have a clear majority of read that supports the relationship (247 VS 143 and 21 VS 11). In this case, to make sure of MethPhaser's accuracy, we actually did not take the block assignment into account in our final vcf output.

See these 2 parameters:

-mc , --minimum_coverage Minimum read number to assign blocks' relationship. default: 0. Recommanded setting: mc = coverage/(autosome-wide block number/1000), for details please see the paper -vd , --voting_difference minimum voting difference for relationship assignment, default=0.5

xfengnefx commented 3 weeks ago

I assume you mean the ratio between column "same_hap_num" and "diff_hap_num". Still using R9_60X chr6, pairs of phase blocks appear to be clear cuts also do not have all of their SNPs' group ID merged. Example of clear cut cases:

left=42473036, has 494 SNPs; right=43347903, has 6 SNPs; ratio is 0.00(99, 0)
35_43.csv:6,"(42473036, 43291981)","(43347903, 43405251)","(42318581, 43347903)","(43291981, 434
66484)",same,same,99,0,869687,837525,451,525,1707212,0,976,0

left=39323058, has 103 SNPs; right=39706505, has 1266 SNPs; ratio is 0.01(67, 1)
35_43.csv:2,"(39323058, 39658781)","(39706505, 41058383)","(39264926, 39706505)","(39658781, 411
15669)",same,same,67,1,457150,483654,313,388,939746,1058,698,3

left=157317226, has 1100 SNPs; right=158667958, has 3659 SNPs; ratio is 0.04(49, 2)
left=19106459, has 745 SNPs; right=20143505, has 262 SNPs; ratio is 0.06(129, 8)
left=37386219, has 627 SNPs; right=38322308, has 1079 SNPs; ratio is 0.06(78, 5)
left=4615421, has 1539 SNPs; right=6097430, has 3261 SNPs; ratio is 0.07(105, 7)
left=82746634, has 1 SNPs; right=83301314, has 12 SNPs; ratio is 0.07(9, 132)
...

Parsing code:

tmp = load_variants('../mp/run_methphaser_paper_chr6/hapcut2_v1.3.1/methphaser_zenodo/R9_60X/tmp_R9.60X.methtagged.chr6.vcf') # see above

d = '../mp/run_methphaser_paper_chr6/hapcut2_v1.3.1/methphaser_zenodo/R9_60X/chr6/'
ratios = []
for fn in os.listdir(d):
    if not fn.endswith('csv'): continue
    with open(d+fn) as file:
        for line in file:
            if line[0]==',': continue
            if 'decide' in line: continue
            line = line.replace('"(', '').replace(')"', '').replace(', ', '-').strip().split(',')

            start = int(line[1].split('-')[0])
            end = int(line[2].split('-')[0])
            joint1, joint2 = line[5], line[6]
            s1, s2 = int(line[7]), int(line[8])
            ratio = min([s1, s2])/max([s1, s2])
            if start in tmp and end in tmp:
                ratios.append([ratio, start, end, s1, s2])
ratios.sort(key=lambda x:x[0])
for ratio, start, end, s1, s2 in ratios:
    print(f'left={start}, has {len(tmp[start])} SNPs; right={end}, has {len(tmp[end])} SNPs; '+'ratio is %.2f'%ratio+f'({s1}, {s2})')

Fu-Yilei commented 3 weeks ago

I vaguely recall that there might be some filter based on CpG numbers in each block. But I need to check further

xfengnefx commented 3 weeks ago

Could you check if vcf is updated correctly, rather than what filtering criteria were used? I just would like to make sure this comment above is an expected result, and I'm free to take the vcf output as the final result & run whatshap compare on it for evaluation.

Fu-Yilei commented 3 weeks ago

yes please

xfengnefx commented 3 weeks ago

yes please

Just to clarify - do you mean given linked comment above, what I thought was vcf writing error is actually the expected behavior, i.e. when intermediate csv and vcf/bam disagree, I should disregard the csv?

Fu-Yilei commented 3 weeks ago

yes because the the postprocessing script is basically applying filters to the csv files and alter the VCFs. if you don't see the change in the vcf it means the line in that csv file didn't pass some filter.

xfengnefx commented 3 weeks ago

Got it, thank you so much! Closing this. I will test that PR later this week and update over there.

xfengnefx commented 3 weeks ago

Sorry, reopening for an additional question:

In supplementary table 1, sample R9 60x, "SNV Phasing Unconnected Phaseblock Number" is 3179, "MethPhaser Connected Phaseblock Number" is 1457. I think this means in the vcf produced by methphaser, there should be 3178-1457=1721 phase blocks. In the zenodo data, R9.60X.methtagged.vcf has 3183 unique phase block IDs, which seems to be almost the original amount. 708 of them has 1 heterozygous phased variant, 181 has 2, and 97 has3.

My local runs are like this, including the clean installation one mentioned above. Are you sure this is correct...?

edit: to be clear, I am not questioning methphaser's performance. By parsing the intermediate csv files & evaluated on them with custom scripts, methphaser was very nice on a local R9 HG002 chr6 run. I just wonder if phase group IDs are not updated either accidentally or intentionally in the final vcf, or I misinterpreted anything.

Sancheck, the vcf matches its md5 on zenodo (this file):

$ md5sum R9.60X.methtagged.vcf
b510c422168dcdffc5a02f97608cd8b8  R9.60X.methtagged.vcf

Here is the script for reproduce the stats:

def get_phased_variant_group_id(file):
    for line in file:
        if line[0]!='#': 
            line =line.strip().split('\t')
            if 'PS' not in line[-2]: continue
            tmp = line[-1].split(':')
            tmp2 = tmp[0].split('|')
            line[-2] = line[-2].split(':')
            if '|' not in tmp[0] or tmp2[0]==tmp2[1]:
                continue
            ps = int(tmp[line[-2].index('PS')])
            chrom = line[0]
            pos = int(line[1])
            yield chrom, pos, ps

with open('../mp/run_methphaser_paper_chr6/hapcut2_v1.3.1/methphaser_zenodo/R9_60X/R9.60X.methtagged.vcf') as file:
    what = {}
    for chrom, pos, ps in get_phased_variant_group_id(file):
        if ps not in what: what[ps] = 0
        what[ps]+=1
print(f'Total phase blocks: {len(what)}')
for i in range(1,4):
    print(f'blocks with {i} variants', len([_ for _ in what.values() if _==i]))

xfengnefx commented 1 week ago

@Fu-Yilei Would you offer any comments? (Pinging in case github did not send notification upon reopening.)

Fu-Yilei commented 1 week ago

Sorry I didn't got the notification, the group id did looks weird I need to further check the vcf generation scripts.

xfengnefx commented 1 week ago

Thanks! Appreciate it :)

treangenlab / methphaser

evaluation and clarification for supplementary data 1 #26