Evaluate performance on centromeres that are in the graph, using CHM13 reads

adamnovak commented 4 months ago

I did a first pass using 10k HiFi reads in https://ucsc-gi.slack.com/archives/CJ2EHEH1A/p1719953970881309?thread_ts=1719905671.084099&cid=CJ2EHEH1A and I concluded that while there's something going on on chrY in CHM13, there's not obviously a huge pile of wrongly-mapped or unmapped centromeric reads.

But we should actually pull out simulated-from-centromere reads for CHM13 for R10 and HiFi, and map them, and see if centromere reads are worse than other reads, and how well they map overall.

We should also see if CHM13 centromere reads are notably better than HG002 centromere reads, since that would suggest that adding more centromeres to the graph is actually going to help us.

adamnovak commented 3 months ago

What I want to do here is:

[ ] Add a filter in vg filter to let me intersect a BED with CHM13-simulated read refpos annotations
[ ] Grab the CHM13 centromeric regions from https://hgdownload.soe.ucsc.edu/gbdb/hs1/censat/censat.bb with bigBedToBed
[ ] Do the intersection and get a read name list from my 10k reads for in and out of the centromere
[ ] Pull those read sets from the compared GAM and count the proportion marked correct
[ ] Do some kind of statistical test to see if reads inside and outside the centromere differ significantly in how likely they are to be correct, or to be mapped
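
The last three steps could be prototyped outside vg along these lines. This is only a rough sketch under assumptions: it assumes the censat regions have already been converted to BED, and that each read has been boiled down to a hypothetical `(name, contig, pos, correct)` tuple taken from its refpos annotation and the compared GAM (that tuple format is made up for illustration and is not something vg emits directly). Fisher's exact test is just one candidate for the "some kind of statistical test" item; the function names here are all illustrative.

```python
import bisect
import math
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into per-contig sorted, merged (start, end) intervals."""
    by_contig = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split()[:3]
        by_contig[contig].append((int(start), int(end)))
    merged = {}
    for contig, ivs in by_contig.items():
        ivs.sort()
        out = [ivs[0]]
        for s, e in ivs[1:]:
            if s <= out[-1][1]:
                # Overlapping or abutting: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], e))
            else:
                out.append((s, e))
        merged[contig] = out
    return merged


def in_bed(bed, contig, pos):
    """True if 0-based pos on contig falls inside a half-open BED interval."""
    ivs = bed.get(contig, [])
    # Index of the last interval whose start is <= pos.
    i = bisect.bisect_right(ivs, (pos, math.inf)) - 1
    return i >= 0 and ivs[i][0] <= pos < ivs[i][1]


def tally(reads, bed):
    """Return {inside_bed: [n_correct, n_incorrect]} for the given reads.

    Each read is a hypothetical (name, contig, pos, correct) tuple.
    """
    counts = {True: [0, 0], False: [0, 0]}
    for _name, contig, pos, correct in reads:
        counts[in_bed(bed, contig, pos)][0 if correct else 1] += 1
    return counts


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    # Toy data only; real input would come from the censat BED and the compared GAM.
    bed = load_bed(["chr1\t100\t200\tcen_a", "chr1\t150\t300\tcen_b"])
    reads = [("r1", "chr1", 120, True), ("r2", "chr1", 250, False),
             ("r3", "chr1", 50, True), ("r4", "chr2", 10, True)]
    counts = tally(reads, bed)
    (a, b), (c, d) = counts[True], counts[False]
    print("inside:", counts[True], "outside:", counts[False])
    print("p-value:", fisher_exact_two_sided(a, b, c, d))
```

The same tally could be run a second time with "mapped vs. unmapped" in place of "correct vs. incorrect" to cover the other half of the question. With 10k reads the exact test is still cheap, though a chi-squared test would be a reasonable substitute.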

Finding HG002 centromere reads might be harder because I'd want to look at reads simulated from the HG002 centromere, not reads with a CHM13 refpos in the centromere. So I'd need to get a BED of HG002 centromeres, and reads annotated with refpos positions on HG002 as well as on CHM13.

adamnovak commented 3 months ago

The HG002 1.0.1 assemblies in the hub at https://genome.ucsc.edu/cgi-bin/hgGateway?hgHub_do_redirect=on&hgHubConnect.remakeTrackHub=on&hgHub_do_firstDb=1&hubUrl=https://research.nhgri.nih.gov/CustomTracks/T2T_hubs/HG002_Q100/hub.txt do also have cenSat tracks with bigBeds, like https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/annotation/centromere/hg002v1.0.1_v2.0/hg002v1.0.1.cenSatv2.0.noheader.bb

I think that's the right assembly we actually simulated from?

vgteam / long-read-giraffe-experiments

Evaluate performance on centromeres that are in the graph, using CHM13 reads #45