vgteam / long-read-giraffe-experiments

Experimental materials for testing the long-read mapping abilities of vg giraffe

Evaluate performance on centromeres that are in the graph, using CHM13 reads #45

Closed adamnovak closed 2 months ago

adamnovak commented 4 months ago

I did a first pass using 10k HiFi reads in https://ucsc-gi.slack.com/archives/CJ2EHEH1A/p1719953970881309?thread_ts=1719905671.084099&cid=CJ2EHEH1A. I concluded that while something is going on with chrY in CHM13, there isn't obviously a huge pile of wrongly-mapped or unmapped centromeric reads.

But we should actually pull out simulated-from-centromere reads for CHM13 for R10 and HiFi, map them, and see if centromere reads are worse than other reads, and how good they are overall.

We should also see if CHM13 centromere reads are notably better than HG002 centromere reads, since that would suggest that adding more centromeres to the graph is actually going to help us.
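As a rough sketch of the comparison described above, something like the following could tally the fraction of correctly mapped reads per group (for example centromere vs. other, or CHM13 vs. HG002). It assumes each read has already been reduced to a `(group_label, is_correct)` pair; the group labels and the definition of "correct" are placeholders here, not something settled in this thread.

```python
from collections import Counter

def correct_fraction_by_group(records):
    """Summarize mapping correctness per group.

    records: iterable of (group_label, is_correct) pairs. Both the labels
    and the correctness calls are assumed to come from an upstream step
    that is not shown here.
    Returns {group_label: (n_correct, n_total, fraction_correct)}.
    """
    totals = Counter()
    correct = Counter()
    for group, ok in records:
        totals[group] += 1
        if ok:
            correct[group] += 1
    # Every group in totals has at least one read, so no division by zero.
    return {g: (correct[g], totals[g], correct[g] / totals[g]) for g in totals}

# Hypothetical toy input, just to show the shape of the data.
example = [
    ("centromere", True),
    ("centromere", False),
    ("other", True),
    ("other", True),
]
summary = correct_fraction_by_group(example)
```

Keeping the raw counts next to the fraction makes it easier to tell whether a difference between groups rests on enough reads to mean anything.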

adamnovak commented 3 months ago

What I want to do here is:

Finding HG002 centromere reads might be harder because I'd want to look at reads simulated from the HG002 centromere, not reads with a CHM13 refpos in the centromere. So I'd need to get a BED of HG002 centromeres, and reads with refpos positions on it as well as CHM13.
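A minimal sketch of the BED-based selection, assuming each read's true position has already been pulled out as a `(contig, offset)` pair with a 0-based offset. How those positions get extracted from the simulated reads isn't shown, and the contig names below are made up for illustration.

```python
import bisect
from collections import defaultdict

def load_bed(lines):
    """Parse BED lines into per-contig sorted, merged (start, end) intervals."""
    by_contig = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and the usual BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        by_contig[fields[0]].append((int(fields[1]), int(fields[2])))
    merged = {}
    for contig, intervals in by_contig.items():
        intervals.sort()
        out = []
        for start, end in intervals:
            if out and start <= out[-1][1]:
                # Overlapping or touching: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[contig] = out
    return merged

def in_regions(regions, contig, pos):
    """True if 0-based pos on contig lies in a half-open [start, end) interval."""
    intervals = regions.get(contig, [])
    # Index of the last interval whose start is <= pos.
    i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

# Hypothetical BED content and contig names.
regions = load_bed(["chr1\t100\t200", "chr1\t150\t300", "chrY\t0\t10"])
```

Loading one set of regions per assembly (one from a CHM13 centromere BED, one from an HG002 centromere BED) would let the same `in_regions` check be applied to each read's position on either assembly.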

adamnovak commented 3 months ago

The HG002 1.0.1 assemblies in the hub at https://genome.ucsc.edu/cgi-bin/hgGateway?hgHub_do_redirect=on&hgHubConnect.remakeTrackHub=on&hgHub_do_firstDb=1&hubUrl=https://research.nhgri.nih.gov/CustomTracks/T2T_hubs/HG002_Q100/hub.txt do also have cenSat tracks with bigBeds, like https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/annotation/centromere/hg002v1.0.1_v2.0/hg002v1.0.1.cenSatv2.0.noheader.bb

I think that's the right assembly we actually simulated from?