Medaka graphs - Githubissues

ekg commented 8 years ago

I'm going to run vg msga on the two versions of chr24 in the Medaka genome assemblies (HNI and Hd-rR): http://utgenome.org/medaka_v2/#!Assembly.md. Will log progress here.

ekg commented 8 years ago

It looks like there might be a problem with the formatting of the Hd-rR assembly.

$ samtools faidx Medaka-Hd-rR-pacbio_version2.1.1.fasta
[fai_build_core] different line length in sequence '20'.

ekg commented 8 years ago

I have worked around this by:

fastahack -i Medaka-Hd-rR-pacbio_version2.1.1.fasta

Now I can construct a common fasta reference for both versions of chr24:

(samtools faidx Medaka-Hd-rR-pacbio_version2.1.1.fasta 24 | sed 's/^>24/>Hd-rR-24/' 
 samtools faidx Medaka-HNI-pacbio_version2.1.1.fasta 24 | sed 's/^>24/>HNI-24/' ) >chr24.fa
samtools faidx chr24.fa
cat chr24.fa.fai 
#Hd-rR-24        24173018        10      60      61
#HNI-24  21621617        24575920        60      61

So this looks as expected. The sequence lengths in the chr24.fa match the inputs.

I'll now move to apply vg msga as described https://github.com/vgteam/vg/wiki/Long-read-assemblies-using-vg-msga.

ekg commented 8 years ago

vg msga runs.

eg10@vr-4-1-16:/lustre/scratch113/projects/graphs/medaka$ time vg-81b5a2cb msga -f chr24.fa -B 128 -K 11 -X 2 -E 3 -G -S 0.95 -H 5 -D -t 16 >chr24.vg                                                                
loading chr24.fa
preparing initial graph
building xg index
building GCSA2 index
HNI-24: adding to graph1
HNI-24: aligning sequence of 21621617bp against 982801 nodes
HNI-24: editing graph
HNI-24: sorting and compacting ids
building xg index
building GCSA2 index
Hd-rR-24: adding to graph1
Hd-rR-24: aligning sequence of 24173018bp against 1967064 nodes
Hd-rR-24: editing graph
Hd-rR-24: sorting and compacting ids
building xg index
building GCSA2 index

real    201m58.516s
user    1378m44.370s
sys     9m4.202s

It's not quick, and could be sped up a number of ways, but the result is sensible. We can check with vg stats -zls chr24.vg:

# node and edge count
nodes   2085812
edges   2088364
# total length of sequences on nodes
length  45763830
# subgraph heads and length
1111025,1112138,2532950 45763830

This is not a good result! I have one subgraph but almost no compression. There are only 30kb of overlapping sequence!

So we basically have two whole chromosomes stuck together by just a tiny bit of sequence. Why does this happen? Could the divergence rate really be 25%, which is what's required for mapping failure?

adamnovak commented 7 years ago

I'm going to assume this project is finished.

ekg commented 7 years ago

Actually it is not, and probably should be explored again now that msga has stabilized.

On Fri, Jul 7, 2017, 20:33 Adam Novak notifications@github.com wrote:

I'm going to assume this project is finished.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/291#issuecomment-313759714, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EdJaELfpOWLHfzON0Uif7_fY_ASqks5sLnnhgaJpZM4IA78U .

vgteam / vg

Medaka graphs #291