Open ekg opened 8 years ago
It looks like there might be a problem with the formatting of the Hd-rR assembly.
$ samtools faidx Medaka-Hd-rR-pacbio_version2.1.1.fasta
[fai_build_core] different line length in sequence '20'.
I have worked around this by:
fastahack -i Medaka-Hd-rR-pacbio_version2.1.1.fasta
Now I can construct a common fasta reference for both versions of chr24:
(samtools faidx Medaka-Hd-rR-pacbio_version2.1.1.fasta 24 | sed 's/^>24/>Hd-rR-24/'
samtools faidx Medaka-HNI-pacbio_version2.1.1.fasta 24 | sed 's/^>24/>HNI-24/' ) >chr24.fa
samtools faidx chr24.fa
cat chr24.fa.fai
#Hd-rR-24 24173018 10 60 61
#HNI-24 21621617 24575920 60 61
So this looks as expected. The sequence lengths in the chr24.fa match the inputs.
I'll now move to apply vg msga
as described https://github.com/vgteam/vg/wiki/Long-read-assemblies-using-vg-msga.
vg msga
runs.
eg10@vr-4-1-16:/lustre/scratch113/projects/graphs/medaka$ time vg-81b5a2cb msga -f chr24.fa -B 128 -K 11 -X 2 -E 3 -G -S 0.95 -H 5 -D -t 16 >chr24.vg
loading chr24.fa
preparing initial graph
building xg index
building GCSA2 index
HNI-24: adding to graph1
HNI-24: aligning sequence of 21621617bp against 982801 nodes
HNI-24: editing graph
HNI-24: sorting and compacting ids
building xg index
building GCSA2 index
Hd-rR-24: adding to graph1
Hd-rR-24: aligning sequence of 24173018bp against 1967064 nodes
Hd-rR-24: editing graph
Hd-rR-24: sorting and compacting ids
building xg index
building GCSA2 index
real 201m58.516s
user 1378m44.370s
sys 9m4.202s
It's not quick, and could be sped up a number of ways, but the result is sensible. We can check with vg stats -zls chr24.vg
:
# node and edge count
nodes 2085812
edges 2088364
# total length of sequences on nodes
length 45763830
# subgraph heads and length
1111025,1112138,2532950 45763830
This is not a good result! I have one subgraph but almost no compression. There are only 30kb of overlapping sequence!
So we basically have two whole chromosomes stuck together by just a tiny bit of sequence. Why does this happen? Could the divergence rate really be 25%, which is what's required for mapping failure?
I'm going to assume this project is finished.
Actually it is not, and probably should be explored again now that msga has stabilized.
On Fri, Jul 7, 2017, 20:33 Adam Novak notifications@github.com wrote:
I'm going to assume this project is finished.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/291#issuecomment-313759714, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EdJaELfpOWLHfzON0Uif7_fY_ASqks5sLnnhgaJpZM4IA78U .
I'm going to run vg msga on the two versions of chr24 in the Medaka genome assemblies (HNI and Hd-rR): http://utgenome.org/medaka_v2/#!Assembly.md. Will log progress here.