jts / sga

de novo sequence assembler using string graphs
http://genome.cshlp.org/content/22/3/549
237 stars 82 forks source link

scaffold2fasta: Assertion `expectedOverlap < (int)contigString.length()' failed #125

Open sjackman opened 8 years ago

sjackman commented 8 years ago

Any suggestions?

sga scaffold2fasta -m 500 --write-unplaced --use-overlap -o hsapiens-scaffolds.fa -a hsapiens-graph.asqg.gz hsapiens.scaf
Graph best: 0
Graph unique: 1
Find overlaps: 4
Deleted edges for 0 super repetitive vertices
Warning: removed 2900 duplicate edges
sga: ScaffoldRecord.cpp:317: bool ScaffoldRecord::introduceGap(int, const string&, const ScaffoldLink&, std::__cxx11::string&) const: Assertion `expectedOverlap < (int)contigString.length()' failed.
Vertices: 6703921 Edges: 28309816 Islands: 136547 Tips: 625936 Monobranch: 3553099 Dibranch: 1249436 Simple: 5058208
jts commented 8 years ago

I haven't seen that one before - did you use the -mean setting for DistanceEst?

sjackman commented 8 years ago

I did at first, and I had suspected that's what had gone wrong. But, no, I repeated the estimates with --mle and got the same error.

jts commented 8 years ago

I'm at a bit of a loss - could you send the .scaf file and .asqg.gz (or contig fasta, if the error appears with that input too). I won't be able to look until next week though.

sjackman commented 7 years ago

Here you go:

sga scaffold2fasta -m 500 --write-unplaced --use-overlap -o hsapiens-scaffolds.fa -a hsapiens-graph.asqg.gz hsapiens.scaf
sjackman commented 7 years ago

Any chance you've had some time to look into this issue, Jared?

jts commented 7 years ago

Sorry, no. My time is short these days I'm afraid. Is this blocking you from something?

sjackman commented 7 years ago

No worries. I am sympathetic to time constraints. We wanted to include SGA in an assembly comparison of GIAB HG004. SGA will be included in the comparison of contigs. Including scaffold results hinges on this issue.

sjackman commented 7 years ago

What's the format of the .scaf file? Perhaps I could use MergeContigs from ABySS as a hack work around to create the scaffolds FASTA file.

contig-6181580  contig-4954829,228,151.2,0,0,D  contig-197051,2178,122.2,0,1,D
sjackman commented 7 years ago

The format is:

std::ostream& operator<<(std::ostream& out, const ScaffoldLink& link)
{
    out << link.endpointID << "," << link.distance << "," << link.stdDev << ","
        << link.edgeData.getDir() << "," << link.edgeData.getComp() << "," << link.getTypeCode();
    return out;
}

https://github.com/jts/sga/blob/master/src/Scaffold/ScaffoldLink.cpp#L118

sjackman commented 7 years ago

Could I convert this to an ABySS path file like so, or is it not quite so simple? How do I convert the Dir and Comp components to a + / - orientation?

1    contig-6181580+ 100N contig-4954829+ 100N contig-197051-
sjackman commented 7 years ago

Is the .scaf file the final list of which contigs go in which scaffolds with one scaffold per line, or is it a list of distance estimates that still need to be merged into larger scaffolds?

jts commented 7 years ago

The former - it is the final list of contigs that go into scaffolds.

sjackman commented 7 years ago

Can you give me a brief description of how to convert a SGA scaf file to an ABySS path file? If you are able to convert the following .scaf record manually, I can figure it out and write the converter.

contig-1297628  contig-1106169,-24,25.3,0,1,D   contig-4496238,-47,17.3,1,1,D   contig-2839914,-92,242.9,0,1,D  contig-1247260,-62,97.3,1,1,D   contig-2437944,3840,177,0,0,D   contig-368915,112,186.4,0,1,D   contig-4976804,-306,22.8,1,0,D  contig-3455841,371,216.8,1,1,D  contig-3969805,201,94.5,0,0,D   contig-742829,-92,135.6,0,1,D   contig-3111798,11,28.9,1,1,D    contig-1750901,-79,27.5,0,0,D

Uses cases like this one are why I so badly want to a GFA standard! It would be great if scaffold2fasta and MergeContigs accepted the same file formats and were interchangeable. Please do comment on the proposed GFA 2 spec if you find the time. The format of the path record is one of the file items to nail down. https://github.com/GFA-spec/GFA-spec/pull/48

jts commented 7 years ago

Hi Shaun,

Agreed about GFA2.

I describe the file format here:

https://groups.google.com/forum/#!msg/sga-users/0pqCJwpPe8A/RRvCVPKU25QJ;context-place=forum/sga-users

The orientation bit describes whether the scaffold should be built left-to-right (the first contig in the record is the first contig in the scaffold) or right-to-left (the first contig is the last record). See here:

https://github.com/jts/sga/blob/master/src/Scaffold/ScaffoldRecord.cpp#L45

I suggest you don't spend much time on this - SGA is no longer under active development and there are better short read assemblers out there.

Sorry I can't help more!

Jared

sjackman commented 7 years ago

No worries, Jared. Quick comments are helpful. We're including SGA in this comparison to ABySS 2.0 due to its low memory usage. We're also comparing to DISCOVARdenovo, and it yields great contiguity, but takes a lot of memory and time.

From the code it appears that only the direction bit of the first link matters, and the rest are ignored. Is that correct? https://github.com/jts/sga/blob/master/src/Scaffold/ScaffoldRecord.cpp#L60

jts commented 7 years ago

Yes I believe that is correct but verify :)

On Wed, Nov 2, 2016 at 1:48 PM, Shaun Jackman notifications@github.com wrote:

No worries, Jared. Quick comments are helpful. We're including SGA in this comparison due to its low memory usage.

From the code it appears that only the direction bit of the first link matters, and the rest are ignored. Is that correct? https://github.com/jts/sga/blob/master/src/Scaffold/ScaffoldRecord.cpp#L60

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/jts/sga/issues/125#issuecomment-257943999, or mute the thread https://github.com/notifications/unsubscribe-auth/AAXxn0Uo125sAzXYLlyxrrUIKqEG5onfks5q6MzWgaJpZM4JYrqM .

sjackman commented 7 years ago

Here's a hacktastic script to convert a SGA .scaf file to an ABySS .path file. Thanks for your help, Jared.

#!/bin/sh
set -eu
exec gawk -e '
        !/\t/ { next }
        {
            # Determine whether the contigs are in reverse order.
            reverse = /^[^\t]*\t[^\t]*,1,[01],D/
            # Orient the contigs.
            $1 = $1 "+"
            rc = 0
            for (i = 2; i <= NF; ++i) {
                if ($i ~ /,1,D/)
                    rc = !rc
                sub(/,.*/, rc ? "-" : "+", $i)
            }
            # Print the contig IDs and orientations.
            printf "%u", id++
            if (reverse) {
                for (i = NF; i > 0; --i)
                    printf " %s", $i
                printf "\n"
            } else {
                print " " $0
            }
        }' "$@" \
    | gsed -e 's/ /\t/;s/ / 199N /g'
jts commented 7 years ago

Great, thanks for the script.