Construct a string graph from several fasta sequences

joseale2310 commented 9 years ago

Hello!

I address you because I want to perform a little experiment using graphs. I would like to make a string graph from several similar reference genomes. Once made, I would like to convert the string graph again into a fasta file, which would be be my new unique reference genome. It would be like a Multiple Sequence Analysis, but using string graphs. The final goal is to use this last reference to map some reads that I have. Could it be done using SGA?

Another question is if it is possible to actually draw a string graph in a picture (GFA, maybe?)

Thanks!

Alex.

jts commented 9 years ago

Hi Alex,

I'm not sure SGA is the best tool for this. If you want to try though you'd have to chop your references into ~100bp overlapping pieces to construct the graph. You'd have to process the output of sga overlap yourself as the algorithms of sga assemble will not deal well with this type of input.

I think you could visualize the graph by using gfatools to convert the ASQG output to GFA, then using a program like bandage.

Jared

ekg commented 9 years ago

I think you can use fermikit to do this. https://github.com/lh3/fermikit

You would follow a standard assembly pipeline using the reference genomes as input. It might be necessary to fragment the input genomes or sample them somehow. (@lh3 --- thoughts?)

At some point in the process a .mag file is output. I believe this describes the fermi2 graph. You can also convert it to GFA https://github.com/lh3/mag2gfa, for import into a number of other tools. (vg being one--- I have not tested this but I want to).

On Wed, Sep 9, 2015 at 9:13 PM, Jared Simpson notifications@github.com wrote:

Hi Alex,

I'm not sure SGA is the best tool for this. If you want to try though you'd have to chop your references into ~100bp overlapping pieces to construct the graph. You'd have to process the output of sga overlap yourself as the algorithms of sga assemble will not deal well with this type of input.

I think you could visualize the graph by using gfatools https://github.com/lh3/gfatools to convert the ASQG output to GFA, then using a program like bandage http://rrwick.github.io/Bandage/.

Jared

— Reply to this email directly or view it on GitHub https://github.com/jts/sga/issues/92#issuecomment-139017101.

joseale2310 commented 9 years ago

Thanks guys! I was thinking something similar. I tried to use your program directly with my sequences and didn't seem to do what it is supposed to do. I will try what you said and chop the sequences. I will let you know the result if you are interested.

Thanks again!

Alex.

jts commented 9 years ago

I suggest starting with a very small region (a few KB) that contains differences between the references. This will give you a tractable data set to view in vg/bandage to see whether it is doing the right thing. I'm certainly interested to hear how it goes.

Jared

On Thu, Sep 10, 2015 at 12:37 PM, joseale2310 notifications@github.com wrote:

Thanks guys! I was thinking something similar. I tried to use your program directly with my sequences and didn't seem to do what it is supposed to do. I will try what you said and chop the sequences. I will let you know the result if you are interested.

Thanks again!

Alex.

— Reply to this email directly or view it on GitHub https://github.com/jts/sga/issues/92#issuecomment-139304992.

lh3 commented 9 years ago

With fermikit, you would need something like this.

your_split_script.py ref-concat.fa 100 90 > pseudo-reads.fa # chop to 100bp with 90bp overlap
fermi.kit/fermi2.pl unitig -p prefix -El 100 -m0 pseudo-reads.fa > prefix.mak
make -f prefix.mak

prefix.pre.gz is the assembly graph you want. As @jts said, sga can of course achieve the similar goal.

joseale2310 commented 9 years ago

Hi, Jared! So, I tried to chop my sequences (100 bp long and overlapping 90 bp). I got my reads.asqg.gz. I used sga assemble first to see what happens, I think that it worked fine! I do not really know how to use gfatools, as I cannot find any documentation, neither a README file. I asked @lh3, and I will let you know if I all of this worked!

ekg commented 9 years ago

If you get GFA then you can use your output in vg. Use vg index and vg find to extract regions of the graph (if it is large) and then you can visualize or read them them using vg view. Bandage is also really nice for visualizing small to medium size graphs, and reads GFA.

Very cool! I hope it did work! On Sep 11, 2015 10:58 AM, "joseale2310" notifications@github.com wrote:

Hi, Jared! So, I tried to chop my sequences (100 bp long and overlapping 90 bp). I got my reads.asqg.gz. I used sga assemble first to see what happens, I think that it worked fine! I do not really know how to use gfatools, as I cannot find any documentation, neither a README file. I asked @lh3 https://github.com/lh3, and I will let you know if I all of this worked!

— Reply to this email directly or view it on GitHub https://github.com/jts/sga/issues/92#issuecomment-139504509.

joseale2310 commented 9 years ago

Hi! @lh3 isn't answering me and I have no idea about how to use gfatools, Do you know how to use it? I am in a hurry, because these are my last days in the project!

Thanks!

ekg commented 9 years ago

If you have GFA then you can import the output into various other graph processing systems for further analysis. There's a list in the GFA-spec readme: https://github.com/pmelsted/GFA-spec

On Sun, Sep 13, 2015 at 11:36 AM, joseale2310 notifications@github.com wrote:

Hi! @lh3 https://github.com/lh3 isn't answering me and I have no idea about how to use gfatools, Do you know how to use it? I am in a hurry, because these are my last days in the project!

Thanks!

— Reply to this email directly or view it on GitHub https://github.com/jts/sga/issues/92#issuecomment-139858783.

joseale2310 commented 9 years ago

I think I did not explained myself. I don't know how to convert ASQG to GFA using gfatools. There is no wiki or documentation for that.

joseale2310 commented 9 years ago

Hi!

I am updating what I have been able to do. I finally made the gfa file from the contig-graph, and the read-graph. Then I used Bandage to visualize the graph and see what I got.

I first use only two reference genomes, chopped them into 100 bp and overlapping 90 bp. The graph seems to be fine, but it is a closed, circular graph. The genomes are not circular, so why is the graph closed?

I also tried to re-assemble one reference genome using its pseudo-reads, but the re-assembled contig does not have the same sequence has the original genome. I tried with bigger pseudo-reads (500 bp) and more overlap (150-350 bp), and it improved the result, but it still doesn't return the original genome.

ekg commented 9 years ago

Have you tried aligning the original genome to the graph?

On Fri, Sep 18, 2015 at 9:28 AM, joseale2310 notifications@github.com wrote:

Hi!

I am updating what I have been able to do. I finally made the gfa file from the contig-graph, and the read-graph. Then I used Bandage to visualize the graph and see what I got.

I first use only two reference genomes, chopped them into 100 bp and overlapping 90 bp. The graph seems to be fine, but it is a closed, circular graph. The genomes are not circular, so why is the graph closed?

I also tried to re-assemble one reference genome using its pseudo-reads, but the re-assembled contig does not have the same sequence has the original genome. I tried with bigger pseudo-reads (500 bp) and more overlap (150-350 bp), and it improved the result, but it still doesn't return the original genome.

— Reply to this email directly or view it on GitHub https://github.com/jts/sga/issues/92#issuecomment-141380489.

joseale2310 commented 9 years ago

How would I do that? I could try to align the original genome to the contigs, but I don't know which of the options of sga could align to the graph.

ekg commented 9 years ago

You can do it with vg. It would look something like this:

vg view --gfa-in x.gfa >x.vg
vg index -gd x.gcsa -k 16 -X 2 -x x.xg x.vg

Now you'd have indexes for the genome that allow mapping. You can map each sequence like this to test if it's in the genome as expected:

vg map -s $seq -x x.xg -g x.gcsa -k 32 -B 10000 | vg view -a -

The output will be in json. It's possible to visualize the alignments against small regions of the graph with vg find and vg view. Binaries for linux can be found @ https://github.com/ekg/vg/releases. I expect to push a new release later today. You might want to work from that.

I'll be super interested if this works. Thus far no one has tried realignment against an SGA graph. There are quite a few parameters in the mapper that you might want to experiment with. I expect it to be possible but may need some tweaking.

jts / sga

Construct a string graph from several fasta sequences #92