the difference of segments and path

Jerry-is-a-mouse commented 5 months ago

Dear @paoloshasta, when I ran Shasta in Mode 2, I got three FASTA outputs. I am confused about the difference between segments and paths as described in https://paoloshasta.github.io/shasta/. Could you draw segments, paths, and bubble chains in a picture to describe them in more detail? Best wishes! The picture is as follows.

paoloshasta commented 5 months ago

I use GFA terminology:

A segment is a contiguous piece of sequence, often called a contig. In the figure you attached, the segments are green and red.
A link is a "connection" between two segments and is drawn in black in the picture above.
A path is a linear sequence of segments, where two consecutive segments in the sequence are joined by a link.
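To make the three definitions concrete, here is a minimal Python sketch that reads the `S` (segment) and `L` (link) records of a GFA file and checks whether a list of oriented segments forms a path. The function names `parse_gfa` and `is_path` are made up for illustration and are not part of Shasta. For simplicity the sketch only checks links as written and ignores the reverse-complement link that each GFA link also implies.

```python
def parse_gfa(lines):
    """Collect segments and links from GFA 1 text lines.

    Returns (segments, links), where segments maps a segment name to its
    sequence and links is a set of (from, from_orient, to, to_orient).
    """
    segments = {}
    links = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            # S <name> <sequence> ...
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            links.add((fields[1], fields[2], fields[3], fields[4]))
    return segments, links


def is_path(steps, links):
    """Return True if each consecutive pair of steps is joined by a link.

    steps is a list of (segment_name, orientation) tuples.
    """
    return all(
        (a, a_orient, b, b_orient) in links
        for (a, a_orient), (b, b_orient) in zip(steps, steps[1:])
    )
```

For a real assembly you could pass an open file handle for `Assembly-Phased.gfa` as `lines`.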

The write up you quote attempts to describe the Mode 2 assembly process, in which a Detailed assembly representation is first created, where each heterozygous locus corresponds to a small "bubble". During assembly, those small "bubbles" are phased relative to each other, which allows Shasta to create a path in the Detailed representation of the assembly which is believed to correspond to the true sequence of one haplotype.

The final result of this process, the Phased representation of the assembly, is contained in Assembly-Phased.fasta and Assembly-Phased.gfa and consists of large "bubbles" like the ones in the figure you posted. Each segment in that picture corresponds to a path in the Detailed representation of the assembly, but you don't need to concern yourself with that.

You can use Bandage to generate a picture like the one you posted, for your assembly. You would load Assembly-Phased.gfa in Bandage.

So the final product (the Phased representation of the assembly contained in Assembly-Phased.fasta and Assembly-Phased.gfa) will look similar to the picture you posted. It will consist of "bubble chains", each consisting of phased regions with two haplotypes (green in the above picture, segment names beginning with PR) and unphased regions in which only one "average" haplotype is assembled (red in the above picture, segment names beginning with UR). The Phased representation of the assembly could also contain additional segments that are not in bubble chains and have names consisting of numbers. These correspond to sequence that the assembler was not able to connect into bubble chains. They are usually short, and for your purposes you can probably ignore them.
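As a rough illustration of the naming convention just described, the following Python sketch sorts segment names into the three groups (PR, UR, and purely numeric names). The function names and category labels are hypothetical and only for illustration. The sketch relies only on the prefixes mentioned above, not on the meaning of the remaining name fields.

```python
from collections import Counter


def classify_segment(name):
    """Classify a segment of the Phased representation by its name."""
    if name.startswith("PR."):
        return "phased"      # one haplotype of a phased region
    if name.startswith("UR."):
        return "unphased"    # single "average" haplotype
    if name.isdigit():
        return "unchained"   # not part of any bubble chain
    return "unknown"


def summarize(names):
    """Count how many segment names fall into each category."""
    return Counter(classify_segment(name) for name in names)
```

The names could come from the header lines of Assembly-Phased.fasta or from the `S` records of Assembly-Phased.gfa.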

I hope this helps, but feel free to ask for additional clarification if needed.

Jerry-is-a-mouse commented 5 months ago

@paoloshasta Thank you very much for your sincere reply! One more question, about the Assembly-Haploid.fasta output. My project aims to get two haplotype genomes for evaluation (a single haplotype output would also be acceptable), so should I use Assembly-Haploid.fasta for the evaluation, or the other two FASTA files?

paoloshasta commented 5 months ago

The Phased description of the assembly contains as much phasing as the assembler was able to achieve. For example, with reference to the picture you posted, the assembler cannot tell if segment PR.0.29.31.0 is on the same haplotype as PR.0.31.27.0 or PR.0.31.27.1. This usually happens because of a long intervening stretch of sequence which is homozygous (or has a very low heterozygosity rate), and the assembler cannot find reads that contain sufficient information to phase the two bubbles relative to each other.

So the assembly is only able to provide two separately assembled haplotypes over the length of each pair of PR segments whose names differ only in their last field (.0 versus .1).
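A small Python sketch of that pairing rule: group the PR segment names by everything except the last field, and keep the groups that have both a `.0` and a `.1` member. The function name `pair_phased_segments` is made up for illustration.

```python
from collections import defaultdict


def pair_phased_segments(names):
    """Pair PR segments whose names differ only in the last field.

    Returns a dict mapping the shared name prefix to a tuple
    (name ending in .0, name ending in .1).
    """
    bubbles = defaultdict(dict)
    for name in names:
        if not name.startswith("PR."):
            continue  # UR and numeric segments have no haplotype pair
        bubble, haplotype = name.rsplit(".", 1)
        bubbles[bubble][haplotype] = name
    return {
        bubble: (members["0"], members["1"])
        for bubble, members in bubbles.items()
        if "0" in members and "1" in members
    }
```

Each resulting pair is one bubble over which the two haplotypes are assembled separately. Nothing in the names says which member of one pair belongs with which member of another pair.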

The Haploid representation of the assembly does not meet your needs because it simply discards the weakest haplotype at each PR bubble and then concatenates the remaining UR and PR segments to obtain a "haploid" representation of the assembly which is a mixture of the two haplotypes.

Due to read length limitations and the presence of nearly homozygous stretches, it is generally impossible to completely phase an assembly using just ONT reads. Here are some other things that you could consider trying:

Use longer ONT reads, obtained with an ONT Ultra-Long (UL) protocol. These can give reads with 70-100 Kb N50, which will considerably help phasing and give you longer PR segments.
Use GFAse in conjunction with parental or proximity ligation data to improve the phasing.
Take a look at the end of the discussion on issue #17. There I provided a script that will simply concatenate PR and UR segments to produce two "random" haplotypes. I call those "random" haplotypes because, due to the phasing uncertainty I discussed above, they can contain a haplotype switch error at each new PR location.
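For orientation only, here is a minimal Python sketch of what such a concatenation could look like. It is not the script from issue #17. It assumes, without confirmation from this thread, that segment names have the form `UR.<chain>.<position>` and `PR.<chain>.<position>.<component>.<haplotype>`, that segments are concatenated in order of position within each bubble chain, and that no overlap trimming or reverse complementing is needed. The function name `random_haplotypes` is hypothetical.

```python
from collections import defaultdict


def random_haplotypes(segments):
    """Build two "random" haplotypes per bubble chain.

    segments maps segment name to sequence. Assumed name layout:
    UR.<chain>.<position> and PR.<chain>.<position>.<component>.<haplotype>.
    Returns {chain: (haplotype0_sequence, haplotype1_sequence)}.
    """
    # chain -> position -> [sequence for haplotype 0, sequence for haplotype 1]
    chains = defaultdict(dict)
    for name, sequence in segments.items():
        fields = name.split(".")
        if fields[0] == "UR" and len(fields) == 3:
            chain, position = int(fields[1]), int(fields[2])
            # Unphased region: the same sequence goes into both haplotypes.
            chains[chain][position] = [sequence, sequence]
        elif fields[0] == "PR" and len(fields) == 5:
            chain, position = int(fields[1]), int(fields[2])
            haplotype = int(fields[4])
            pair = chains[chain].setdefault(position, ["", ""])
            pair[haplotype] = sequence
        # Any other name (for example, purely numeric) is skipped.

    result = {}
    for chain, by_position in chains.items():
        ordered = [by_position[position] for position in sorted(by_position)]
        result[chain] = (
            "".join(pair[0] for pair in ordered),
            "".join(pair[1] for pair in ordered),
        )
    return result
```

Because the choice of which PR segment goes into which output at each bubble is arbitrary, the two sequences can switch haplotypes at every new PR location, which is exactly the uncertainty described above.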

Jerry-is-a-mouse commented 5 months ago

Many thanks again! I understand what you said, and I think it meets my needs. I just want to compare different phased diploid genome assembly tools without additional information (e.g. trio, HiC, Strand-Seq, etc.). I have also tried other tools to get primary contigs (and alternate contigs).

paoloshasta commented 3 months ago

I am closing this due to lack of discussion. Feel free to open a new issue if additional discussion topics emerge.

paoloshasta / shasta

the difference of segments and path #23