ncbi / SKESA

SKESA assembler

Help needed in understanding Saute output #40

Open swarnalilouha opened 1 year ago

swarnalilouha commented 1 year ago

I used Saute to assemble a reference fasta sequence '>CRYPT1020_1' from Illumina reads. I got 2 assemblies:

CRYPT1020_1:1:1:87926 2 4 1
CRYPT1020_1:1:2:87918 2 3 1

Why are there 2 assemblies and what do the numbers in the fasta headers mean?

souvorov commented 1 year ago

Assemblies often have multiple variants; the simplest case is a SNP. SAUTE arranges all variants in a graph (graph output is controlled by the --gfa option). You can analyze this graph if you install Bandage (https://rrwick.github.io/Bandage/). Up to 1000 variants are printed by SAUTE in FASTA format via --all_variants. The first part of the FASTA ID is Target name:graph number:contig number:estimated k-mer count. After that, the numbers of the graph nodes used are printed, separated by spaces. From what you posted, your graph has two variants, and the difference between them is represented by nodes 3 and 4. To understand what kind of difference this is, either look at the graph or align the two contigs.
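For readers who want to compare headers programmatically, here is a minimal Python sketch of the ID layout described above. It is not part of SAUTE; the names `SauteHeader`, `parse_saute_header`, and `differing_nodes` are hypothetical, and it assumes the four colon-separated fields are always the last ones in the ID and that node numbers are integers.

```python
from typing import NamedTuple


class SauteHeader(NamedTuple):
    """Fields of one variant header (hypothetical helper, not a SAUTE API)."""
    target: str
    graph: int
    contig: int
    kmer_count: int
    nodes: list[int]


def parse_saute_header(header: str) -> SauteHeader:
    """Split 'Target:graph:contig:kmer_count n1 n2 ...' into its parts."""
    header = header.strip().lstrip(">")
    fasta_id, *node_fields = header.split()
    # Split from the right so a target name containing ':' stays intact
    # (assumption: the last three fields are always numeric).
    target, graph, contig, kmer_count = fasta_id.rsplit(":", 3)
    return SauteHeader(
        target=target,
        graph=int(graph),
        contig=int(contig),
        kmer_count=int(kmer_count),
        nodes=[int(n) for n in node_fields],
    )


def differing_nodes(a: SauteHeader, b: SauteHeader) -> set[int]:
    """Nodes used by exactly one of the two variants."""
    return set(a.nodes) ^ set(b.nodes)


first = parse_saute_header("CRYPT1020_1:1:1:87926 2 4 1")
second = parse_saute_header("CRYPT1020_1:1:2:87918 2 3 1")
print(first.target, first.graph, first.contig, first.kmer_count, first.nodes)
print(sorted(differing_nodes(first, second)))
```

For the two headers from this issue, the differing nodes are 3 and 4, which matches the explanation above. The node numbers only say where the variants diverge; the graph or an alignment is still needed to see what the difference is.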

atongsa commented 9 months ago

Can SAUTE be used to assemble whole genome sequencing (WGS) data for humans?

souvorov commented 8 months ago

SAUTE was designed for assembling bacterial genes. It is not appropriate for assembling the human genome.

atongsa commented 8 months ago

Yes, I don't mean the whole genome, only assembling specific genes of the human genome with SAUTE from human WGS data.

souvorov commented 8 months ago

Try target sequences that extend slightly beyond the region of the gene of interest. It should work, unless there are large insertions, deletions, or rearrangements inside the gene's introns.
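One way to prepare such a target is to cut the gene out of a reference sequence with some flanking sequence on each side. The following Python sketch illustrates the idea only; `padded_region`, `extract_target`, and the default `flank` of 500 bases are hypothetical choices, not values recommended by SAUTE. Coordinates are assumed to be 0-based and half-open.

```python
def padded_region(gene_start: int, gene_end: int, flank: int, seq_len: int) -> tuple[int, int]:
    """Extend [gene_start, gene_end) by `flank` bases on each side, clipped to the sequence."""
    if not (0 <= gene_start < gene_end <= seq_len):
        raise ValueError("gene coordinates must lie within the reference sequence")
    if flank < 0:
        raise ValueError("flank must be non-negative")
    return max(0, gene_start - flank), min(seq_len, gene_end + flank)


def extract_target(reference: str, gene_start: int, gene_end: int, flank: int = 500) -> str:
    """Return the gene plus flanking sequence to use as a target (flank size is an assumption)."""
    start, end = padded_region(gene_start, gene_end, flank, len(reference))
    return reference[start:end]


# Toy example: a 10-base "reference" with the "gene" at positions 4..6 and 2 flanking bases.
print(extract_target("ACGTACGTAC", 4, 6, flank=2))
```

The extracted sequence can then be written to a FASTA file and passed to SAUTE as the target.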

atongsa commented 8 months ago

thank you very much