vgl-hub / gfastats

A single fast and exhaustive tool for summary statistics and simultaneous *fa* (fasta, fastq, gfa [.gz]) genome assembly file manipulation.
MIT License

Feature Request : Circular Indicator #32

Closed erinyoung closed 1 year ago

erinyoung commented 2 years ago

This is a microbial issue, but it would be handy if the output of gfastats indicated whether or not a sequence is circular.

I have attached a gfa that I assembled with flye for an example. I had to change the filename to end with txt so that I could upload it to github, but it is a gfa file. Some tools that I use, such as raven, do not include a summary file. I like gfastats because of how useful it is, but it's missing this one key piece of information that would be immensely useful to me, and perhaps other members of the microbial sequencing community.

Here's the corresponding assembly_info.txt produced by flye for this sequence.

#seq_name  length   cov.  circ.  repeat  mult.  alt_group  graph_path
contig_2   7309750  21    Y      N       1      *          2
contig_1   566306   32    Y      N       1      *          1
contig_8   7902     20    N      Y       1      8          *,8,*
contig_7   7165     6     Y      Y       1      *          7

assembly_graph.gfa.txt

gf777 commented 2 years ago

Hello @erinyoung, thank you very much for your feedback. We are happy to work on this request. I'll give you an update asap

gf777 commented 1 year ago

Hi @erinyoung,

I've implemented the function you requested in v.1.3.6. It counts circular sequences in the summary and provides individual results with --seq-report. Of note, we now have another option, --discover-terminal-overlaps, that will determine perfect terminal overlaps (i.e. such as those of a string graph) in case they are missing.

Let me know what you think.

Best,

Giulio

erinyoung commented 1 year ago

That sounds spectacular! Thank you!

AmayAgrawal commented 1 year ago

Hi, Thanks for including this feature. It's quite useful for people working with microbial sequences.

However, I tested this feature on my data and found that the circularity results from flye and gfastats are not the same. I assembled the sequence using flye, and the flye output says that both contigs in the assembly are circular, which I can also see in the .gfa file generated by flye, but gfastats says that the contigs are not circular. I am attaching all 3 files here: the assembled fasta file (assembly.fasta.txt, because github didn't allow files with a .fasta extension), the assembly info from flye, and the assembly info from gfastats (generated by running this command: gfastats assembly.fasta --seq-report -t > assembly_info_gfastats.txt). If I understood correctly, the last column in 'assembly_info_gfastats.txt' represents the circularity. Let me know if I am misunderstanding anything or missing a parameter when running it.

assembly_info_gfastats.txt assembly_info_flye.txt assembly.fasta.txt

gf777 commented 1 year ago

Hi @AmayAgrawal

Thanks for reaching out. Please note that this is not how this is supposed to work. A FASTA file is by definition a linear sequence. A GFA instead can represent circularity as an edge connecting start and end in the graph. Feed the GFA to gfastats and it should be able to tell you that the sequences are indeed circular.
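
To make the point about edges concrete, here is a minimal Python sketch of how circularity can be read off a GFA. It is not how gfastats does it internally. It assumes GFA1-style tab-separated link lines (L, from, from orientation, to, to orientation, overlap) and treats any segment linked to itself in the same orientation as circular. The function name circular_segments and the example records are made up for illustration.

```python
def circular_segments(gfa_text):
    """Return the names of segments whose end is linked back to their own start."""
    circular = set()
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        # GFA1 link line: L <from> <from_orient> <to> <to_orient> <overlap>
        if len(fields) >= 5 and fields[0] == "L":
            src, src_orient, dst, dst_orient = fields[1:5]
            # A self-link with matching orientations joins the segment's end to its start.
            if src == dst and src_orient == dst_orient:
                circular.add(src)
    return circular


# Hypothetical two-segment graph: contig_1 loops onto itself, contig_2 does not.
example_gfa = (
    "H\tVN:Z:1.0\n"
    "S\tcontig_1\tACGT\n"
    "S\tcontig_2\tGGCC\n"
    "L\tcontig_1\t+\tcontig_1\t+\t0M\n"
    "L\tcontig_1\t+\tcontig_2\t+\t0M\n"
)
print(circular_segments(example_gfa))
```

A FASTA record has no equivalent of the L line, which is why the same contigs appear linear when only the FASTA is supplied. In a more tangled graph a self-link can also come from a repeat, so this check is a simplification.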

Note from the discussion above that we indeed introduced an option that will try to detect perfect overlaps of a certain length in FASTA files, but this needs to be specified with the --discover-terminal-overlaps N option. In the case of your file the overlaps are not perfect, so it won't work (already tried), but you can see the result by setting, say, N = 1.
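
As a rough illustration of what a "perfect terminal overlap of length N" means for a single sequence, the Python sketch below checks whether the first N bases exactly match the last N bases. This is only an assumption about the general idea, not the actual logic behind --discover-terminal-overlaps. The function name has_terminal_overlap and the toy sequences are hypothetical.

```python
def has_terminal_overlap(seq, n):
    """Return True if the first n bases of seq exactly equal its last n bases."""
    # Require a proper prefix/suffix; n == len(seq) would match trivially.
    if n <= 0 or n >= len(seq):
        return False
    seq = seq.upper()
    return seq[:n] == seq[-n:]


# Toy sequences, not taken from the attached assembly.
print(has_terminal_overlap("ACGTTTTACG", 3))  # prefix ACG, suffix ACG
print(has_terminal_overlap("ACGTTTTACC", 3))  # prefix ACG, suffix ACC
print(has_terminal_overlap("ACGTTTTACA", 1))  # prefix A, suffix A
```

Under this simple model a single mismatch inside the window is enough to make the check fail, which matches the remark that imperfect overlaps are not detected. With N = 1 one matching base suffices, so a hit at that setting says little about real circularity.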

Btw I noticed that the header for circularity in the report was missing (hence your doubt, sorry about that). Fixed in the latest commit :-)