vgl-hub / gfastats

A single fast and exhaustive tool for summary statistics and simultaneous *fa* (fasta, fastq, gfa [.gz]) genome assembly file manipulation.
MIT License
89 stars 8 forks source link

gfastats scaffold circularity #48

Closed k-krakowski closed 4 months ago

k-krakowski commented 4 months ago

I am writing to inquire about the possibility of obtaining scaffold circularity information from a .gfa file and incorporating it into the .fasta file headers. I have discovered that the --seq-report option enables me to ascertain the circularity of segments and find contigs of the same length in .fasta file. However, I am unsure how to proceed when there are two or more contigs with the same length (the IDs of contigs and segments are different, so I can't simply map that information). I would be grateful if you could clarify this matter. Thank you in advance for your assistance and for this software.

gf777 commented 4 months ago

Hi @k-krakowski Thank you for bringing this up. This is a very reasonable question. The function was actually implemented as part of this issue https://github.com/vgl-hub/gfastats/issues/32 and resolved by looking for segments that have self-overlaps. It seems to me that in your case it would make sense to ask if GFA paths have a start-end overlap, that is, there is an edge between the terminal segment and the first segment in the path. Would this work for your case? Generally speaking it would help if you could come up with a minimally working example of what you are trying to achieve All the best

k-krakowski commented 4 months ago

I regret that I am unable to share .gfa files prior to submission to NCBI, besides they are very heavy. So I add a little more details. With this command: gfastats assembly_graph_with_scaffolds.gfa -o output.fasta --discover-terminal-overlaps 500 --seq-report stdout has a lot of information such as:

78 1307107 36732 6076 12098 11775 6783 64.99 0 N 79 1307169 251919 39722 85719 86944 39534 68.54 0 Y 80 1307187 80660 12999 26946 27547 13168 67.56 0 N 81 1307199 6206 1263 1813 1785 1345 57.98 0 N ... 103 1317560 5340 1079 1634 1303 1324 55.00 0 N 104 1324916 121116 18811 41570 42151 18584 69.12 0 N 105 1326914 199845 31861 68260 68268 31456 68.32 0 Y 106 1328344 149 32 33 53 31 57.72 0 N 107 1328348 253 44 58 82 69 55.34 0 N

There are 139 segments, 8 of them are circular. In the output.fasta there are 27 scaffolds. I found scaffolds with the same length as circular segments. The example headers:

>NODE_2length\251919_cov_51.884337_1 >NODE_3length\199845_cov_49.367728_1

So, I thought that they can be considered as circular and annotation like "circular=true", similar to that unicycler places in .fasta headers, can be added to output.fasta file. I think this is what @gf777 mentioned - "GFA paths start-end overlap". Is it possible to achieve this with gfastats? Regards

gf777 commented 4 months ago

Hi @k-krakowski Sorry for the slow feedback. I think this should be doable. I'll push a fix hopefully later today

gf777 commented 4 months ago

Hi @k-krakowski I think I should have implemented what you are after. In the latest commit there are now two distinct flags --segment-report (originally --seq-report) and --path-report, which now reports whether a path looks circular as per the above definition. Hope this helps. Lmk