marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

circular contigs and contigs.gfa #1551

Closed (GDKO closed 4 years ago)

GDKO commented 4 years ago

Hi,

Now that contigs.gfa has been removed, is there another file that gives the CIGAR string of the self-alignment, so I can identify trimming points for circular contigs?

skoren commented 4 years ago

Ah, good point, you'll have to rely on the unitig.gfa file. The bed file will give you a translation from contig to unitig and the gfa will tell you the alignment of the ends. Perhaps we should keep just the circular cigar in the contigs gfa or, better yet, we should trim the circular contigs ourselves when they're output.

brianwalenz commented 4 years ago

The raw data is still there, and you can manually compute the contigs.gfa output.

The GFA outputs are created by unitigging/4-unitigger/alignGFA.sh. To make contigs.gfa, run the first command in that script, replacing 'utgStore' with 'ctgStore' and 'unitigs' with 'contigs':

  $bin/alignGFA \
    -T ../prefix.ctgStore 2 \
    -i ./prefix.contigs.gfa \
    -o ./prefix.contigs.aligned.gfa \
    -t 4 \
  > ./prefix.contigs.aligned.gfa.err 2>&1

You can speed this up greatly by editing the input 'prefix.contigs.gfa' so it includes only the circular contigs of interest, i.e., only keep LINK lines where the FROM and TO fields are the same contig.
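
The filtering step described above can be sketched with awk. This is only a sketch, assuming the input is a tab-separated GFA 1 file in which link records have the form `L <from> <fromOrient> <to> <toOrient> <overlap>`; the helper name `filter_self_links` and the output file name are hypothetical and not part of canu.

```shell
# Keep every non-link line, plus link lines whose FROM and TO segments match.
# ASSUMPTION: tab-separated GFA 1 records, with links written as
#   L <from> <fromOrient> <to> <toOrient> <overlap>
# filter_self_links is a hypothetical helper name, not part of canu.
filter_self_links() {
  # Appending "" forces a string comparison of the two segment names.
  awk -F'\t' '$1 != "L" || ($2 "") == ($4 "")' "$1"
}

# Example with hypothetical file names:
# filter_self_links prefix.contigs.gfa > prefix.contigs.circular.gfa
```

Header and segment lines pass through unchanged, so the filtered file should still be a complete GFA. Only the links between different contigs are dropped.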

GDKO commented 4 years ago

I use canu as part of a pipeline to assemble mitochondrial genomes and was using contigs.gfa and nucmer coords to effectively trim the ends. While it is very useful to be able to calculate the file as suggested, wouldn't it be easier to have it generated automatically for the circular contigs?

Kirk3gaard commented 4 years ago

They have added automated circular trimming as a feature enhancement; see #1230 and https://github.com/marbl/canu/projects/2#card-4243356.

skoren commented 4 years ago

With the 2.1 release, circular contigs are aligned and their trim positions are reported in the fasta def line.
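
For pipelines that need those positions, a rough sketch of pulling them out of the def lines follows. The thread does not spell out the def line format, so this assumes a space-separated field of the form `trim=START-END` after the contig name; check that against your own output before relying on it. The helper name `list_trim_positions` is hypothetical, and the sketch makes no claim about whether the coordinates are 0-based or 1-based.

```shell
# Print "<contig>\t<start>\t<end>" for each FASTA def line carrying a trim field.
# ASSUMPTION: def lines look like
#   >tig00000001 len=... suggestCircular=yes trim=START-END
# list_trim_positions is a hypothetical helper name, not part of canu.
list_trim_positions() {
  awk '/^>/ {
    name = substr($1, 2)                     # contig name without the leading ">"
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^trim=[0-9]+-[0-9]+$/) {
        split(substr($i, 6), pos, "-")       # text after "trim=" is START-END
        print name "\t" pos[1] "\t" pos[2]
      }
    }
  }' "$1"
}

# Example with a hypothetical file name:
# list_trim_positions prefix.contigs.fasta
```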