nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com

convey gaps in consensus sequence via a designated character, e.g. N #377

Closed tkonopka closed 1 year ago

tkonopka commented 1 year ago

Addresses #348

Introduces option to convey gaps in consensus sequences via a designated character, e.g. 'N'.

Example

Consider the following two contigs aligned to a reference:

reference           ACACCGCGGTGTTATA
contigs             ACAC    GTGC    

The default behavior for medaka_consensus is to use information from contigs where available, and fill gaps by copying content from the reference. Alternatively, the -g option splits the consensus into separate pieces.

This PR adds the ability to produce a consensus as in the default mode, but with gaps filled by a designated character such as 'N' instead of reference content.

reference           ACACCGCGGTGTTATA
contigs             ACAC    GTGC    
consensus (default) ACACCGCGGTGCTATA
consensus (-g)      ACAC
                            GTGC
consensus (-r N)    ACACNNNNGTGCNNNN

The new scheme makes explicit which parts of the consensus are based on data, and which parts would otherwise be filled in from prior knowledge (the reference).
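
The three behaviors in the example can be illustrated with a toy sketch. This is not medaka's implementation: the names `build_consensus` and `data_supported_intervals` are hypothetical, contigs are assumed to be given as (start, sequence) pairs on reference coordinates with no insertions or deletions, and the fill character is assumed not to occur in the contigs themselves.

```python
import re


def build_consensus(reference, contigs, mode="default", fill_char="N"):
    """Toy model of the three gap-handling modes (hypothetical helper).

    contigs: list of (start, sequence) pairs in reference coordinates.
    mode: "default" copies the reference into gaps, "split" returns the
    pieces separately (like -g), "fill" writes fill_char into gaps (like -r).
    """
    if mode == "split":
        return [seq for _, seq in sorted(contigs)]
    if mode == "default":
        out = list(reference)
    elif mode == "fill":
        out = [fill_char] * len(reference)
    else:
        raise ValueError("unknown mode: " + mode)
    # Overwrite the covered positions with the contig content.
    for start, seq in contigs:
        out[start:start + len(seq)] = seq
    return "".join(out)


def data_supported_intervals(consensus, fill_char="N"):
    """Return half-open (start, end) intervals that are not fill_char."""
    pattern = "[^" + re.escape(fill_char) + "]+"
    return [m.span() for m in re.finditer(pattern, consensus)]


reference = "ACACCGCGGTGTTATA"
contigs = [(0, "ACAC"), (8, "GTGC")]

print(build_consensus(reference, contigs, "default"))
print(build_consensus(reference, contigs, "split"))
filled = build_consensus(reference, contigs, "fill", fill_char="N")
print(filled)
print(data_supported_intervals(filled))
```

With the filled consensus, the data-supported regions can be recovered from the sequence alone, which is not possible with the default output.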

Command line interface

The shell commands to produce the above results are:

medaka_consensus -i reads.fa -d reference.fa          # default behavior
medaka_consensus -i reads.fa -d reference.fa -g       # split into separate pieces
medaka_consensus -i reads.fa -d reference.fa -r N     # fill gaps with N

The PR also introduces a new option --fill_char to medaka stitch.

mwykes commented 1 year ago

Thanks for the contribution! I'll take care of merging it into our internal repo and getting your commit into the next release (when it will also be pushed to github medaka master).