sanger-pathogens / circlator

A tool to circularize genome assemblies
http://sanger-pathogens.github.io/circlator/
Other
229 stars 58 forks source link

fixstart: no suitable promer matches found (due to deletion in dnaA sequence of assembler output) #78

Open 0xaf1f opened 7 years ago

0xaf1f commented 7 years ago

I've hit what's probably a rare corner case. Circularization completed successfully, but the start position of the chromosome was not set properly. I checked the logs for fixstart and this is what I see:

==> 06.fixstart.detailed.log <==
No suitable promer matches found
Using the following prodigal predictions to circularize contigs:
scf7180000000004|quiver Prodigal_v2.6.1 CDS 2205287 2207290 249.2   +   0   ID=1_2144;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.678;conf=99.99;score=249.24;cscore=242.85;sscore=6.39;rscore=1.96;uscore=0.46;tscore=3.96;

==> 06.fixstart.log <==
[fixstart]  id  break_point gene_name   gene_reversed   new_name    skipped
[fixstart]  scf7180000000004|quiver 2205287 prodigal    no  -   -

When I BLAST my organism's reference dnaA sequence against my assembly, I notice that my assembly's version is one base shorter. I suspect that there's a frameshift deletion in the assembly that mangles the translated protein sequence, preventing promer from making any matches. The deletion may be false and buff out during assembly polishing, but it still caused trouble here.

martinghunt commented 7 years ago

Thanks, yes we know that this can happen. I could add in an option to not require a full length match to the reference gene? I'm thinking require the start of the gene to match, as I'd still want to have the assembly start fixed to be the methionine at the start of the gene. But relax the requirement to have a promer match all the way to the end of the gene.

Would that be helpful to you?

0xaf1f commented 7 years ago

I'm not sure it would. My 06.fixstart.promer.promer look like this:

PROMER

[S1]    [E1]    [S2]    [E2]    [LEN 1] [LEN 2] [% IDY] [% SIM] [% STP] [LEN R] [LEN Q] [FRM]   [TAGS]
3294129 3294689 1524    964 561 561 98.93   98.93   1.07    4414018 1524    3   -1  scf7180000000004|quiver gi|448814763:1-1524 
3294130 3294681 1523    972 552 552 100.00  100.00  2.17    4414018 1524    1   -2  scf7180000000004|quiver gi|448814763:1-1524 
3294131 3294694 1522    959 564 564 98.40   99.47   5.32    4414018 1524    2   -3  scf7180000000004|quiver gi|448814763:1-1524 
3294675 3295649 977 3   975 975 99.69   99.69   2.46    4414018 1524    3   -2  scf7180000000004|quiver gi|448814763:1-1524 
3294676 3295650 976 2   975 975 99.69   100.00  7.08    4414018 1524    1   -3  scf7180000000004|quiver gi|448814763:1-1524 
3294679 3294131 974 1522    549 549 100.00  100.00  2.19    4414018 1524    -1  2   scf7180000000004|quiver gi|448814763:1-1524 
3294680 3295651 972 1   972 972 100.00  100.00  1.23    4414018 1524    2   -1  scf7180000000004|quiver gi|448814763:1-1524 
3294680 3294129 973 1524    552 552 100.00  100.00  0.54    4414018 1524    -3  1   scf7180000000004|quiver gi|448814763:1-1524 
3294759 3294130 888 1523    630 636 89.15   91.51   3.54    4414018 1524    -2  3   scf7180000000004|quiver gi|448814763:1-1524 
3295649 3294678 3   974 972 972 100.00  100.00  4.94    4414018 1524    -3  3   scf7180000000004|quiver gi|448814763:1-1524 
3295650 3294676 2   976 975 975 99.69   99.69   1.85    4414018 1524    -2  2   scf7180000000004|quiver gi|448814763:1-1524 
3295651 3294677 1   975 975 975 100.00  100.00  0.00    4414018 1524    -1  1   scf7180000000004|quiver gi|448814763:1-1524

I think it would be difficult to get a good cutoff. Also, there would still be the problem of a frameshift close to the beginning of the gene that would almost entirely change the amino acid sequence.

I already fixed this particular case using my pre-circlator script, which used blastn to find the start position and sequence orientation. A nucleotide search (rather than an amino acid search) makes more sense to me, but this would significantly change your algorithm. Then the sequence position that corresponds to the first base of the start codon (even if it's a mismatch) could be set to the start position.