ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

Peculiar consensus sequence #151

Closed by andreaswallberg 4 years ago

andreaswallberg commented 5 years ago

Dear @ruanjue ,

I am working on the assembly of a large and complex genome and have noticed some odd motifs in the consensus sequence that may be associated with microsatellites. Specifically, I seem to get microsatellite-associated homopolymer runs in the consensus that are not supported by any mapped ONT long read (I have mapped with both minimap2 and ma, the "modular aligner").

Admittedly, I have not done a systematic scan for these; I have only seen two cases while eyeballing a single contig with samtools tview. By their very nature, these regions consist of low-complexity DNA, but I am still puzzled by the result and wonder whether it should be brought to the attention of you and the other developers. Cheers!

[image: homopolymer1]

[image: homopolymer2]

ruanjue commented 5 years ago

Thanks so much. First, I will add an option to wtpoa-cns that outputs the coordinate mapping between the layout file and the consensus sequences. Then I will ask for your help to run the consensus step again and find the corresponding part of the layout file. Finally, I will debug on that small region of the layout file.

I will reply on this issue once the new option is finished.

Jue

ruanjue commented 5 years ago

Done in https://github.com/ruanjue/wtdbg2/commit/6329a5f0e2635c0b3a2c6db6a9115700089ca5a7

Usage: wtpoa-cns <other-options> -e map.txt

#ctg ctg_off edge edge_full_len edge_off edge_len
ctg1        0           E0          3002        0           2916
ctg1        2916        E1          2785        917         2785
ctg1        4784        E2          2054        1026        1965
ctg1        5723        E3          2040        931         1658
ctg1        6450        E4          2580        627         2168
ctg1        7991        E5          1847        508         1847
ctg1        9330        E6          1575        4           1575
ctg1        10901       E7          2278        1           2278
ctg1        13178       E8          2747        1038        2128
ctg1        14268       E9          2010        461         2010

You can locate the peculiar regions by ctg + ctg_off, then find the problematic edge. E0 is the first layout block starting with E, followed by E1, E2, and so on.
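
The lookup described above could be scripted roughly as follows. This is only a sketch under some assumptions: the function names `load_edge_map` and `find_edge` are made up for illustration, the map file is assumed to be whitespace-separated with the six columns shown in the sample, and the edge of interest is assumed to be the last one whose ctg_off lies at or before the contig position. Neighbouring edges may overlap, so the preceding edge can be worth checking too.

```python
import bisect
from collections import defaultdict


def load_edge_map(lines):
    """Parse rows of: ctg ctg_off edge edge_full_len edge_off edge_len.

    Returns {ctg: [(ctg_off, edge, edge_full_len, edge_off, edge_len), ...]}
    with each list sorted by ctg_off. Header/comment lines start with '#'.
    """
    edges = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ctg, ctg_off, edge, full_len, edge_off, edge_len = line.split()
        edges[ctg].append(
            (int(ctg_off), edge, int(full_len), int(edge_off), int(edge_len))
        )
    for rows in edges.values():
        rows.sort()
    return edges


def find_edge(edges, ctg, pos):
    """Return the name of the last edge starting at or before pos (assumption),
    or None if the contig is unknown or pos precedes the first edge."""
    rows = edges.get(ctg, [])
    starts = [row[0] for row in rows]
    i = bisect.bisect_right(starts, pos) - 1
    return rows[i][1] if i >= 0 else None
```

For example, with `edges = load_edge_map(open("map.txt"))`, calling `find_edge(edges, "ctg1", 5000)` would point at the layout block to inspect for a peculiar region seen at position 5000 of ctg1 in samtools tview.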

Best, Jue