trgt-paper / tr-solve

A tool to resolve the structure of tandem repeats

What to do with N's? #4

Open hdashnow opened 1 year ago

hdashnow commented 1 year ago

Is there a sensible way to skip over Ns, ideally while still annotating the surrounding sequence?

This is a sample sequence from a hard-masked hg38 ref.

echo TCCAAATGTCCACTTGGAGGCTCTACGAAAAGAATGTTTCANAACTGCTCCCATGAAAGCTATGTTATACTCTGGGAGTTGAACACAAGCCTCACAAAGGAGTTNCAGAGAAATGCTCTGTTNACTTNNTACGTGNAGATATTCCCGTTCCNAAGAGTCTCACAGAGTCCACTATCATTTGCGAGCTAGCAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCGTTTCNACGAATCNTCAGAGAGTCCAAATTCCAGTAGAGTCTACAAAAAGTGTGTTTCAAAAATGCTCCACCCAAAGGAATGTTCAGCTCTGTGAGTTGAACTCAATCATCGCAAAGTATTTTCTGAGAATGCTTCTGTCCAGTTTTTA | tr-solve-v0.2.0-linux_x86_64
thread 'main' panicked at 'Enountered unknown base 78', src/model.rs:216:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

egor-dolzhenko commented 1 year ago

What do you think about replacing Ns with random bases (as a temporary fix)? Longer runs of random bases are very unlikely to look like repeats, so tr-solve will flag them as non-repetitive.
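For reference, a minimal sketch of that temporary fix as a preprocessing step outside tr-solve. The function name `replace_ns_with_random_bases` and the fixed `seed` are illustrative assumptions, not part of tr-solve. The sequence length is unchanged, so coordinates reported on the filled sequence still line up with the original.

```python
import random


def replace_ns_with_random_bases(seq, seed=0):
    """Return seq with every N (or n) replaced by a random base.

    Hypothetical helper for preprocessing; not part of tr-solve.
    A fixed seed keeps the substitution reproducible between runs.
    """
    rng = random.Random(seed)
    return "".join(
        rng.choice("ACGT") if base in "Nn" else base
        for base in seq
    )


masked = "ACGTNNNNACGTNAC"
filled = replace_ns_with_random_bases(masked)
print(filled)
```

Isolated Ns sitting inside a real repeat would still get an arbitrary base, so this may introduce an apparent interruption there; it is only meant as a stopgap.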

hdashnow commented 1 year ago

I'm planning to split the sequence at Ns, process the resulting pieces separately, and then put the results back together. I can do that outside tr-solve for now.
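A rough sketch of what that could look like, assuming the per-piece results can be expressed as `(start, end, label)` intervals. The `annotate` callback is a hypothetical stand-in for running tr-solve on one piece and parsing its output; the actual output format is not assumed here.

```python
import re


def split_at_ns(seq):
    """Yield (offset, segment) for each maximal run of non-N bases."""
    for match in re.finditer(r"[^Nn]+", seq):
        yield match.start(), match.group()


def annotate_around_ns(seq, annotate):
    """Annotate each N-free segment and shift results to seq coordinates.

    annotate is a hypothetical callback: segment -> list of
    (start, end, label) tuples, 0-based and end-exclusive, relative
    to the segment it was given.
    """
    merged = []
    for offset, segment in split_at_ns(seq):
        for start, end, label in annotate(segment):
            merged.append((start + offset, end + offset, label))
    return merged


# Toy callback that labels each whole segment, just to show the offsets.
def whole_segment(segment):
    return [(0, len(segment), "segment")]


result = annotate_around_ns("ACGTNNGGANT", whole_segment)
print(result)  # [(0, 4, 'segment'), (6, 9, 'segment'), (10, 11, 'segment')]
```

One consequence of this design is that a repeat interrupted by an N is reported as two separate intervals, so some merging logic may be needed afterwards.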