holstegelab / otter

Generic targeted local assembler and genotyper for long-read data
MIT License
2 stars 0 forks source link

Failed to run the command #1

Open jiadong324 opened 3 weeks ago

jiadong324 commented 3 weeks ago

Dear author,

Thanks for developing this tool!

I am testing a few regions and found this strange results. This TRF annotated region has a 52bp INS, but it is not shown in the otter local assembly. I just extend the insertion breakpoint 500bp on both sides to create my BED file.

image
ansalaza commented 3 weeks ago

Hi,

Thanks for reaching out!

In short, it has to do with the fact that otter currently uses a distance-based approach to partition reads to separate haplotypes. So at some point if the sequences are large and the differences are small (e.g. low-sequence dissimilarity), they will be merge into a single haplotype.

The solution is to try to limit your target to your region of interest (e.g. TR annotation) or an SV-breakpoint with some (minimal) added flanking sequence.

To be concrete (and recreate your situation), I've attached an example dataset with a BAM file containing ONT simplex reads from HG002 for the region specified in your IGV screenshot. I think this genome has the same 52 INS in the TR you are interested.

If you take a look at the files, you'll find a BED file with two annotations: 1). chr20:16500753-16501809 (The region specified in the IGV screenshot) 2). chr20:16501261-16501374 (The TR as annotated at the UCSC Genome Browser, which is contained within the above annotation)

If you run otter with the default command, you should see that it reports heterozygous allele-sequences for each annotation, reflecting the 52 INS event: otter assemble -b tr.bed -R test hg002_ont_simplex.example.bam

If you run otter by extending each flanking sequence by 31 bp, you still get the same result: otter assemble -b tr.bed -R test -o 31 hg002_ont_simplex.example.bam

However, if you run otter by extending both flanking sequences with 500 bp, you'll see that the first annotation is collapsed into a single allele-sequence, while the TR annotation remains separated into two allele-sequences. otter assemble -b tr.bed -R test -o 500 hg002_ont_simplex.example.bam

Not sure what is the ONT chemistry of NA12878 data you are using for your sample, but from the IGV screenshot that you shared you can also see the the breakpoint of the 52 bp INS event is scattered within the TR annotation. So using an existing TR annotations and running otter on them should help with TR-calling for your sample.

Hope this answers your question! example_set_20241029.tar.gz