parklab / bamsnap

MIT License
109 stars 23 forks source link

Cant manage to get working with alternative FASTA #13

Open ebioman opened 3 years ago

ebioman commented 3 years ago

I was eager to try your tool and the examples in the repository work well. I tried directly plotting from your examples as well as re-alignment (minimap2) on hg38 with subsequent plotting. In that case both without and with provided fasta sequence worked.

Whenever I try though to use a non-human reference genome, I get an error :

2021-01-18 10:51:57,140 : [INFO] Total running time: 0.0 sec
Process proc 1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/site-packages/bamsnap/bamsnap.py", line 227, in run_process_drawplot_bamlist
    refseq = rseq.get_refseq(pos1)
  File "/usr/local/lib/python3.7/site-packages/bamsnap/bamsnap.py", line 537, in get_refseq
    refseq = self.get_refseq_from_localfasta(pos1)
  File "/usr/local/lib/python3.7/site-packages/bamsnap/bamsnap.py", line 586, in get_refseq_from_localfasta
    refseq[gpos+1] = seq[i]
IndexError: string index out of range

In order to assure that is should work I took really headers such as ">sequence1",">sequence2" and so on and verified that reads mapped actually on provided positions but it always gave the same error. Any hints how to fix that, or plans to add to your test repository some user reference sequences to make it easier to understand what is the proper syntax in that case ?

erinyoung commented 3 years ago

Can I second this issue?

I want to use this to visualize reads of SARS-CoV-2 with reference MN908947.3

The command:

$ bamsnap -ref MN908947.3.fasta -bam test.primertrim.sorted.bam -pos MN908947.3:241 -out test.bamsnap.png 

The error:

2021-01-19 11:16:15,958 : [INFO] Total running time: 0.0 sec
Process proc 1:
Traceback (most recent call last):
  File "/home/eriny/anaconda3/envs/bamsnap/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/eriny/anaconda3/envs/bamsnap/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/eriny/anaconda3/envs/bamsnap/lib/python3.9/site-packages/bamsnap/bamsnap.py", line 227, in run_process_drawplot_bamlist
    refseq = rseq.get_refseq(pos1)
  File "/home/eriny/anaconda3/envs/bamsnap/lib/python3.9/site-packages/bamsnap/bamsnap.py", line 537, in get_refseq
    refseq = self.get_refseq_from_localfasta(pos1)
  File "/home/eriny/anaconda3/envs/bamsnap/lib/python3.9/site-packages/bamsnap/bamsnap.py", line 586, in get_refseq_from_localfasta
    refseq[gpos+1] = seq[i]
IndexError: string index out of range
ebioman commented 3 years ago

Looking at the code, this part seems to be source of the error:

    def get_refseq_from_localfasta(self, pos1):
        spos = pos1['g_spos']-self.opt['margin'] - 500
        epos = pos1['g_epos']+self.opt['margin'] + 1 + 500
        seq = self.get_refseq_from_fasta(pos1['chrom'], spos, epos, self.opt['ref_index_rebuild'])
        i = 0
        refseq = {}
        for gpos in range(spos, epos):
            refseq[gpos+1] = seq[i]
            i += 1
        return refseq

Where we can see already that start and end-position take the provided position and shift by 500bp + the given margin (default 50bp) in order to find anything to plot at all. @erinyoung Might that be the issue in your case, that since you have the position 241bp that start would be 241 - 500 -50bp = -309bp which would be out of range ?

Will have to go back to mine and see whether that applies - certainly not for all which I tested.....

erinyoung commented 3 years ago

This still did not work:

$ bamsnap -ref MN908947.3.fasta -bam test.sorted.bam -pos MN908947.3:1181 -out test.bamsnap.png

2021-02-08 15:47:54,183 : [INFO] Total running time: 0.0 sec
Process proc 1:
Traceback (most recent call last):
  File "/home/eriny/anaconda3/envs/bamsnap/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/eriny/anaconda3/envs/bamsnap/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/eriny/anaconda3/envs/bamsnap/lib/python3.9/site-packages/bamsnap/bamsnap.py", line 229, in run_process_drawplot_bamlist
    imagefname = bsplot.drawplot_bamlist(pos1, image_w, bamlist, xscale, refseq)
  File "/home/eriny/anaconda3/envs/bamsnap/lib/python3.9/site-packages/bamsnap/bamsnap.py", line 475, in drawplot_bamlist
    ia = self.append_geneplot_image(ia, pos1, image_w, xscale)
  File "/home/eriny/anaconda3/envs/bamsnap/lib/python3.9/site-packages/bamsnap/bamsnap.py", line 429, in append_geneplot_image
    geneplot = GenePlot(pos1['chrom'], pos1['g_spos'], pos1['g_epos'], xscale,
  File "/home/eriny/anaconda3/envs/bamsnap/lib/python3.9/site-packages/bamsnap/geneplot.py", line 101, in __init__
    self.load_gene_structure()
  File "/home/eriny/anaconda3/envs/bamsnap/lib/python3.9/site-packages/bamsnap/geneplot.py", line 114, in load_gene_structure
    for rec in self.gene_annot_tb.querys(pos_str):
tabix.TabixError: query failed

I was reading https://github.com/parklab/bamsnap/issues/11#issue-785047461 and finally got bamsnap working for me:

bamsnap -draw coordinates bamplot coverage base \
     -ref MN908947.3.fasta -bam test.sorted.bam -pos MN908947.3:1181 -out test.bamsnap.png

I guess I just need to be wary of the edges.

tkonopka commented 3 years ago

Encountered similar issues when drawing short chromosomes and also near chromosome edges. I made some edits in a branch in a fork. (branch cli_margin)

An example synthetic fasta sequence and synthetic alignment:

>random_small
GTATGATACCTAGATAATAAAGGCTGCTAGGAGCCCTTTAACAGAGGACCTATACGTAAG
GCACATGAAAAATGAGTTAGCGGCAGACCTATGTTGTAAGCTAGCTCGCGGAGTAAACGT
@SQ SN:random_small LN:120
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -o example.sam small.fa small.fastq
random_small-1  0   random_small    1   60  60M *   0   0   GTATGATACCTAGATAATAAAGGCTGCTAGGAGCCCTTTAACAGAGGACCTATACGTAAG    ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss    NM:i:0MD:Z:60   AS:i:60 XS:i:0
random_small-2  0   random_small    61  60  60M *   0   0   GCACATGAAAAATGAGTTAGCGGCAGACCTATGTTGTAAGCTAGCTCGCGGAGTAAACGT    ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss    NM:i:0MD:Z:60   AS:i:60 XS:i:0

The current version gives IndexError on this data. In the edits, setting -pos random_small:1-120 (entire chromosome) and -margin 0, it becomes possible to draw the chromosome edge-to-edge.

entire-chromosome

When -margin is set to nonzero and would cause the genomic region off the chromosome boundaries, the alignment becomes left-justified and leaves empty space on the right.

margin-outside

The examples on readthedocs also seem to generate images (sometimes with one base different on the right). Does this seem reasonable?

@danielmsk - I'd be happy to send a pull request with this. There is potential these changes may break other things that I may not be aware of. It would be great to get your input.

gtrichard commented 1 year ago

This is still an issue and really undermines bamsnap scope and usefulness.

lskatz commented 1 year ago

We added 1000 to the start and subtracted 1000 from the end position and I really really hate that this worked.