parklab / bamsnap

MIT License
113 stars 25 forks source link

IndexError: string index out of range #32

Open shin0727 opened 1 year ago

shin0727 commented 1 year ago

command : bamsnap -bam ./F1_dad.bam -ref new_GCA_024713975.2_ASM2471397v2_genomic.fna -ref_index_rebuild -pos CM045671.1:1-31859138

I got bam file for "olive flounder" species, and then names of the chromosomes are "CM045671.1", "CM045672.1" ... As I wanted to screen shot the alignment image for chromosome "CM045671.1", I set the position like above.

And then, the error occured: /home/jwshin0727/miniconda3/lib/python3.10/site-packages/pyfaidx-0.7.2.1-py3.10.egg/pyfaidx/init.py:523: RuntimeWarning: Index file /home/jwshin0727/CNU/Reference/Chinese/new_GCA_024713975.2_ASM2471397v2_genomic.fna.fai is older than FASTA file /home/jwshin0727/CNU/Reference/Chinese/new_GCA_024713975.2_ASM2471397v2_genomic.fna. warnings.warn( Process proc 1: Traceback (most recent call last): File "/home/jwshin0727/miniconda3/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap self.run() File "/home/jwshin0727/miniconda3/lib/python3.10/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/home/jwshin0727/miniconda3/lib/python3.10/site-packages/bamsnap-0.2.19-py3.10.egg/bamsnap/bamsnap.py", line 233, in run_process_drawplot_bamlist refseq = rseq.get_refseq(pos1) File "/home/jwshin0727/miniconda3/lib/python3.10/site-packages/bamsnap-0.2.19-py3.10.egg/bamsnap/bamsnap.py", line 543, in get_refseq refseq = self.get_refseq_from_localfasta(pos1) File "/home/jwshin0727/miniconda3/lib/python3.10/site-packages/bamsnap-0.2.19-py3.10.egg/bamsnap/bamsnap.py", line 592, in get_refseq_from_localfasta refseq[gpos+1] = seq[i] IndexError: string index out of range 2023-05-18 23:02:10,954 : [INFO] Total running time: 0.0 sec

How can I solve the problem??

AlexSCFraser commented 1 year ago

I have this same problem. I investigated the code and it's a problem with the function that loads ref data from the fasta file. Something is wrong with the position indexing causing the number of bases to be shorter than the range that's trying to read those bases. The data loader seems to add a configurable margin and well as an additional 500 bases on each side of the range. I'm not really sure why, maybe for visualization reasons? Anyway, i tried debugging it and mostly just got confused because it loads 599 bases when I use a range of 1-1700, but it only loads 1 base with a range of 1-581 and 19 bases with a range of 581-1700. Which means that somehow not only is the program running into indexing errors by making the range larger than the number of bases, but when I try to narrow the range it just loads barely any data at all, which is incredibly confusing. Normally I would expect the number of bases in range 1:1700 to equal the number of bases in any split of the range, so I simply don't understand how indexing is supposed to work in for this program. I think it's some kind of user error on the indexing, but the documentation doesn't really explain how to make this work. I get the sense the program was designed primarily for downloading reference data from a database so that the overlap margin data exists and this is causing weird out of range error behaviour for a full "chromosome" interval where you want to visualize the entire chromosome interval because your reference fasta database is made up of single genes (which is my use case).

mewmewyahhey commented 2 months ago

I met this problem too and I worked 1 day to solve it... I am gonna list some points that may be helpful if someone needs it.

The first key process is in _option.py, line112-113 (the line may be inaccurate because I have changed the code):

p1['g_spos'] = p1['t_spos'] - int(opt['margin']) 
p1['g_epos'] = p1['t_epos'] + int(opt['margin'])

The code afterward uses arguments 'g_spos' and 'g_epos', so make sure the margin is what you need. However, the _option.py doesn't receive the reference file, so I did not change it and just modified another scripts.

The second key process is in bamsnap.py, line 587:

def get_refseq_from_localfasta(self, pos1):

   spos = pos1['g_spos'] - int(opt['margin'])  - 500
   epos = pos1['g_epos'] + int(opt['margin'])  + 501
   seq = (self.get_refseq_from_fasta(pos1['chrom'], spos, epos, self.opt['ref_index_rebuild']))
   i = 0
   refseq = {}
   for gpos in range(spos, epos):
       refseq[gpos+1] = seq[I]
       i += 1
   return refseq, chrom_len

The code expands the range of selection for about 500 bp and 2 times of margin length. If you wanna show the whole region, it will tried to extract this longer range from your reference and caused error. My solution is to limit the spos and epos, just like:

spos = max(0, pos1['g_spos'] - 500)

However, it is easy to limit the spos but difficult for epos, because we don't know the max length of reference. So I changed the code of next function:

    def get_refseq_from_fasta(self, chrom, spos, epos, rebuild_index=False):
        f = self.fasta
        fastachrommap = {}
        for c1 in list(f.keys()):
            arr = c1.split(' ')
            tchrom = arr[0]
            fastachrommap[tchrom] = c1
        chrom_seq = f[fastachrommap[chrom]]
        chrom_len = len(chrom_seq)

        if spos < 0:
            spos = 0
        if epos >= chrom_len:
            epos = chrom_len - 1
        ref_seq = chrom_seq[spos:epos + 1]

        return str(ref_seq), chrom_len

So that it can return the chromosome length, which can be used to limit the epos. The final code is like:

    def get_refseq_from_localfasta(self, pos1):

        spos = max(0, pos1['g_spos'] - 500)
        epos = pos1['g_epos'] + 501
        seq, chrom_len = (self.get_refseq_from_fasta(pos1['chrom'], spos, epos, self.opt['ref_index_rebuild']))
        epos = min(chrom_len, epos)
        i = 0
        refseq = {}
        for gpos in range(spos, epos):
            if i >= len(seq):
                break
            refseq[gpos+1] = seq[i]
            i += 1
        return refseq, chrom_len

In this way, we can solve the problem in code "refseq[gpos+1] = seq[I]". However the change of spos and epos will cause some other problems. Sorry for I forget the exact error, and I will just paste the edited code below: line 293 in drawreadset.py:

    def calculate_readmap(self, is_strand_group=False):
        group_list = ['all']
        if is_strand_group:
            group_list.extend(self.STRAND_GROUP_LIST)

        for group in group_list:
            self.max_cov[group] = 0

        start_pos = max(0, self.g_spos - self.read_gap_w - 500)
        end_pos = self.g_epos + 500

        for a in self.samAlign.fetch(self.chrom, start_pos, end_pos):
            if len(a.positions) > 0:
                rid = self.get_rid(a)
                self.update_ref_seq_with_read(a)
                if not self.is_exist_read(rid):
                    r = DrawRead(a, self.refseq)
                    r.id = rid
                    r.read_gap_h = self.read_gap_h
                    r.read_gap_w = self.read_gap_w
                    r.panel_g_spos = self.g_spos
                    r.opt = self.opt
                    self.readset[rid] = r
                else:
                    r = self.readset[rid]

and line 27 in basetrack.py:

def draw(self, dr):
        y = self.h - 1
        y1 = 0
        y2 = y
        x1 = 0
        x2 = self.w

        yi = int( y / 2 ) - 5
        h = 10
        dr.line([(x1, 0), (x2, 0)], fill=COLOR['SEPARATE'], width=1)
        dr.line([(x1, y), (x2, y)], fill=COLOR['SEPARATE'], width=1)
        vcls = ""
        i0 = max(1, (1 - self.spos))
        for i in range(i0, self.g_len):
            if i > self.chrom_len - self.spos:
                break
            else:
                posi = self.spos + i
                base = self.refseq[posi]
                base = base.upper()

The trouble is, code above used an argument "self.chrom_len" which is not defined in the raw code. It was quoted in bamsnap.py. It originated from the function : get_refseq_from_fasta(), and was transported among several functions. It is hard to track it secondly, so if you want to solve this problem, work on yourself. Hope these work can help you. The project is fantastic, and just needs a little improvement. Thanks to the author.