parklab / bamsnap

IndexError: string index out of range #32

Open shin0727 opened 1 year ago

shin0727 commented 1 year ago

Command: bamsnap -bam ./F1_dad.bam -ref new_GCA_024713975.2_ASM2471397v2_genomic.fna -ref_index_rebuild -pos CM045671.1:1-31859138

I have a BAM file for the olive flounder, whose chromosomes are named "CM045671.1", "CM045672.1", and so on. I wanted to capture an alignment image for chromosome "CM045671.1", so I set the position as shown above.

Then the following error occurred:

/home/jwshin0727/miniconda3/lib/python3.10/site-packages/pyfaidx-0.7.2.1-py3.10.egg/pyfaidx/__init__.py:523: RuntimeWarning: Index file /home/jwshin0727/CNU/Reference/Chinese/new_GCA_024713975.2_ASM2471397v2_genomic.fna.fai is older than FASTA file /home/jwshin0727/CNU/Reference/Chinese/new_GCA_024713975.2_ASM2471397v2_genomic.fna.
  warnings.warn(
Process proc 1:
Traceback (most recent call last):
  File "/home/jwshin0727/miniconda3/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/jwshin0727/miniconda3/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jwshin0727/miniconda3/lib/python3.10/site-packages/bamsnap-0.2.19-py3.10.egg/bamsnap/bamsnap.py", line 233, in run_process_drawplot_bamlist
    refseq = rseq.get_refseq(pos1)
  File "/home/jwshin0727/miniconda3/lib/python3.10/site-packages/bamsnap-0.2.19-py3.10.egg/bamsnap/bamsnap.py", line 543, in get_refseq
    refseq = self.get_refseq_from_localfasta(pos1)
  File "/home/jwshin0727/miniconda3/lib/python3.10/site-packages/bamsnap-0.2.19-py3.10.egg/bamsnap/bamsnap.py", line 592, in get_refseq_from_localfasta
    refseq[gpos+1] = seq[i]
IndexError: string index out of range
2023-05-18 23:02:10,954 : [INFO] Total running time: 0.0 sec

How can I solve the problem??
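
One way to narrow this down is to check the requested region against the FASTA index before running bamsnap. The sketch below is not part of bamsnap. It assumes the standard samtools-style .fai layout (tab-separated, contig name in column 1, length in column 2) and a region written as chrom:start-end with 1-based inclusive coordinates. The helper names (read_fai_lengths, parse_region, check_region, index_is_stale) are hypothetical.

```python
import os


def read_fai_lengths(fai_path):
    """Return {contig_name: length} read from a samtools-style .fai file."""
    lengths = {}
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                lengths[fields[0]] = int(fields[1])
    return lengths


def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end)."""
    # rpartition keeps any earlier ':' inside the contig name intact.
    chrom, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)


def check_region(region, lengths):
    """Return 'ok' or a short description of why the region is invalid."""
    chrom, start, end = parse_region(region)
    if chrom not in lengths:
        return f"{chrom} is not in the index"
    if start < 1 or start > end or end > lengths[chrom]:
        return f"{region} falls outside 1-{lengths[chrom]}"
    return "ok"


def index_is_stale(fasta_path, fai_path):
    """True when the .fai is older than the FASTA (the RuntimeWarning case)."""
    return os.path.getmtime(fai_path) < os.path.getmtime(fasta_path)
```

If check_region reports that the end coordinate exceeds the contig length, or index_is_stale returns True, the index and the FASTA may be out of sync. Deleting the .fai so that it is regenerated would be a reasonable first thing to try, though that is a guess and not a confirmed fix.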

AlexSCFraser commented 1 year ago

I have the same problem. I investigated the code, and the problem is in the function that loads reference data from the FASTA file. Something is wrong with the position indexing, so the loaded sequence ends up shorter than the range the code then tries to read. The data loader seems to add a configurable margin as well as an additional 500 bases on each side of the range. I'm not sure why; maybe for visualization reasons?

I tried debugging it and mostly got confused: it loads 599 bases when I use a range of 1-1700, but only 1 base with a range of 1-581 and 19 bases with a range of 581-1700. So not only does the program hit indexing errors by making the range larger than the number of bases, but narrowing the range makes it load barely any data at all. I would expect the number of bases in 1-1700 to equal the total across any split of that range, so I don't understand how indexing is supposed to work in this program.

It may be user error on the indexing, but the documentation doesn't explain how to make this work. My sense is that the program was designed primarily for downloading reference data from a database, where the sequence for the overlap margin always exists. That would explain the out-of-range errors for a full "chromosome" interval, where you want to visualize the entire sequence because your reference FASTA is made up of single genes (which is my use case).
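
The failure mode described above can be reproduced in isolation. The sketch below is not bamsnap's code: only the statement refseq[gpos+1] = seq[i] is known from the traceback, and everything else (function names, the pad parameter, the 1-based window convention) is an assumption made for illustration. It shows how a loop that walks the full padded window raises IndexError when the fetched sequence is shorter than that window, and how clamping the window to the contig length avoids it.

```python
def padded_window(start, end, pad, contig_len):
    """Widen [start, end] by pad on each side, clamped to 1..contig_len."""
    return max(1, start - pad), min(contig_len, end + pad)


def fill_unclamped(seq, start, end):
    """Assumed failing pattern: index seq once per position in the window."""
    refseq = {}
    for i, gpos in enumerate(range(start, end + 1)):
        # Raises IndexError as soon as i reaches len(seq).
        refseq[gpos] = seq[i]
    return refseq


def fill_clamped(seq, start, end):
    """Same loop, but stops when the fetched sequence runs out."""
    refseq = {}
    for i, gpos in enumerate(range(start, end + 1)):
        if i >= len(seq):
            break
        refseq[gpos] = seq[i]
    return refseq


if __name__ == "__main__":
    contig = "ACGTACGTAC"  # toy 10-base contig
    try:
        # Window 1-15 is longer than the 10 bases actually available.
        fill_unclamped(contig, 1, 15)
    except IndexError as exc:
        print("unclamped:", exc)
    lo, hi = padded_window(1, 10, pad=5, contig_len=len(contig))
    print("clamped window:", lo, hi)
    print("bases loaded:", len(fill_clamped(contig, lo, hi)))
```

If the real loader behaves like fill_unclamped, a whole-contig request such as 1-1700 on a 1700-base reference would overrun as soon as any margin is added past the end. Whether that matches what get_refseq_from_localfasta actually does would need to be confirmed against the source.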