mroosmalen / nanosv

SV caller for nanopore data
MIT License

Error while trying to call snps " UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 948: ordinal not in range(128)" #47

Closed herrroaa closed 5 years ago

herrroaa commented 6 years ago

Hi, I am trying to call SNPs from a BAM file generated by minimap2, but I got an error.

NanoSV BC01_sorted.bam -o BC01.vcf

I got this error message: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 948: ordinal not in range(128)

Here is the complete output

Traceback (most recent call last):
  File "/Users/tarekmagdyshehatamohamed/miniconda3/envs/bioinfo/bin/NanoSV", line 7, in <module>
    from nanosv.NanoSV import main
  File "/Users/tarekmagdyshehatamohamed/miniconda3/envs/bioinfo/lib/python3.6/site-packages/nanosv/__init__.py", line 1, in <module>
    from .utils import *
  File "/Users/tarekmagdyshehatamohamed/miniconda3/envs/bioinfo/lib/python3.6/site-packages/nanosv/utils/__init__.py", line 1, in <module>
    from . import coverage
  File "/Users/tarekmagdyshehatamohamed/miniconda3/envs/bioinfo/lib/python3.6/site-packages/nanosv/utils/coverage.py", line 8, in <module>
    import NanoSV
  File "/Users/tarekmagdyshehatamohamed/miniconda3/envs/bioinfo/lib/python3.6/site-packages/nanosv/utils/../NanoSV.py", line 29, in <module>
    cfg.read(args.config)
  File "/Users/tarekmagdyshehatamohamed/miniconda3/envs/bioinfo/lib/python3.6/configparser.py", line 697, in read
    self._read(fp, filename)
  File "/Users/tarekmagdyshehatamohamed/miniconda3/envs/bioinfo/lib/python3.6/configparser.py", line 1015, in _read
    for lineno, line in enumerate(fp, start=1):
  File "/Users/tarekmagdyshehatamohamed/miniconda3/envs/bioinfo/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 948: ordinal not in range(128)

jaesvi commented 6 years ago

See #45. Edit: copying the answer from @mroosmalen here as well:

This is caused by the locale encoding. If you use bash as your shell, you can put these lines in your ~/.bashrc and ~/.profile files:

export LC_CTYPE=en_US.UTF-8
export LANG=en_US.UTF-8
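For context, byte 0xe2 is the first byte of many three-byte UTF-8 sequences, such as a typographic apostrophe. A minimal Python sketch of the failure follows; the sample line is made up for illustration and is not taken from the NanoSV config file.

```python
# Hypothetical config line containing a typographic apostrophe (U+2019),
# which UTF-8 encodes as the three bytes e2 80 99.
raw = "# NanoSV\u2019s settings\n".encode("utf-8")


def try_decode(data, encoding):
    """Decode bytes, or describe the first byte that cannot be decoded."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        bad = data[err.start]
        return f"failed at position {err.start}: byte 0x{bad:02x} ({err.reason})"


# ascii() keeps the printed result readable in any terminal locale.
print(ascii(try_decode(raw, "ascii")))  # decoding as an ASCII locale would
print(ascii(try_decode(raw, "utf-8")))  # decoding as a UTF-8 locale would
```

With a UTF-8 locale exported, Python's default text encoding for open() normally becomes UTF-8, which is why the export lines are suggested.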

herrroaa commented 6 years ago

Thanks for the reply. I added the two lines to my bash files, but it did not work.

jaesvi commented 6 years ago

Did you source your .bashrc or restart your terminal?

herrroaa commented 6 years ago

I restarted my terminal

herrroaa commented 6 years ago

Any other suggestions?

mroosmalen commented 6 years ago

Maybe you can try creating the following bash script (e.g. nanosv.sh):

#!/usr/bin/env bash

export LC_CTYPE=en_US.UTF-8
export LANG=en_US.UTF-8

NanoSV BC01_sorted.bam -o BC01.vcf

Then try to execute this script with ./nanosv.sh

herrroaa commented 6 years ago

I tried it and unfortunately I got the same error message!

mroosmalen commented 6 years ago

Can you check your environment variables to be sure:

echo $LANG
echo $LC_ALL

Both should give en_US.UTF-8 as output.

You can also try this: export LC_ALL=en_US.UTF-8

You can also check the default encoding in your Python:

import sys
sys.getdefaultencoding()
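
On Python 3, sys.getdefaultencoding() reports 'utf-8' regardless of the locale, so it may not reveal the problem. The encoding that open() falls back to, and therefore the one configparser used in the traceback above, comes from the locale. A minimal sketch of checking it and of reading a config with an explicit encoding follows; the file name, section and option are made up, and this is not how NanoSV itself reads its config.

```python
import configparser
import locale
import os
import tempfile

# Encoding that open() uses when none is given; it normally follows
# LC_ALL / LC_CTYPE / LANG.
print(locale.getpreferredencoding(False))


def read_config(path):
    """Read an INI file as UTF-8 regardless of the current locale."""
    cfg = configparser.ConfigParser()
    with open(path, encoding="utf-8") as handle:
        cfg.read_file(handle)
    return cfg


# Demo with a throwaway file holding a non-ASCII comment (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_config.ini")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("# typographic apostrophe: \u2019\n[Filter Options]\nmin_mapq = 20\n")
    cfg = read_config(path)
    print(cfg["Filter Options"]["min_mapq"])
```

If the first print shows an ASCII encoding (for example US-ASCII or ANSI_X3.4-1968), the locale variables have not taken effect in that shell.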

herrroaa commented 6 years ago

Hi, thanks for the reply. export LC_ALL=en_US.UTF-8 worked, but I have one more question now. I need to generate a bed file for hg38 with chrX notation. I got the simple_repeats_file from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/simpleRepeat.txt.gz and the gaps_file from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/gap.txt.gz

I then unzipped the files and renamed them gap.bed and simpleRepeat.bed. I modified the lengths of the chromosomes:

genome = { 1: 248956422, 2: 242193529, 3: 198295559, 4: 190214555, 5: 181538259, 6: 170805979, 7: 159345973, 8: 145138636, 9: 138394717, 10: 133797422, 11: 135086622, 12: 133275309, 13: 114364328, 14: 107043718, 15: 101991189, 16: 90338345, 17: 83257441, 18: 80373285, 19: 58617616, 20: 64444167, 21: 46709983, 22: 50818468, 23: 156040895, 24: 57227415 }

simple_repeats_file = '/Users/tarekmagdyshehatamohamed/Downloads/gap.bed'
gaps_file = '/Users/tarekmagdyshehatamohamed/Downloads/simpleRepeat.bed'

I then ran the .py file and got this error message:

$ cd /Users/tarekmagdyshehatamohamed/miniconda3/pkgs/nanosv-1.2.0-py36_1/lib/python3.6/site-packages/nanosv/bedfiles ;env "PYTHONIOENCODING=UTF-8" "PYTHONUNBUFFERED=1" /Users/tarekmagdyshehatamohamed/miniconda3/envs/pythontwopointseven/bin/python /Users/tarekmagdyshehatamohamed/.vscode/extensions/ms-python.python-2018.7.1/pythonFiles/PythonTools/visualstudio_py_launcher.py /Users/tarekmagdyshehatamohamed/miniconda3/pkgs/nanosv-1.2.0-py36_1/lib/python3.6/site-packages/nanosv/bedfiles 58853 34806ad9-833a-4524-8cd6-18ca4aa74f14 RedirectOutput,RedirectOutput /Users/tarekmagdyshehatamohamed/miniconda3/pkgs/nanosv-1.2.0-py36_1/lib/python3.6/site-packages/nanosv/bedfiles/create_human_hg38_bed.py
Traceback (most recent call last):
  File "/Users/tarekmagdyshehatamohamed/miniconda3/pkgs/nanosv-1.2.0-py36_1/lib/python3.6/site-packages/nanosv/bedfiles/create_human_hg38_bed.py", line 63, in <module>
    read_bed(simple_repeats_file)
  File "/Users/tarekmagdyshehatamohamed/miniconda3/pkgs/nanosv-1.2.0-py36_1/lib/python3.6/site-packages/nanosv/bedfiles/create_human_hg38_bed.py", line 46, in read_bed
    ch, start, end = line.split("\t")
ValueError: too many values to unpack

mroosmalen commented 6 years ago

This is because the downloaded files are not proper bed files. First you need to convert each .txt file to a .bed file:

cut -f 2,3,4 gap.txt > gap.bed
cut -f 2,3,4 simpleRepeat.txt > simpleRepeat.bed

You may also need to remove the chr prefix from the chromosome names, depending on whether your bam file uses the chr notation or not.
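
The same conversion can be sketched in Python. This is only an illustration, assuming the UCSC table layout implied by the cut commands above, where column 1 is a bin index and columns 2 to 4 are chromosome, start and end. The function name and the strip_chr flag are made up for this example, and the trailing columns of the sample rows are placeholders.

```python
def ucsc_table_to_bed(lines, strip_chr=False):
    """Yield 'chrom<TAB>start<TAB>end' strings from UCSC table rows.

    Roughly equivalent to `cut -f 2,3,4`; optionally drops a leading 'chr'.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        chrom, start, end = fields[1:4]
        if strip_chr and chrom.startswith("chr"):
            chrom = chrom[3:]
        yield "\t".join((chrom, start, end))


# Two rows shaped like gap.txt: bin, chrom, chromStart, chromEnd, then extras.
rows = [
    "585\tchr1\t0\t10000\t1\tN\t10000\ttelomere\tno\n",
    "586\tchr1\t207666\t257666\t5\tN\t50000\tcontig\tno\n",
]
for bed_line in ucsc_table_to_bed(rows, strip_chr=True):
    print(bed_line)
```

Leaving strip_chr at its default keeps the chr prefix, which is the right choice when the bam header lists names such as chr1.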

herrroaa commented 6 years ago

These are my bed files

$ cat gap.bed | head -n 10

chr1    0   10000
chr1    207666  257666
chr1    297968  347968
chr1    535988  585988
chr1    2702781 2746290
chr1    12954384    13004384
chr1    16799163    16849163
chr1    29552233    29553835
chr1    121976459   122026459
chr1    122224535   122224635

$ cat simpleRepeat.bed | head -n 5

chr1    10000   10468
chr1    10627   10800
chr1    10757   10997
chr1    11225   11447
chr1    11271   11448

When I run the .py file I get this error:

$ cd /Users/tarekmagdyshehatamohamed/miniconda3/pkgs/nanosv-1.2.0-py36_1/lib/python3.6/site-packages/nanosv/bedfiles ;env "PYTHONIOENCODING=UTF-8" "PYTHONUNBUFFERED=1" /Users/tarekmagdyshehatamohamed/miniconda3/envs/pythontwopointseven/bin/python /Users/tarekmagdyshehatamohamed/.vscode/extensions/ms-python.python-2018.7.1/pythonFiles/PythonTools/visualstudio_py_launcher.py /Users/tarekmagdyshehatamohamed/miniconda3/pkgs/nanosv-1.2.0-py36_1/lib/python3.6/site-packages/nanosv/bedfiles 50921 34806ad9-833a-4524-8cd6-18ca4aa74f14 RedirectOutput,RedirectOutput /Users/tarekmagdyshehatamohamed/miniconda3/pkgs/nanosv-1.2.0-py36_1/lib/python3.6/site-packages/nanosv/bedfiles/create_human_hg38_bed.py
Traceback (most recent call last):
  File "/Users/tarekmagdyshehatamohamed/miniconda3/pkgs/nanosv-1.2.0-py36_1/lib/python3.6/site-packages/nanosv/bedfiles/create_human_hg38_bed.py", line 79, in <module>
    for mask_start in sorted(mask_regions[randchr]):
KeyError: 5

mroosmalen commented 6 years ago

It can't find chromosome 5 in the bed file, because the bed file uses the chr notation. Either remove the chr notation from the bed file or add the chr notation to the chromosomes in the genome dictionary.
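
The mismatch can be reproduced in a few lines. The sketch below is not the NanoSV script; it assumes a genome dictionary keyed by integers, as in the comment above, and uses a made-up helper and a made-up position to bring the BED chromosome names onto the same keys.

```python
genome = {5: 181538259}               # integer keys, as in the genome dictionary
mask_regions = {"chr5": [19440376]}   # keys as read from a BED file with chr notation

# The integer key 5 is absent from mask_regions, hence KeyError: 5.
print(5 in mask_regions)


def to_genome_key(chrom):
    """Map 'chr5' or '5' to 5; leave non-numeric names (e.g. 'X') as strings."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return int(name) if name.isdigit() else name


# Re-key the mask so lookups by genome key succeed.
normalized = {to_genome_key(chrom): starts for chrom, starts in mask_regions.items()}
print(sorted(normalized[5]))
```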

mroosmalen commented 6 years ago

There was a minor bug in the script. It should be fixed in the newest version of NanoSV (v1.2.1).

herrroaa commented 6 years ago

I used the new script create_random_position_bed.py, but I encountered some issues.

1- I had to change genome.iteritems() to genome.items() to be compatible with python3.6.3 that I am using.

2- I had to delete randchr = randchr.replace('chr','') because it gave the error AttributeError: 'int' object has no attribute 'replace'. I am not sure why this line is needed.

3- There is a missing 'd' in ranchr at line 90 and line 91

4- I increased pick_random = 100000 to 1000000

5- Using this script with gap and simple repeats bed files with chr notation, and a genome dictionary with chromosome numbers without chr notation, works fine. Eventually, I had to add the chr notation to my generated bed file because my bam file has the chr notation. I used sed 's/^/chr/' test.bed > withchr.test.bed
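
Points 1 and 2 can be illustrated without the script itself. In Python 3, dict.iteritems() no longer exists and items() replaces it, and replace() is a string method, so an integer key has to be converted with str() first. The loop below is a made-up stand-in, not the code from create_random_position_bed.py.

```python
# Mixed integer and string keys, for illustration only.
genome = {1: 248956422, "chrX": 156040895}

cleaned = {}
for randchr, length in genome.items():   # items() is the Python 3 spelling
    # str() makes replace() safe for integer keys such as 1.
    name = str(randchr).replace("chr", "")
    cleaned[name] = length

print(cleaned)
```

Note that after this step every key is a string, so later lookups need string keys as well.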

What do you think about this?

mroosmalen commented 6 years ago

Sorry, but there were still some bugs in the script. Can you try again with the newest version (1.2.2)?

herrroaa commented 6 years ago

I used the new script with bed files with chr notation and the genome dictionary as follows:

genome = { 1: 248956422, 2: 242193529, 3: 198295559, 4: 190214555, 5: 181538259, 6: 170805979, 7: 159345973, 8: 145138636, 9: 138394717, 10: 133797422, 11: 135086622, 12: 133275309, 13: 114364328, 14: 107043718, 15: 101991189, 16: 90338345, 17: 83257441, 18: 80373285, 19: 58617616, 20: 64444167, 21: 46709983, 22: 50818468, 23: 156040895, 24: 57227415 }

Then I added the chr notation to my hg38.bed file, as my bam file has the chr notation as well:

sed 's/^/chr/' hg38.bed > hg38.withchr.bed

cat hg38.withchr.bed | head -n 5
chr19   34671356    34671357
chr5    19440376    19440377
chr7    30238054    30238055
chr2    181665531   181665532
chr7    57515231    57515232
cat BC01_minimap.sam | head -n 5
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259

I then tried to use nanosv but I got an error

$ NanoSV BC01_sorted.bam -s /Users/tarekmagdyshehatamohamed/miniconda3/pkgs/samtools-1.8-3/bin/samtools -b hg38.withchr.bed -o BC01.noanosv.vcf
Fri Aug 10 19:28:37 2018 Busy with calculating the coverage distribution...
dyld: Library not loaded: @rpath/libdeflate.so
  Referenced from: /Users/tarekmagdyshehatamohamed/miniconda3/pkgs/samtools-1.8-3/bin/samtools
  Reason: image not found
Can't calculate coverage distribution. The bed file may be inappropriate for your bam file.

My bam file has chrX and chrY, while hg38.withchr.bed has chr23 and chr24, so I thought this might be causing the problem. I substituted chr23 with chrX and chr24 with chrY. I then ran the command one more time and it still did not work.
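
Both renaming steps (adding the chr prefix and turning 23 and 24 into X and Y) can be done in one pass. A minimal sketch, assuming the numeric names produced by the genome dictionary above; the function name is hypothetical and the second sample line is made up.

```python
# Numbering used for the sex chromosomes in the genome dictionary above.
SEX_CHROMOSOMES = {"23": "X", "24": "Y"}


def to_bam_name(bed_line):
    """Rewrite the first BED column so that 19 becomes chr19 and 23 becomes chrX."""
    chrom, rest = bed_line.rstrip("\n").split("\t", 1)
    chrom = SEX_CHROMOSOMES.get(chrom, chrom)
    return f"chr{chrom}\t{rest}"


for line in ["19\t34671356\t34671357", "23\t100\t101"]:
    print(to_bam_name(line))
```

Doing the mapping before adding the prefix avoids ever writing chr23 or chr24, names that the bam header shown above does not contain.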

herrroaa commented 6 years ago

Any thoughts ?

mroosmalen commented 6 years ago

And if you try this on a subset of your bed file:

$SAMTOOLS depth $BAM -b $BEDFILE | awk '{print $3}'

Do you get any result?

herrroaa commented 6 years ago

yes I got some results

$ samtools depth BC01_sorted.bam -b hg38_withchr02_xyy.bed | awk '{print $3}' | head -n 5
1
1
1
1
1

mroosmalen commented 6 years ago

Did you use the same samtools path? I see both samtools and /Users/tarekmagdyshehatamohamed/miniconda3/pkgs/samtools-1.8-3/bin/samtools. NanoSV executes the previous command to calculate the coverage distribution, and if it returns an empty list you get this error. But it seems that it does return a list, so it should work.

I think the problem is samtools and the libdeflate.so library.
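
To see which samtools binary is picked up and whether the depth step yields anything, the check can be sketched in Python. This approximates the command shown earlier and is not NanoSV's own code; the function names are made up, the options are placed before the bam file as in the samtools usage line, and nothing is executed until run_depth is called with real paths.

```python
import shutil
import subprocess


def parse_depth(text):
    """Return the depth column (third field) of `samtools depth` output."""
    return [int(line.split()[2]) for line in text.splitlines() if line.strip()]


def run_depth(samtools, bam, bed):
    """Run `samtools depth` on the BED positions and return the depths.

    Raises CalledProcessError if samtools exits abnormally, for example
    when a shared library such as libdeflate cannot be loaded.
    """
    result = subprocess.run(
        [samtools, "depth", "-b", bed, bam],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_depth(result.stdout)


# Which binary does a bare `samtools` resolve to on PATH (None if absent)?
print(shutil.which("samtools"))
# Parsing a small, made-up sample of depth output.
print(parse_depth("chr1\t100\t1\nchr1\t101\t3\n"))
```

Calling run_depth once with the path printed by shutil.which and once with the path passed to NanoSV via -s would show whether only one of the two binaries fails to start.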