vibansal / HapCUT2

software tools for haplotype assembly from sequence data
BSD 2-Clause "Simplified" License

UnicodeDecodeError during linked_fragment_file generation #77

Open chunlinxiao opened 5 years ago

chunlinxiao commented 5 years ago

Traceback (most recent call last):
  File "./HapCUT2/utilities/LinkFragments.py", line 492, in <module>
    link_fragments(args.fragments,args.VCF,args.bam_file, args.outfile, args.distance, args.single_SNP_frags)
  File "./HapCUT2/utilities/LinkFragments.py", line 327, in link_fragments
    flist = read_fragment_matrix(hairs_file,vcf_file,chrom_filter=curr_chrom)
  File "./HapCUT2/utilities/LinkFragments.py", line 212, in read_fragment_matrix
    for line in fm:
  File "./HapCUT2/myenv3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 644: invalid start byte

Step 1 (extractHAIRS, generating the unlinked_fragment_file) ran okay.

But step 2 (LinkFragments.py, generating the linked_fragment_file) failed with the error above.

Do you have any idea? Thanks

FrickTobias commented 4 years ago

I have the same problem.

I've also identified and removed the row causing the error (by opening the unlinked.txt file in binary mode and inspecting it with f.seek(7010) and f.read(10)), but that just caused an error at position 7021 instead of 7010, so it seems there is a bigger problem here.
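
For reference, a rough sketch of that kind of byte-level check (find_invalid_utf8 is a made-up helper, not part of HapCUT2; the file name follows this thread):

def find_invalid_utf8(path):
    # Hypothetical diagnostic: report the first byte that is not valid UTF-8,
    # along with its offset, line number, and surrounding context.
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        print("file is valid UTF-8")
    except UnicodeDecodeError as err:
        line_no = data.count(b"\n", 0, err.start) + 1
        print("invalid byte 0x%02x at offset %d (line %d)" % (data[err.start], err.start, line_no))
        print(data[max(0, err.start - 30):err.start + 30])

find_invalid_utf8("unlinked.txt")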

error message

cat BLR_out/hapcut2_linkfragments.log 
Linking 10X fragments on chromosome: chr1
Traceback (most recent call last):
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/bin/LinkFragments.py", line 486, in <module>
    link_fragments(args.fragments,args.VCF,args.bam_file, args.outfile, args.distance, args.single_SNP_frags)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/bin/LinkFragments.py", line 321, in link_fragments
    flist = read_fragment_matrix(hairs_file,vcf_file,chrom_filter=curr_chrom)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/bin/LinkFragments.py", line 206, in read_fragment_matrix
    for line in fm:
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 7021: invalid start byte

Commands used to get input files

freebayes \
    -f genome_reference \
    input.bam \
    > vars.vcf
vcffilter \
    -f "AC > 0" \
    vars.vcf \
    > vars-filt.vcf
extractHAIRS \
         --10X 1 \
         --bam input.bam \
         --VCF vars-filt.vcf \
         --out unlinked.txt
LinkFragments.py \
         --bam input.bam \
         -v vars-filt.vcf \
         --fragments unlinked.txt \
         --out linked.txt
pjedge commented 4 years ago

What happens if you change to the "utilities" directory and run test_LinkFragment.py? (Please change the value of the EXTRACTHAIRS variable in the test script to the path of the extractHAIRS binary.)
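
For clarity, that edit is presumably just a variable assignment near the top of the test script; a sketch, with an illustrative path (substitute the output of which extractHAIRS on your system):

# in test_LinkFragment.py: point this at your extractHAIRS binary
EXTRACTHAIRS = "/usr/local/bin/extractHAIRS"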

FrickTobias commented 4 years ago

Thanks for taking the time to answer.

I'm running it through conda, so I'm not 100% sure my workaround is equivalent to your description. Therefore I'll include a description of what I did.

Results

It seems to finish without any errors.

cd HapCUT2-1.2/utilities/
pytest .
pytest HapCUT2-1.2/utilities/
================================================ test session starts =================================================
platform linux -- Python 3.6.7, pytest-5.2.2, py-1.8.0, pluggy-0.12.0
rootdir: /domus/h1/tobiasf
collected 31 items                                                                                                   

HapCUT2-1.2/utilities/test_LinkFragment.py ...............................                                     [100%]

================================================= 31 passed in 1.20s =================================================

What I did

To run test_LinkFragments.py I downloaded the latest release's source files (v1.2) and then changed the EXTRACTHAIRS variable to the output given by which extractHAIRS. Everything was run in the same environment as the original post, and I also double-checked which HapCUT2 versions conda distributes.

conda search -f hapcut2
Loading channels: done
# Name                       Version           Build  Channel             
hapcut2                          1.1      hed695b0_0  bioconda            
hapcut2                          1.2      hed50d52_0  bioconda
pjedge commented 4 years ago

Gotcha. So this bug is not caught by the existing tests. Can you show me which row of the unlinked fragment file was "causing" the crash?

FrickTobias commented 4 years ago

From the first error where it couldn't decode position 7010:

with open("unlinked.txt", "rb") as f:
    f.seek(7100)
    f.read(60)

b'0 F\n1 A00621:130:HN5HWDSXX:4:2467:8359:1595 2 AAAACAACGATCGT'
pjedge commented 4 years ago

This is very helpful. Would you mind seeking to a little earlier, and reading more, so that I can have the equivalent of 5-10 lines from the fragment file around that spot?

pjedge commented 4 years ago

I think I reproduced this bug with a 10x dataset of my own. Should be fixed soon.

FrickTobias commented 4 years ago

That's great, I'll leave the closing of the issue to you when it's been implemented. Let me know if you need something more.

Thank you so much for the help! If I could rate this interaction I'd give it a 10/10!

FrickTobias commented 4 years ago

This is very helpful. Would you mind seeking to a little earlier, and reading more, so that I can have the equivalent of 5-10 lines from the fragment file around that spot?

You'll find the row posted before in the middle of the dump (the one for read A00621:130:HN5HWDSXX:4:2467:8359:1595).

with open("unlinked.txt" ,"rb") as f: ... f.seek(6500) ... f.read(1000) ... 6500 b':HN5HWDSXX:4:2549:22426:20071 2 CTTGTGTGCTTCCTTGTTAG -1 1122 00 88\n1 A00621:130:HN5HWDSXX:4:1226:5538:10708 2 AAAACAACGATCGTTCTGCG -1 1132 0 F\n1 A00621:130:HN5HWDSXX:4:2229:6560:19304 2 TTAAGAAAGGCGTTAGCAAC -1 1134 0 F\n1 A00621:130:HN5HWDSXX:4:1557:6650:27007 2 CTTCGAACCTTGCGCACTTC -1 1141 0 F\n1 A00621:130:HN5HWDSXX:4:2365:19479:22169 2 CTAATTAATTTCGGTGTGTG -1 1145 0 F\n1 A00621:130:HN5HWDSXX:4:1502:12246:4695 2 GGTGGAAGTAAGCGCAGTAA -1 1145 00 FF\n1 A00621:130:HN5HWDSXX:4:2207:31331:23375 2 GAACGTTATTCGTGTAGAAA -1 1151 0 F\n1 A00621:130:HN5HWDSXX:4:2509:13485:19022 2 CTTAGGTCCGTCGTTCCTCA -1 1168 0 F\n1 A00621:130:HN5HWDSXX:4:2467:8359:1595 2 AAAACAACGATCGTTCTGCG -1 1178 00000000 FFFFFFFF\n1 A00621:130:HN5HWDSXX:4:2628:4589:9439_FILTERED 2 NULL -1 1189 0 :\n1 A00621:130:HN5HWDSXX:4:2314:5529:24220 2 GTAGGTAGGAAACTCCTTCC -1 1197 10 F:\n1 A00621:130:HN5HWDSXX:4:1145:28646:27743 2 GTACTTTATGTCTGCGCGCC -1 1198 0 <\n1 A00621:130:HN5HWDSXX:4:2341:10375:19210 2 CACGTGCATACAGTCGTATG -1 1198 0 ?\n1 A0062'

pjedge commented 4 years ago

So I didn't actually reproduce this same issue; it was a separate UTF-8 issue that looked similar.

Would it be possible for you to send your entire unlinked.txt (just that file) to me? edge dot peterj at gmail dot com

FrickTobias commented 4 years ago

Alright, I've mailed the file to you.

subject: HapCUT2 issue #77 file
sender: tobias.frick@scilifelab.se
FrickTobias commented 4 years ago

A little bit down the road...

Fixed by two commits:

There was also a downstream problem with buffer sizes that wasn't part of the original issue, but it was patched in order to verify the initial UTF-8 fix. It turned out to be rooted in a hard-coded buffer size that needed to be increased.

I've just finished running my data through all steps without any further hiccups, so I consider the issue closed.

Thanks to @pjedge for his excellent perseverance in resolving this issue.

FrickTobias commented 4 years ago

<Anyone who has author privileges, feel free to close this issue>

chunlinxiao commented 4 years ago

thanks

chunlinxiao commented 4 years ago

still have the same errors:

LinkFragments.py 10X ...
Linking 10X fragments on chromosome: scaffold_1
Traceback (most recent call last):
  File "/usr/local/HapMAP2/1.2/utilities/LinkFragments.py", line 486, in <module>
    link_fragments(args.fragments,args.VCF,args.bam_file, args.outfile, args.distance, args.single_SNP_frags)
  File "/usr/local/HapMAP2/1.2/utilities/LinkFragments.py", line 321, in link_fragments
    flist = read_fragment_matrix(hairs_file,vcf_file,chrom_filter=curr_chrom)
  File "/usr/local/HapMAP2/1.2/utilities/LinkFragments.py", line 206, in read_fragment_matrix
    for line in fm:
  File "/usr/local/HapMAP2/1.2/venv/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 4911: invalid start byte

chunlinxiao commented 4 years ago

Our systems have the v1.2 release installed, which may not include all the recent fixes. I guess this may be why I still got the same error.
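
In the meantime, a rough way to inspect the offending rows on v1.2 without the crash is to decode with a tolerant error handler (file name follows this thread; the "\x" check is only a heuristic):

# Diagnostic sketch, not a fix: backslashreplace turns invalid bytes into
# visible escapes instead of raising UnicodeDecodeError.
with open("unlinked.txt", encoding="utf-8", errors="backslashreplace") as fm:
    for lineno, line in enumerate(fm, 1):
        if "\\x" in line:  # heuristic: escaped bytes mark non-UTF-8 spots
            print(lineno, line.rstrip())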

DrPintoThe2nd commented 4 years ago

Hi,

I hit this same issue on the conda install, and when I tried a quick install from GitHub directly to troubleshoot, "make" generated this error:

"gcc -g -O3 -Wall -D_GNU_SOURCE -Ihtslib -c hairs-src/bamread.c -lhts -o build/bamread.o In file included from hairs-src/bamread.c:1:0:
hairs-src/bamread.h:10:10: fatal error: htslib/sam.h: No such file or directory #include "htslib/sam.h" ^~~~~~
compilation terminated.
Makefile:33: recipe for target 'build/bamread.o' failed
make: *** [build/bamread.o] Error 1"

htslib v1.9 is in my path.

Best, bjp

vibansal commented 4 years ago

If you add the actual path to 'htslib' to the Makefile (variable HTSLIB), the code should compile.
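
For example, if htslib is cloned alongside HapCUT2, the relevant Makefile line would look something like HTSLIB=../htslib (path illustrative; point it at your actual htslib directory).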

I pushed a minor change to LinkFragments.py that could potentially fix the UnicodeDecodeError. If you could please try it and let me know, it would be really helpful.

DrPintoThe2nd commented 4 years ago

Downloaded the new LinkFragments.py and successfully installed htslib and HapCUT2 from GitHub, and everything appears to be working!! I will post confirmation when it finishes completely, but it has passed the place where it used to error.

Thanks so much for your quick response and fix to this issue!! Really needed a win today! Best, bjp

DrPintoThe2nd commented 4 years ago

Hi Bansal lab, it appears that LinkFragments.py has been successfully patched, although it's still running (going on 10 days) and is only about 1/3 of the way through the genome (purely by scaffold count). Is this a normal run-time for this script? Thanks again! Best, bjp

pjedge commented 4 years ago

Not normal, but depends on the data. What's the scale of the dataset?

DrPintoThe2nd commented 4 years ago

1.8 Gb genome, ~3,000 scaffolds (with a scaffold L90 of 112), and ~1.7 billion 10X reads.

Currently approaching scaffold 1,000 without any notable increase in speed (although these scaffolds are mighty small).

olavurmortensen commented 3 years ago

I'm experiencing the same error as @chunlinxiao and @FrickTobias with v1.3.

I installed HapCUT2 v1.3.2 via conda (https://anaconda.org/bioconda/hapcut2). It's installed in one of my projects here. The recipe for the bioconda package can be found here: https://bioconda.github.io/recipes/hapcut2/README.html

Just to be clear, I don't think this is an issue with the bioconda recipe. In the recipe YAML it clearly uses v1.3.2 as source.

vibansal commented 3 years ago

Can you please share a dataset for reproducing this error? I can provide an upload link if needed. Thanks.

olavurmortensen commented 3 years ago

@vibansal If I use the method used above to find the particular line that causes the problem, will that suffice? I cannot share the full dataset.

EDIT:

The "method used above" being the following.

with open("unlinked.txt", "rb") as f:
    f.seek(7100)
    f.read(60)
vibansal commented 3 years ago

Yes, having a few lines around the problematic line will be helpful.

olavurmortensen commented 3 years ago

Hi, sorry about leaving this hanging. The problem solved itself. Most likely there was some versioning confusion.

You can close this issue.