velocyto-team / velocyto.py

RNA velocity estimation in Python
http://velocyto.org/velocyto.py/
BSD 2-Clause "Simplified" License

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa7 in position 0: invalid start byte #322

Open vagabond12 opened 2 years ago

vagabond12 commented 2 years ago

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa7 in position 0: invalid start byte

When I run velocyto with the repeat mask file, it always reports this error, and I do not know why.

$ velocyto run10x -m hg38_rmsk.gtf W11 refdata-cellranger-GRCh38-1.2.0/genes/genes.gtf

2021-12-06 17:03:32,095 - DEBUG - Seen 33583 genes until now
2021-12-06 17:03:32,095 - DEBUG - Parsing Chromosome Y strand - [line 2440379]
2021-12-06 17:03:32,102 - DEBUG - Done with Y- [line 2442622]
2021-12-06 17:03:32,102 - DEBUG - Assigning indexes to genes
2021-12-06 17:03:32,102 - DEBUG - Seen 33635 genes until now
2021-12-06 17:03:32,102 - DEBUG - Parsing Chromosome Y strand + [line 2442623]
2021-12-06 17:03:32,111 - DEBUG - Assigning indexes to genes
2021-12-06 17:03:32,111 - DEBUG - Done with Y+ [line 2445372]
2021-12-06 17:03:32,111 - DEBUG - Fixing corner cases of transcript models containg intron longer than 1000Kbp
2021-12-06 17:03:33,274 - DEBUG - Generated 2058282 features corresponding to 167370 transcript models from /home/soft/refdata-cellranger-GRCh38-1.2.0/genes/genes.gtf
2021-12-06 17:03:33,298 - INFO - Load the repeat masking annotation from /home/soft/hg38_rmsk.gtf
2021-12-06 17:03:33,298 - DEBUG - Reading /home/soft/hg38_rmsk.gtf, the file will be sorted in memory
Traceback (most recent call last):
  File "/home/jxx/anaconda3/envs/r403/bin/velocyto", line 8, in <module>
    sys.exit(cli())
  File "/home/jxx/anaconda3/envs/r403/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/jxx/anaconda3/envs/r403/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/jxx/anaconda3/envs/r403/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jxx/anaconda3/envs/r403/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jxx/anaconda3/envs/r403/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/jxx/anaconda3/envs/r403/lib/python3.9/site-packages/velocyto/commands/run10x.py", line 112, in run10x
    return _run(bamfile=(bamfile, ), gtffile=gtffile, bcfile=bcfile, outputfolder=outputfolder,
  File "/home/jxx/anaconda3/envs/r403/lib/python3.9/site-packages/velocyto/commands/_run.py", line 196, in _run
    mask_ivls_by_chromstrand = exincounter.read_repeats(repmask)
  File "/home/jxx/anaconda3/envs/r403/lib/python3.9/site-packages/velocyto/counter.py", line 340, in read_repeats
    gtf_lines = [line for line in open(gtf_file) if not line.startswith('#')]
  File "/home/jxx/anaconda3/envs/r403/lib/python3.9/site-packages/velocyto/counter.py", line 340, in <listcomp>
    gtf_lines = [line for line in open(gtf_file) if not line.startswith('#')]
  File "/home/jxx/anaconda3/envs/r403/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa7 in position 0: invalid start byte

kanefos commented 2 years ago

also having this error!

nroak commented 1 year ago

I have this issue as well:

sort: invalid option -- 't'
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -o FILE    Write final output to FILE rather than standard output
  -O FORMAT  Write output as FORMAT ('sam'/'bam'/'cram')   (either -O or
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam       -T is required)
  -@ INT     Set number of sorting and compression threads [1]

Legacy usage: samtools sort [options...] <in.bam> <out.prefix>
Options:
  -f         Use <out.prefix> as full final filename rather than prefix
  -o         Write final output to stdout rather than <out.prefix>.bam
  -l,m,n,@   Similar to corresponding options above
Traceback (most recent call last):
  File "/home/noak/.conda/envs/velocyto/bin/velocyto", line 8, in <module>
    sys.exit(cli())
  File "/home/noak/.conda/envs/velocyto/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/noak/.conda/envs/velocyto/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/noak/.conda/envs/velocyto/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/noak/.conda/envs/velocyto/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/noak/.conda/envs/velocyto/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/noak/.conda/envs/velocyto/lib/python3.8/site-packages/velocyto/commands/run10x.py", line 112, in run10x
    return _run(bamfile=(bamfile, ), gtffile=gtffile, bcfile=bcfile, outputfolder=outputfolder,
  File "/home/noak/.conda/envs/velocyto/lib/python3.8/site-packages/velocyto/commands/_run.py", line 196, in _run
    mask_ivls_by_chromstrand = exincounter.read_repeats(repmask)
  File "/home/noak/.conda/envs/velocyto/lib/python3.8/site-packages/velocyto/counter.py", line 340, in read_repeats
    gtf_lines = [line for line in open(gtf_file) if not line.startswith('#')]
  File "/home/noak/.conda/envs/velocyto/lib/python3.8/site-packages/velocyto/counter.py", line 340, in <listcomp>
    gtf_lines = [line for line in open(gtf_file) if not line.startswith('#')]
  File "/home/noak/.conda/envs/velocyto/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 0: invalid start byte
robingarcia commented 11 months ago

I also have this error. How can I solve it?

(cellcycle) robin.garcia@rie-hpc-h723001:~/data/transcriptomics:velocyto run10x results/cr_eq_MFGE8 resources/Equus.caballus_genome/genes/genes.gtf.gz
2023-09-20 08:45:49,659 - DEBUG - Using logic: Default
2023-09-20 08:45:49,663 - INFO - Read 4413 cell barcodes from /data/users/robin.garcia/transcriptomics/results/cr_eq_MFGE8/outs/filtered_feature_bc_matrix/barcodes.tsv.gz
2023-09-20 08:45:49,663 - DEBUG - Example of barcode: AAACCCAAGCGGTATG and cell_id: cr_eq_MFGE8:AAACCCAAGCGGTATG-1
2023-09-20 08:45:49,665 - DEBUG - Peeking into /data/users/robin.garcia/transcriptomics/results/cr_eq_MFGE8/outs/possorted_genome_bam.bam
2023-09-20 08:45:49,694 - WARNING - Not found cell and umi barcode in entry 8 of the bam file
2023-09-20 08:45:49,694 - WARNING - Not found cell and umi barcode in entry 9 of the bam file
2023-09-20 08:45:49,694 - WARNING - Not found cell and umi barcode in entry 10 of the bam file
2023-09-20 08:45:49,694 - WARNING - Not found cell and umi barcode in entry 18 of the bam file
2023-09-20 08:45:49,694 - WARNING - Not found cell and umi barcode in entry 19 of the bam file
2023-09-20 08:45:49,694 - WARNING - Not found cell and umi barcode in entry 27 of the bam file
2023-09-20 08:45:49,694 - WARNING - Not found cell and umi barcode in entry 36 of the bam file
2023-09-20 08:45:49,694 - WARNING - Not found cell and umi barcode in entry 81 of the bam file
2023-09-20 08:45:49,694 - WARNING - Not found cell and umi barcode in entry 82 of the bam file
2023-09-20 08:45:49,694 - WARNING - Not found cell and umi barcode in entry 83 of the bam file
2023-09-20 08:45:49,694 - WARNING - Not found cell and umi barcode in entry 84 of the bam file
2023-09-20 08:45:49,695 - WARNING - Not found cell and umi barcode in entry 151 of the bam file
2023-09-20 08:45:49,695 - WARNING - Not found cell and umi barcode in entry 156 of the bam file
2023-09-20 08:45:49,695 - WARNING - Not found cell and umi barcode in entry 157 of the bam file
2023-09-20 08:45:49,695 - WARNING - Not found cell and umi barcode in entry 158 of the bam file
2023-09-20 08:45:49,695 - WARNING - Not found cell and umi barcode in entry 159 of the bam file
2023-09-20 08:45:49,695 - WARNING - Not found cell and umi barcode in entry 160 of the bam file
2023-09-20 08:45:49,695 - WARNING - Not found cell and umi barcode in entry 161 of the bam file
2023-09-20 08:45:49,695 - WARNING - Not found cell and umi barcode in entry 163 of the bam file
2023-09-20 08:45:49,695 - WARNING - Not found cell and umi barcode in entry 164 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 165 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 166 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 167 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 168 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 169 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 170 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 171 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 178 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 187 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 189 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 190 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 193 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 222 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 223 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 229 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 238 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 283 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 296 of the bam file
2023-09-20 08:45:49,696 - WARNING - Not found cell and umi barcode in entry 297 of the bam file
2023-09-20 08:45:49,697 - WARNING - Not found cell and umi barcode in entry 519 of the bam file
2023-09-20 08:45:49,697 - WARNING - Not found cell and umi barcode in entry 586 of the bam file
2023-09-20 08:45:49,697 - WARNING - Not found cell and umi barcode in entry 637 of the bam file
2023-09-20 08:45:49,697 - WARNING - Not found cell and umi barcode in entry 655 of the bam file
2023-09-20 08:45:49,697 - WARNING - Not found cell and umi barcode in entry 656 of the bam file
2023-09-20 08:45:49,697 - WARNING - Not found cell and umi barcode in entry 658 of the bam file
2023-09-20 08:45:49,697 - WARNING - Not found cell and umi barcode in entry 659 of the bam file
2023-09-20 08:45:49,697 - WARNING - Not found cell and umi barcode in entry 660 of the bam file
2023-09-20 08:45:49,697 - WARNING - Not found cell and umi barcode in entry 661 of the bam file
2023-09-20 08:45:49,698 - WARNING - Not found cell and umi barcode in entry 662 of the bam file
2023-09-20 08:45:49,698 - WARNING - Not found cell and umi barcode in entry 663 of the bam file
2023-09-20 08:45:49,698 - WARNING - Not found cell and umi barcode in entry 970 of the bam file
2023-09-20 08:45:49,698 - WARNING - Not found cell and umi barcode in entry 1006 of the bam file
2023-09-20 08:45:49,698 - WARNING - Not found cell and umi barcode in entry 1007 of the bam file
2023-09-20 08:45:49,698 - WARNING - Not found cell and umi barcode in entry 1009 of the bam file
2023-09-20 08:45:49,703 - INFO - Starting the sorting process of /data/users/robin.garcia/transcriptomics/results/cr_eq_MFGE8/outs/possorted_genome_bam.bam the output will be at: /data/users/robin.garcia/transcriptomics/results/cr_eq_MFGE8/outs/cellsorted_possorted_genome_bam.bam
2023-09-20 08:45:49,703 - INFO - Command being run is: samtools sort -l 7 -m 2048M -t CB -O BAM -@ 16 -o /data/users/robin.garcia/transcriptomics/results/cr_eq_MFGE8/outs/cellsorted_possorted_genome_bam.bam /data/users/robin.garcia/transcriptomics/results/cr_eq_MFGE8/outs/possorted_genome_bam.bam
2023-09-20 08:45:49,703 - INFO - While the bam sorting happens do other things...
2023-09-20 08:45:49,703 - INFO - Load the annotation from /data/users/ms-ag-biotech/reference/Equus_caballus/109/Equus.caballus_genome/genes/genes.gtf.gz
Traceback (most recent call last):
  File "/data/users/robin.garcia/miniconda3/envs/cellcycle/bin/velocyto", line 11, in <module>
    sys.exit(cli())
  File "/data/users/robin.garcia/miniconda3/envs/cellcycle/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/data/users/robin.garcia/miniconda3/envs/cellcycle/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/data/users/robin.garcia/miniconda3/envs/cellcycle/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/users/robin.garcia/miniconda3/envs/cellcycle/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/users/robin.garcia/miniconda3/envs/cellcycle/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/data/users/robin.garcia/miniconda3/envs/cellcycle/lib/python3.10/site-packages/velocyto/commands/run10x.py", line 112, in run10x
    return _run(bamfile=(bamfile, ), gtffile=gtffile, bcfile=bcfile, outputfolder=outputfolder,
  File "/data/users/robin.garcia/miniconda3/envs/cellcycle/lib/python3.10/site-packages/velocyto/commands/_run.py", line 186, in _run
    annotations_by_chrm_strand = exincounter.read_transcriptmodels(gtffile)
  File "/data/users/robin.garcia/miniconda3/envs/cellcycle/lib/python3.10/site-packages/velocyto/counter.py", line 463, in read_transcriptmodels
    gtf_lines = [line for line in open(gtf_file) if not line.startswith('#')]
  File "/data/users/robin.garcia/miniconda3/envs/cellcycle/lib/python3.10/site-packages/velocyto/counter.py", line 463, in <listcomp>
    gtf_lines = [line for line in open(gtf_file) if not line.startswith('#')]
  File "/data/users/robin.garcia/miniconda3/envs/cellcycle/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
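
A note on this last traceback: byte 0x8b at position 1 matches the second byte of the gzip magic number (0x1f 0x8b), which is consistent with genes.gtf.gz being read as plain text by the `open(gtf_file)` call shown above. The following is a minimal sketch for checking a file before passing it to velocyto. The helper name `looks_gzipped` is hypothetical and not part of velocyto.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number.

    Hypothetical helper for illustration; it only inspects the first two
    bytes and does not validate the rest of the file.
    """
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

For example, `looks_gzipped("genes.gtf.gz")` should return True for a real gzip archive. A False result for a file that still fails to decode (such as the 0xa7 and 0xb2 cases earlier in the thread) points to a different cause, for instance another encoding or a damaged file.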

wchenqi commented 8 months ago

I have the same error! Have you figured out how to solve this problem?

mortunco commented 4 months ago

I checked everyone's errors here. @robingarcia and I had the same situation, where a GTF (the repeatmasker GTF in my case) was gzipped. Make sure all GTFs are uncompressed before passing them to velocyto. For files that are not compressed, check whether the file is corrupted. After decompressing, this command ran for me:

velocyto run10x -m resources/mm10_repeatmasker_UCSC.gtf CR resources/refdata-cellranger-mm10-1.2.0/genes/genes.gtf --dtype uint32
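
The two checks suggested above (decompress the GTF, then look for corruption) can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `gunzip_to` and `first_bad_line` are hypothetical and not part of velocyto, and "corruption" is narrowed here to "contains bytes that are not valid UTF-8", which is what the traceback complains about.

```python
import gzip
import shutil


def gunzip_to(src, dst):
    """Decompress the gzipped file `src` into the plain file `dst`."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def first_bad_line(path):
    """Locate the first byte that is not valid UTF-8.

    Returns (line_number, byte_offset_within_line), with line numbers
    starting at 1, or None if every line decodes cleanly.
    """
    with open(path, "rb") as handle:
        for lineno, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                return lineno, err.start
    return None
```

A possible use: call `gunzip_to("genes.gtf.gz", "genes.gtf")`, then `first_bad_line("genes.gtf")`. A result of None means the file decodes as UTF-8; a tuple shows where to look for the offending byte.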

Cheers.

BOOS-TNT commented 3 months ago

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa7 in position 0: invalid start byte