phac-nml / mob-suite

MOB-suite: Software tools for clustering, reconstruction and typing of plasmids from draft assemblies
Apache License 2.0

Mob_Recon fails with compressed input #153

Open jvfe opened 1 year ago

jvfe commented 1 year ago

Hi,

I'm using mob_recon (v3.1.7) on some assemblies and I've noticed that it fails when using a gzip-compressed file and succeeds when using the same file, but decompressed. It looks to be some error related to utf-8 encoding.

Is this expected and is there any way to circumvent this other than decompressing my assemblies? I have over 8000 assemblies so I'm hoping to avoid having to decompress all of them.

Command used

mob_recon --infile SAMD00000756.contigs.fa.gz --num_threads 6 \
--sample_id SAMD00000756 --unicycler_contigs \
--outdir SAMD00000756_mob_recon --debug \
--run_overhang
Error log

```
2023-11-09 16:27:35,689 mob_suite.mob_recon INFO: MOB-recon version 3.1.7 [in /home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/site-packages/mob_suite/mob_recon.py:981]
2023-11-09 16:27:35,689 mob_suite.mob_recon DEBUG: Debug log reporting set on successfully [in /home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/site-packages/mob_suite/mob_recon.py:982]
2023-11-09 16:27:35,689 mob_suite.mob_recon INFO: SUCCESS: Found program blastn at /home/jvfe/miniconda3/envs/mobsuite/bin/blastn [in /home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/site-packages/mob_suite/utils.py:592]
2023-11-09 16:27:35,689 mob_suite.mob_recon INFO: SUCCESS: Found program makeblastdb at /home/jvfe/miniconda3/envs/mobsuite/bin/makeblastdb [in /home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/site-packages/mob_suite/utils.py:592]
2023-11-09 16:27:35,689 mob_suite.mob_recon INFO: SUCCESS: Found program tblastn at /home/jvfe/miniconda3/envs/mobsuite/bin/tblastn [in /home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/site-packages/mob_suite/utils.py:592]
2023-11-09 16:27:35,689 mob_suite.mob_recon INFO: Processing fasta file SAMD00000756.contigs.fa.gz [in /home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/site-packages/mob_suite/mob_recon.py:1008]
2023-11-09 16:27:35,689 mob_suite.mob_recon INFO: Analysis directory SAMD00000756_mob_recon [in /home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/site-packages/mob_suite/mob_recon.py:1009]
2023-11-09 16:27:40,596 mob_suite.mob_recon INFO: Writing cleaned header input fasta file from SAMD00000756.contigs.fa.gz to SAMD00000756_mob_recon/__tmp/fixed.input.fasta [in /home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/site-packages/mob_suite/mob_recon.py:1104]
Traceback (most recent call last):
  File "/home/jvfe/miniconda3/envs/mobsuite/bin/mob_recon", line 10, in <module>
    sys.exit(main())
  File "/home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/site-packages/mob_suite/mob_recon.py", line 1105, in main
    id_mapping = fix_fasta_header(input_fasta, fixed_fasta)
  File "/home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/site-packages/mob_suite/utils.py", line 820, in fix_fasta_header
    for record in SeqIO.parse(handle, "fasta"):
  File "/home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 72, in __next__
    return next(self.records)
  File "/home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/site-packages/Bio/SeqIO/FastaIO.py", line 238, in iterate
    for title, sequence in SimpleFastaParser(handle):
  File "/home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/site-packages/Bio/SeqIO/FastaIO.py", line 50, in SimpleFastaParser
    for line in handle:
  File "/home/jvfe/miniconda3/envs/mobsuite/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
```

Running gunzip SAMD00000756.contigs.fa.gz and then re-running the command above works as expected.

I've attached the assembly below. SAMD00000756.contigs.fa.gz

kbessonov1984 commented 1 year ago

Hello, MOB-Suite tools do not support compressed inputs at the moment. mob_recon fails because it expects a plain-text FASTA file and instead receives a gzip-compressed one. I know that gzip-compressed genomes take significantly less space and that supporting compressed inputs would be a convenience feature, but it is low priority for us. Let's keep this issue open as a reminder for us and as a feature request.
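To illustrate why the traceback complains about byte 0x8b at position 1: a gzip file starts with the two magic bytes 0x1f 0x8b, and the second one is not a valid UTF-8 start byte. The sketch below is not MOB-suite code; `is_gzipped`, `read_as_text` and `demo` are hypothetical helper names used only to reproduce the symptom on a tiny synthetic file and to show how a compressed input could be detected up front.

```python
import gzip
import tempfile
from pathlib import Path

# The first two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file starts with the two gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def read_as_text(path):
    """Read a file as UTF-8 text, the way a plain-text FASTA reader would."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


def demo():
    """Write a tiny gzipped FASTA, then try to read it as plain text.

    Returns (is_gzipped flag, decode error message or "" if no error).
    """
    with tempfile.TemporaryDirectory() as tmp:
        gz_path = Path(tmp) / "example.fa.gz"
        with gzip.open(gz_path, "wt") as out:
            out.write(">contig1\nACGT\n")
        try:
            read_as_text(gz_path)
        except UnicodeDecodeError as err:
            message = str(err)
        else:
            message = ""
        return is_gzipped(gz_path), message


if __name__ == "__main__":
    print(demo())
```

Checking the magic bytes is more reliable than trusting a `.gz` extension, since files are sometimes renamed without being recompressed.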

For now, please decompress inputs before running MOB-Suite tools. If space is a limitation, you can temporarily decompress the inputs, run MOB-Suite tools, and then delete the decompressed files. You can write a simple Bash or Python script or implement it as a Nextflow pipeline.
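A minimal Python sketch of that workaround is below. It assumes `mob_recon` is on the `PATH` and uses only flags that appear in the command from the original report. The function names (`decompress_to`, `build_command`, `run_on_gzipped`) and the `runner` parameter are hypothetical and not part of MOB-suite. Each assembly is decompressed into a temporary directory that is removed as soon as the run finishes, so only one decompressed copy exists at a time.

```python
import gzip
import shutil
import subprocess
import tempfile
from pathlib import Path


def decompress_to(gz_path, out_path):
    """Stream-decompress gz_path into out_path without loading it into memory."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def build_command(fasta, sample_id, outdir, threads=6):
    """Assemble a mob_recon call using flags taken from the report above."""
    return [
        "mob_recon",
        "--infile", str(fasta),
        "--sample_id", sample_id,
        "--outdir", str(outdir),
        "--num_threads", str(threads),
    ]


def run_on_gzipped(gz_path, sample_id, outdir, runner=subprocess.run):
    """Decompress to a temp dir, run mob_recon, then discard the temp copy.

    `runner` defaults to subprocess.run; it is injectable so the workflow
    can be exercised without MOB-suite installed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # "sample.contigs.fa.gz" -> "sample.contigs.fa"
        fasta = Path(tmp) / Path(gz_path).with_suffix("").name
        decompress_to(gz_path, fasta)
        runner(build_command(fasta, sample_id, outdir), check=True)
    # The temporary directory and the decompressed FASTA are gone here;
    # the original .gz file is left untouched.
```

For a large batch, the same function could be called in a loop over `Path("assemblies").glob("*.fa.gz")` (directory name assumed), deriving `sample_id` from each file name.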