Closed nick-youngblut closed 1 year ago
It appears that strainscan_build
cannot handle compressed (gzip'ed) genome files. The error generated:
Dashing version: v0.4.7-2-g6ad3
Traceback (most recent call last):
File "/opt/conda/bin/strainscan_build", line 10, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.7/site-packages/StrainScan/StrainScan_build.py", line 161, in main
Build_tree.build_tree([cls_res+'/distance_matrix.txt',cls_res+'/hclsMap_95_recls.txt',out_dir+'/Tree_database',31,params])
File "/opt/conda/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 275, in build_tree
kmer_index = extract_kmers(fna_mapping[i.identifier], fna_path, ksize, kmer_index_dict, kmer_index, Lv, spec, tree_dir, alpha_ratio, i.identifier)
File "/opt/conda/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 92, in extract_kmers
for seq_record in SeqIO.parse(fna_path[j], "fasta"):
File "/opt/conda/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 661, in parse
for r in i:
File "/opt/conda/lib/python3.7/site-packages/Bio/SeqIO/FastaIO.py", line 176, in FastaIterator
for title, sequence in SimpleFastaParser(handle):
File "/opt/conda/lib/python3.7/site-packages/Bio/SeqIO/FastaIO.py", line 45, in SimpleFastaParser
for line in handle:
File "/opt/conda/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
It would help to include support for gzip'd genome fasta files. For example:
with gzip.open("genome.fasta.gz", "rt") as handle:
for record in SeqIO.parse(handle, "fasta"):
print(record.id)
Hi, Thanks a lot for your valuable suggestions! I will update this function asap.
Hi, The function is updated now! You can take gzipped PE fastq data as input now. (One example command can be found below).
python StrainScan.py -i GCF_003812785_1.fq.gz -j GCF_003812785_2.fq.gz -d DB_Small -o Test_Sim/GCF_003812785
Just notice you also mentioned taking gzipped genome fasta files as input.
I will update this function asap.
Hi, I have added this new function. Now, StrainScan can also take gzipped fasta files as input to build the database.
The README docs are not clear about what reads can be used: paired-end (separate files for each in the pair), interleaved (1 files for all paired-end reads), or just read1.
It appears that
GCF_003812785.fq
includes both read pairs (only fastq header format), but the reads are not interleaved: all read2 sequences are just appended below read1.Lastly, does
StrainScan.py
accept compressed read files?Given that uncompressing, interleaving, or other operations for large read datasets is computationally expensive (and requires more code), it would be quite helpful if the user has flexibility on the input.