liaoherui / StrainScan

High-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers
https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-023-01615-w
MIT License
32 stars 4 forks source link

paired-end, interleaved, or just read1? #2

Closed nick-youngblut closed 1 year ago

nick-youngblut commented 1 year ago

The README docs are not clear about what reads can be used: paired-end (separate files for each in the pair), interleaved (1 files for all paired-end reads), or just read1.

It appears that GCF_003812785.fq includes both read pairs (only fastq header format), but the reads are not interleaved: all read2 sequences are just appended below read1.

Lastly, does StrainScan.py accept compressed read files?

Given that uncompressing, interleaving, or other operations for large read datasets is computationally expensive (and requires more code), it would be quite helpful if the user has flexibility on the input.

nick-youngblut commented 1 year ago

It appears that strainscan_build cannot handle compressed (gzip'ed) genome files. The error generated:

Dashing version: v0.4.7-2-g6ad3
  Traceback (most recent call last):
    File "/opt/conda/bin/strainscan_build", line 10, in <module>
      sys.exit(main())
    File "/opt/conda/lib/python3.7/site-packages/StrainScan/StrainScan_build.py", line 161, in main
      Build_tree.build_tree([cls_res+'/distance_matrix.txt',cls_res+'/hclsMap_95_recls.txt',out_dir+'/Tree_database',31,params])
    File "/opt/conda/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 275, in build_tree
      kmer_index = extract_kmers(fna_mapping[i.identifier], fna_path, ksize, kmer_index_dict, kmer_index, Lv, spec, tree_dir, alpha_ratio, i.identifier)
    File "/opt/conda/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 92, in extract_kmers
      for seq_record in SeqIO.parse(fna_path[j], "fasta"):
    File "/opt/conda/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 661, in parse
      for r in i:
    File "/opt/conda/lib/python3.7/site-packages/Bio/SeqIO/FastaIO.py", line 176, in FastaIterator
      for title, sequence in SimpleFastaParser(handle):
    File "/opt/conda/lib/python3.7/site-packages/Bio/SeqIO/FastaIO.py", line 45, in SimpleFastaParser
      for line in handle:
    File "/opt/conda/lib/python3.7/codecs.py", line 322, in decode
      (result, consumed) = self._buffer_decode(data, self.errors, final)
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

It would help to include support for gzip'd genome fasta files. For example:

with gzip.open("genome.fasta.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        print(record.id)
liaoherui commented 1 year ago

Hi, Thanks a lot for your valuable suggestions! I will update this function asap.

liaoherui commented 1 year ago

Hi, The function is updated now! You can take gzipped PE fastq data as input now. (One example command can be found below).

python StrainScan.py -i GCF_003812785_1.fq.gz -j GCF_003812785_2.fq.gz -d DB_Small -o Test_Sim/GCF_003812785

liaoherui commented 1 year ago

Just notice you also mentioned taking gzipped genome fasta files as input.

I will update this function asap.

liaoherui commented 1 year ago

Hi, I have added this new function. Now, StrainScan can also take gzipped fasta files as input to build the database.