smirarab / sepp

Ensemble of HMM methods (SEPP, TIPP, UPP)
GNU General Public License v3.0

BLAST query error: CFastaReader #105

Closed: nick-youngblut closed this issue 3 years ago

nick-youngblut commented 3 years ago

I'm running run_abundance.py in tipp2, and the tipp2 tutorial states:

The input fragment files must be in FASTA or FASTQ formats with the following extensions: .fastq or .fq for FASTQ files; .fasta, .fas, .fa, .fna for FASTA files. The output will be tab delimited files that estimate the abundance at a given taxonomic level.

However, if I run run_abundance.py with my read file in FASTQ format (reads.fq or reads.fastq), I get the following:

[21:16:22] config.py (line 370):     INFO: Seed number: 297834
Blasting fragments against marker dataset

/ebio/abt3_projects/software/dev/miniconda3_dev/envs/tipp2/bin/blastn -db /ebio/abt3_projects/databases_no-backup/SEPP/tipp/tipp2-refpkg/markers-v1/blast/alignment.fasta.db -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send slen evalue bitscore" -query read1.fq -out /ebio/abt3_projects/software/dev/ll_pipelines/llmgp/tmp/tipp2/sepp/tmp/tipp2_tmp/tmpdxw0yaek/blast.out -num_threads 8
BLAST query error: CFastaReader: Near line 1, there's a line that doesn't look like plausible data, but it's not marked as defline or comment.
Unable to bin any fragments!

If I convert those reads to a fasta (reads.fas), run_abundance.py completes successfully.

It appears that run_abundance.py can't actually use a fastq.

It would be nice if TIPP2 could read in gzip'ed files, given that many/most users keep reads compressed.
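For anyone hitting the same error, the conversion workaround described above can be sketched in a few lines of Python. This is only a minimal sketch, not part of TIPP: the function names `open_text` and `fastq_to_fasta` are hypothetical, it assumes plain four-line FASTQ records (as in typical Illumina files, with no wrapped sequences), and it keeps only the first whitespace-delimited token of each header as the read name.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed file for reading as text."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fastq_to_fasta(fastq_path, fasta_path):
    """Convert four-line FASTQ records to FASTA; return the record count.

    Assumption: each record is exactly four lines (header, sequence,
    '+' separator, qualities).
    """
    count = 0
    with open_text(fastq_path) as src, open(fasta_path, "w") as dst:
        while True:
            header = src.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate blank lines between records
            if not header.startswith("@"):
                raise ValueError(f"unexpected FASTQ header: {header!r}")
            fields = header[1:].split()
            if not fields:
                raise ValueError("FASTQ header has no read name")
            seq = src.readline().strip()
            src.readline()  # '+' separator line, ignored
            src.readline()  # quality line, ignored
            # Only the first token is kept, so the read name has no spaces
            dst.write(f">{fields[0]}\n{seq}\n")
            count += 1
    return count
```

Calling `fastq_to_fasta("reads.fq.gz", "reads.fas")` would write an uncompressed FASTA file that can then be given to run_abundance.py in place of the FASTQ file. Note that truncating headers at the first space can produce duplicate names if the distinguishing part of a header comes after the space.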

ekmolloy commented 3 years ago

Hello, it sounds like there is one issue and one feature request here. The issue is that TIPP isn't working on FASTQ-formatted files (throws a BLAST error). The feature request is that TIPP accept gzip'ed files (either FASTA or FASTQ) as input. Is this correct? For the issue, could you please send a small example file? Thank you!

nick-youngblut commented 3 years ago

The bug is that the TIPP documentation states that it can accept fastq files as input, but it appears that it can't. Any Illumina fastq file from the ENA or SRA could be used for testing that.

The other issue is that TIPP seems to require a specific naming of the sequence headers, but TIPP does not edit the header IDs, and instead just throws errors if the headers are formatted incorrectly. The fastq file(s) used for Issue 1 can be used for this 2nd issue.

ekmolloy commented 3 years ago

As you note, this is a follow-up on the issue about what file formats are allowed as input. We now clarify in the tutorial that "The input fragment files must be in a format accepted by BLAST (i.e. a decompressed FASTA file with no spaces in the read names)."
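To illustrate the "no spaces in the read names" part of that requirement, here is a rough sketch of how the headers of an existing FASTA file could be cleaned before running TIPP. It is an illustration only: `clean_fasta_lines` is a hypothetical helper, not something TIPP provides, and replacing whitespace with underscores (rather than truncating the name) is just one possible choice that keeps the full header text.

```python
import re


def clean_fasta_lines(lines):
    """Yield FASTA lines with whitespace runs in header lines replaced by '_'.

    Sequence lines are passed through unchanged; empty lines are dropped.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Collapse each run of spaces or tabs in the read name to one '_'
            name = re.sub(r"\s+", "_", line[1:].strip())
            yield ">" + name
        elif line:
            yield line


if __name__ == "__main__":
    example = [">read1 length=4\n", "ACGT\n", "\n", ">read2\n", "GGCC\n"]
    for out_line in clean_fasta_lines(example):
        print(out_line)
```

Because the function works on any iterable of lines, it can be fed an open file handle and the results written to a new, uncompressed FASTA file.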