smithlabcode / falco

A C++ drop-in replacement of FastQC to assess the quality of sequence read data
https://falco.readthedocs.io
GNU General Public License v3.0

Error when processing bam file: No known encoding with chars < 33. Yours was 9) #34

Closed msnyder424 closed 2 years ago

msnyder424 commented 2 years ago

I get the error, "No known encoding with chars < 33. Yours was 9)" when I try to process a bam file with falco. Here is the call and stdout:

$ falco ${BAM_FILE} -o falco_fastqc/
[limits]        using default limit cutoffs (no file specified)
[adapters]      using default adapters (no file specified)
[contaminants]  using default contaminant list (no file specified)
[Fri May 13 08:09:56 2022] Started reading file DS-376333.hg19.bam
[Fri May 13 08:09:56 2022] reading file as bam format
[Fri May 13 08:10:00 2022] Processed 1M reads
[Fri May 13 08:10:05 2022] Processed 2M reads
[Fri May 13 08:10:10 2022] Processed 3M reads
[Fri May 13 08:10:15 2022] Processed 4M reads
[Fri May 13 08:10:19 2022] Processed 5M reads
[Fri May 13 08:10:21 2022] Finished reading file
[Fri May 13 08:10:21 2022] Writing text report to falco_fastqc//fastqc_data.txt
[Fri May 13 08:10:21 2022] Writing HTML report to falco_fastqc//fastqc_report.html
No known encoding with chars < 33. Yours was 9)

This is similar to issue #24 but not the same.

The 9 must be referring to the ASCII quality scores. 9 is a TAB (\t).

samtools view DS-376333.hg19.bam | grep -P "\t" | wc -l shows me that every line in ${BAM_FILE} has a \t in it, which makes sense because BAMs are "tab delimited". So I'm not sure how to even find the offending \t that I imagine must be at the beginning, end, or middle of the quality scores.
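One way to narrow the search is to look only at the QUAL column (field 11 of a SAM record) rather than at whole lines. The sketch below is a minimal helper, not part of falco; the function name min_qual_code is made up for illustration. It reads SAM text on stdin and prints the smallest ASCII code found in the QUAL column. It assumes plain ASCII input and an awk that accepts -F'\t'.

```shell
# min_qual_code: read SAM records on stdin and print the smallest ASCII code
# seen in the QUAL column (field 11). Header lines starting with '@' are skipped.
# Prints 128 if no record with at least 11 fields was read.
min_qual_code() {
  awk -F'\t' '
    BEGIN {
      # Build a character -> ASCII code lookup table for codes 1..127.
      for (i = 1; i < 128; i++) ord[sprintf("%c", i)] = i
      min = 128
    }
    /^@/ { next }
    NF >= 11 {
      n = length($11)
      for (j = 1; j <= n; j++) {
        c = ord[substr($11, j, 1)]
        if (c < min) min = c
      }
    }
    END { print min }
  '
}
```

Usage would look like samtools view "${BAM_FILE}" | min_qual_code. Because the fields are split on tabs, a well-formed record cannot have a tab inside field 11, so a result of 33 or higher suggests the file itself is fine and the tab is being introduced by the reader.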

However, all my BAMs were created with the GATK best practices pipeline, so I don't see how they could be poorly formatted. Additionally, fastqc is able to process them without error, albeit very slowly.

Thanks for any help!

msnyder424 commented 2 years ago

Additional info: I originally noticed this issue when running in a Docker image I pulled from quay: docker pull quay.io/biocontainers/falco:0.3.0--h5aa19ff_1

I get the same error when running falco installed with conda:

conda install -c bioconda falco
conda update falco # due to out of date version in conda. update installs 0.3.0

For the life of me I cannot get falco to read bam files when I build from the source code downloaded from the releases or from a git clone of the repo. I get this error every time: Cannot recognize file format for file /test_bam/DS-376302.hg19.bam

htslib is installed and I ran the below after installing:

make HAVE_HTSLIB=1 all
make HAVE_HTSLIB=1 install

It happens with multiple different bam files I use as input.

guilhermesena1 commented 2 years ago

Hello,

That's weird indeed. The "cannot recognize file format" error should only occur when falco is compiled without HTSLib. The conda recipe has HTSLib as a dependency and the compile instructions should be done with HTSLib. If there is a problem with a path to HTSLib the compilation should fail, and if it doesn't I need to look into why that's happening.

Would you be able to provide a small BAM file in which I could try to reproduce the character encoding bug? At least from my source compile it seems to be working with BAM files but I can imagine there may be some problem with tabs being added to the QUAL string. Thank you!

guilhermesena1 commented 2 years ago

Oh, one additional thing that may explain the problem in your first comment. The command

falco ${BAM_FILE} -o falco_fastqc

should be

falco -o falco_fastqc ${BAM_FILE}

The input file should be the last argument, after the -o option and the output directory; otherwise falco may interpret "-o" and "falco_fastqc" as other input files to process. Not sure if this changes the outcome. In fact I'm surprised the command works as expected.

msnyder424 commented 2 years ago

Thanks for the help!

I agree that the command you suggested actually follows the usage in the falco help menu. Mine was a holdover from an old WDL task that ran fastqc. But alas, changing the command did not do the trick.

I should not have muddied the waters talking about the "Cannot recognize file format" error. That only happens when I try to use falco installed from a repo clone or source code zip. I believe the Docker image uses an instance installed with conda. When I execute with that Docker or an instance installed with conda, the program recognizes the file format. We can keep the discussion to only the issue in the first comment.

Happy to send a small BAM for you to investigate. Can't seem to attach a zip to this comment. Where should I send it?

guilhermesena1 commented 2 years ago

You can send it to desenabr[at]usc[dot]edu and I can look into it.

guilhermesena1 commented 2 years ago

Thanks for sharing the file! I think I see the problem, and indeed it reflects a bigger issue with falco. The \t was because falco was reading the optional SAM/BAM tags after the quality line as part of the quality scores.
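To make the failure mode concrete, here is a small text-level sketch. It is not falco's actual code (falco reads BAM through HTSLib, not through cut); the record and both function names are hypothetical. It contrasts taking column 11 alone with taking everything from column 11 to the end of the line, which drags the optional tags and their separating tabs (ASCII 9) into the "quality" string.

```shell
# Hypothetical SAM record: 11 mandatory columns followed by two optional tags.
record=$(printf 'r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tRG:Z:grp1')

# Faulty reading (illustrative only): column 11 through the end of the line,
# so the tab-separated optional tags are included.
qual_with_tags() { printf '%s\n' "$1" | cut -f11-; }

# Intended reading: column 11 only.
qual_only() { printf '%s\n' "$1" | cut -f11; }
```

Here qual_only "$record" yields just the four quality characters, while qual_with_tags "$record" also returns the tags with tabs between them, which is the kind of input that would trigger a "chars < 33" complaint.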

I did a bunch of rewriting on the SAM/BAM processing functions to address these issues. Would you be able to pull from the falco repo and re-test if it runs to completion on your BAM files? Thank you so much!!

msnyder424 commented 2 years ago

Thanks for the fix!

Here is what I ran to install from a clone:

git clone https://github.com/smithlabcode/falco.git
cd falco
sudo make all
sudo make install
sudo make HAVE_HTSLIB=1 all
sudo make HAVE_HTSLIB=1 install

When I clone the repo there is no configure, so I cannot run this part of the instructions: ./configure CXXFLAGS="-O3 -Wall" --enable-hts. I have htslib installed, but I get this error every time I run falco with a bam file: "Cannot recognize file format for file /home/dnanexus/DS-376333.hg19.bam".

I tried to create configure by running aclocal, autoconf, and automake --add-missing, but the last command throws this error: "configure.ac:21: error: required file 'config.h.in' not found"

Not sure if I'm doing the right things here...

msnyder424 commented 2 years ago

OK. After fighting with my htslib installation, I finally got it to work. Looks like the update worked! THANK YOU!

In the end I got it to work with:

git clone https://github.com/smithlabcode/falco.git
cd falco
sudo make HAVE_HTSLIB=1 all
sudo make HAVE_HTSLIB=1 install

But then I got this error from falco: "bin/falco: error while loading shared libraries: libhts.so.3: cannot open shared object file: No such file or directory". This missing dependency was in /usr/local/lib/libhts.so.3.

As a hack, I just linked all the libs in that dir to /usr/lib/.

I don't think that was the best way around that, but I'm not sure what else to do. Is it possible falco can and should be configured to look for that dependency in multiple locations?
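A less invasive alternative to linking files into /usr/lib/ is to tell the dynamic linker where to look. The sketch below is a general Linux approach, not something falco documents; the helper name prepend_lib_dir is made up, and /usr/local/lib is assumed only because that is where libhts.so.3 was reported to be.

```shell
# prepend_lib_dir DIR: print a library search path with DIR first, followed by
# any entries already present in LD_LIBRARY_PATH.
prepend_lib_dir() {
  printf '%s%s\n' "$1" "${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}

# Assumption: libhts.so.3 lives in /usr/local/lib, as reported above.
LD_LIBRARY_PATH=$(prepend_lib_dir /usr/local/lib)
export LD_LIBRARY_PATH
```

This only affects the current shell session. On glibc-based systems a system-wide option is to list the directory in a file under /etc/ld.so.conf.d/ and then run ldconfig as root so the linker cache is refreshed.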

I installed htslib like so:

HTS_LIB_VERSION=1.15.1
wget https://github.com/samtools/htslib/releases/download/${HTS_LIB_VERSION}/htslib-${HTS_LIB_VERSION}.tar.bz2
tar xf htslib-${HTS_LIB_VERSION}.tar.bz2
cd htslib-${HTS_LIB_VERSION}
./configure
sudo make
sudo make install

msnyder424 commented 2 years ago

And I figured out the fix for the htslib install: run ./configure --prefix=/usr/ so that the library is installed under /usr/ instead of /usr/local/.

Case closed. Good to release I think!

Thanks for all your work! Any idea when we can expect a release?

guilhermesena1 commented 2 years ago

Glad to know it's working!

Just for the record, if you clone from repo then these two commands should suffice

make HAVE_HTSLIB=1 all
make HAVE_HTSLIB=1 install

You don't need autotools to compile. I find it strange that the program compiles successfully but doesn't find HTSLib. If the -DUSE_HTS flag is present at compilation, it should either fail to compile if htslib is not on your $LIBRARY_PATH, or compile successfully and identify BAM files. The behavior you describe is definitely puzzling, but we can discuss this in another issue.
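A quick way to check whether a given falco binary was built against HTSLib is to inspect its dynamic dependencies. The helper below is a rough sketch rather than an official diagnostic: the name links_htslib is hypothetical, it relies on ldd being available (Linux), and it will report nothing if htslib was linked statically.

```shell
# links_htslib BINARY: succeed if ldd lists a libhts shared object for BINARY.
# Fails if ldd is unavailable, if BINARY is not a dynamic executable, or if
# no libhts entry appears in its dependency list.
links_htslib() {
  ldd "$1" 2>/dev/null | grep -q 'libhts'
}
```

For example: links_htslib "$(command -v falco)" && echo "htslib linked" || echo "no dynamic htslib dependency found". If a source build reports no htslib dependency, that would be consistent with the "Cannot recognize file format" message described earlier in the thread.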

Is it ok to close this?

msnyder424 commented 2 years ago

I’ll close it! Any timeline on a new release?