Closed DamienWaits closed 6 years ago
Hi @DamienWaits, can you run the following for me so I can see what the problem might be?
samtools view -H reads.bam
and then also
grep '^>' genome.fasta
Perhaps there is a mistake in the code, but the BamHeader check is only checking the headers found in the BAM file against those in the genome.
Also, the log file should be telling you which headers are causing a problem. see output_dir/logfiles/funannotate-predict.log
Per the using bam file in funannotate. When you pass a BAM file it will trigger funannotate to use BRAKER1 to train Augustus and GeneMark-ET. This theoretically will lead to better ab initio gene predictions from those two tools. If you then also pass PASA models, then all of those prediction data are fed into Evidence Modeler (where PASA models are given a much higher weight than the ab initio predictions and the transcript/protein alignments).
Thanks for the quick reply!
Samtools returned lines of the form: @SQ SN:Contig0 LN:17053009
The headers in the genome are in this format:
Contig0
The log file is full of lines like this: ,@,S,Q, ,S,N,:,C,o,n,t,i,g,0, ,L,N,:,1,7,0,5,3,0,0,9,
The bam was generated using HISAT, if that helps.
I'll get you a link to some small test files in a few minutes? But just to clarify, you are running newest version (0.7.2)?
Great, thank you.
The version on the cluster I'm working on is 0.6.1.
Okay maybe that is the problem then, are you able to upgrade to newest version? I don't see a change in the code since then in relation to the bam parser/header test...but will be easier to perhaps trouble shoot with a smaller test version.
Sure thing!
Here is a small test case, should be able to run like this: https://osf.io/q35ax/download?version=2
$ samtools view -H genome5.bam
@HD VN:1.0 SO:coordinate
@SQ SN:Contig245 LN:53463
@SQ SN:Contig246 LN:56115
@SQ SN:Contig247 LN:34602
@SQ SN:Contig250 LN:54634
@SQ SN:Contig260 LN:63141
@SQ SN:Contig270 LN:73943
@PG ID:hisat2 PN:hisat2 VN:2.0.5 CL:"/usr/local/bin/../Cellar/hisat2/2.0.5/bin/hisat2-align-s --wrapper basic-0 -x genome5 -1 genome5_R1.fq -2 genome5_R2.fq"
$ grep '^>' genome5.fasta
>Contig245
>Contig246
>Contig247
>Contig250
>Contig260
>Contig270
>Contig280
>Contig352
$ funannotate predict -i genome5.fasta -s "Genome five" -o genome5_test \
--protein_evidence proteins.fa --transcript_evidence genome5.transcripts.fa \
--rna_bam genome5.bam --cpus 6
I went ahead and ran the test data through version 0.6.1 and got the same error. I will get our version updated and try the test again.
Okay, let me know if that fixes it or not. I have a feeling it may not, but best to have the latest version. Based on what you described in the log file: @,S,Q, ,S,N,:,C,o,n,t,i,g,0, ,L,N,:,1,7,0,5,3,0,0,9,
this suggests that the pybam parser is not returning a list of chromosome names as it does on my system.
If you upgrade and still doesn't work, then if you could replace the /path/to/funannotate/lib/library.py
with the attached version (note drop the .txt from the file name, only way I can attach to GitHub message). This version of the library script is testing if the bam parser is returning a list or not, if it is not it will error out and print to terminal what it did return -> perhaps that will tell us what is wrong.
library.py.txt
The update fixed the header problem. Thanks for taking the time to help me out.
Great, thanks for letting me know.
Hi nextgenusfs,
I'm getting the following error when I run funannotate predict: "Fasta headers in BAM file do not match genome, exiting." The command looks like this:
I've double and triple checked and all the headers in the bam file are present in the genome fasta file. Not all headers in the genome file are present in the bam file, but I assumed this would not be a problem.
Has anyone run into this issue? Also, does providing funannotate with a bam file help train augustus if I'm already using transcripts and a gff3 file that were created using the bam?
Thanks for your time.