nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
320 stars 84 forks source link

Bam headers do not match genome #107

Closed DamienWaits closed 6 years ago

DamienWaits commented 6 years ago

Hi nextgenusfs,

I'm getting the following error when I run funannotate predict: "Fasta headers in BAM file do not match genome, exiting." The command looks like this:

funannotate predict -i genome.fasta --rna_bam reads.bam -pasa_gff assemblies.transdecoder.gff3 \
   --transcript_evidence transcripts.fasta

I've double and triple checked and all the headers in the bam file are present in the genome fasta file. Not all headers in the genome file are present in the bam file, but I assumed this would not be a problem.

Has anyone run into this issue? Also, does providing funannotate with a bam file help train augustus if I'm already using transcripts and a gff3 file that were created using the bam?

Thanks for your time.

nextgenusfs commented 6 years ago

Hi @DamienWaits, can you run the following for me so I can see what the problem might be?

samtools view -H reads.bam

and then also

grep '^>' genome.fasta

Perhaps there is a mistake in the code, but the BamHeader check is only checking the headers found in the BAM file against those in the genome.

nextgenusfs commented 6 years ago

Also, the log file should be telling you which headers are causing a problem. see output_dir/logfiles/funannotate-predict.log

nextgenusfs commented 6 years ago

Per the using bam file in funannotate. When you pass a BAM file it will trigger funannotate to use BRAKER1 to train Augustus and GeneMark-ET. This theoretically will lead to better ab initio gene predictions from those two tools. If you then also pass PASA models, then all of those prediction data are fed into Evidence Modeler (where PASA models are given a much higher weight than the ab initio predictions and the transcript/protein alignments).

DamienWaits commented 6 years ago

Thanks for the quick reply!

Samtools returned lines of the form: @SQ SN:Contig0 LN:17053009

The headers in the genome are in this format:

Contig0

The log file is full of lines like this: ,@,S,Q, ,S,N,:,C,o,n,t,i,g,0, ,L,N,:,1,7,0,5,3,0,0,9,

The bam was generated using HISAT, if that helps.

nextgenusfs commented 6 years ago

I'll get you a link to some small test files in a few minutes? But just to clarify, you are running newest version (0.7.2)?

DamienWaits commented 6 years ago

Great, thank you.

The version on the cluster I'm working on is 0.6.1.

nextgenusfs commented 6 years ago

Okay maybe that is the problem then, are you able to upgrade to newest version? I don't see a change in the code since then in relation to the bam parser/header test...but will be easier to perhaps trouble shoot with a smaller test version.

DamienWaits commented 6 years ago

Sure thing!

nextgenusfs commented 6 years ago

Here is a small test case, should be able to run like this: https://osf.io/q35ax/download?version=2

$ samtools view -H genome5.bam
@HD VN:1.0  SO:coordinate
@SQ SN:Contig245    LN:53463
@SQ SN:Contig246    LN:56115
@SQ SN:Contig247    LN:34602
@SQ SN:Contig250    LN:54634
@SQ SN:Contig260    LN:63141
@SQ SN:Contig270    LN:73943
@PG ID:hisat2   PN:hisat2   VN:2.0.5    CL:"/usr/local/bin/../Cellar/hisat2/2.0.5/bin/hisat2-align-s --wrapper basic-0 -x genome5 -1 genome5_R1.fq -2 genome5_R2.fq"

$ grep '^>' genome5.fasta 
>Contig245  
>Contig246  
>Contig247  
>Contig250  
>Contig260  
>Contig270  
>Contig280  
>Contig352  

$ funannotate predict -i genome5.fasta -s "Genome five" -o genome5_test \
   --protein_evidence proteins.fa --transcript_evidence genome5.transcripts.fa \
   --rna_bam genome5.bam --cpus 6
DamienWaits commented 6 years ago

I went ahead and ran the test data through version 0.6.1 and got the same error. I will get our version updated and try the test again.

nextgenusfs commented 6 years ago

Okay, let me know if that fixes it or not. I have a feeling it may not, but best to have the latest version. Based on what you described in the log file: @,S,Q, ,S,N,:,C,o,n,t,i,g,0, ,L,N,:,1,7,0,5,3,0,0,9, this suggests that the pybam parser is not returning a list of chromosome names as it does on my system.

nextgenusfs commented 6 years ago

If you upgrade and still doesn't work, then if you could replace the /path/to/funannotate/lib/library.py with the attached version (note drop the .txt from the file name, only way I can attach to GitHub message). This version of the library script is testing if the bam parser is returning a list or not, if it is not it will error out and print to terminal what it did return -> perhaps that will tell us what is wrong. library.py.txt

DamienWaits commented 6 years ago

The update fixed the header problem. Thanks for taking the time to help me out.

nextgenusfs commented 6 years ago

Great, thanks for letting me know.