nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

issue with headers? #59

Closed el42008 closed 7 years ago

el42008 commented 7 years ago

Hi,

I ran into this error and wonder if you know what the problem might be. I think it has something to do with the headers, because they might be longer than 16 characters. Is there any way to solve this without having to generate new BAM files?

Traceback (most recent call last):
  File "/mnt/apps/funannotate/bin/funannotate-predict.py", line 223, in <module>
    if not lib.BamHeaderTest(args.input, args.rna_bam):
  File "/mnt/apps/funannotate/lib/library.py", line 457, in BamHeaderTest
    bam_file = pybam.bgunzip(bamin)
  File "/mnt/apps/funannotate/lib/pybam.py", line 88, in __init__
    self.header_text = struct.unpack(str(length_of_header)+'s',first_chunk[8:8+length_of_header])[0]
struct.error: unpack requires a string argument of length 4908302

Thanks a lot

Elena

nextgenusfs commented 7 years ago

For NCBI submission, the system imposes a FASTA header limit of 16 characters. I did not impose the same strict requirement in funannotate, although it is supposed to warn you if your headers are longer than 16 characters. The idea is that if you are going to submit to NCBI, which you should, you will have to change the headers anyway, so you might as well do it before annotation to make things easier later in the process. The BAM header test here is trying to determine whether the FASTA headers used for aligning the RNA-seq data match those that you passed to funannotate predict. Having said that, I have not seen this error before, but I have also not tried passing it longer headers. What do your FASTA headers look like?

nextgenusfs commented 7 years ago

Actually, I just looked at the code: funannotate currently imposes a maximum FASTA header length of 16 characters, so the script should have raised an error if your FASTA headers are longer than that.
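
To see which headers would trip a 16-character limit before running the pipeline, something like the following could be used. This is a minimal Python 3 sketch, not funannotate's own check; the function name headers_longer_than is hypothetical, and it assumes that only the first whitespace-delimited word after '>' counts as the header.

```python
def headers_longer_than(fasta_lines, limit=16):
    """Return sequence names (first word after '>') longer than `limit` characters."""
    too_long = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if len(name) > limit:
                too_long.append(name)
    return too_long


if __name__ == "__main__":
    # Illustrative records only; in practice pass an open FASTA file handle.
    example = [">Super_Scaffold_320", "ACGTACGT", ">SC_53", "TTGACA"]
    print(headers_longer_than(example))
```

An open file handle can be passed directly, since the function only iterates over lines.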

el42008 commented 7 years ago

Hi Jon,

The longest headers in my assembly look like Super_Scaffold_320, for example, so I do have headers longer than 16 characters.

Thanks

nextgenusfs commented 7 years ago

I don't know if this is a result of a BAM format problem or the header being too long. The PyBam code that I'm using doesn't seem to suggest that it has a length requirement of 16. What do your BAM headers look like, i.e.

samtools view -H yourbam.bam

el42008 commented 7 years ago

Hi Jon,

I changed the long headers to shorter ones:

samtools view -h ALLbam.bam | sed "s/Super-Scaffold/SC_/" > ALL_bam_new_headers.bam

samtools view -H ALL_bam_new_headers_final.bam | head
@HD VN:1.0 SO:coordinate
@SQ SN:SC_53 LN:6369819
@SQ SN:SC_71 LN:8121668
@SQ SN:SC_102 LN:1680451
@SQ SN:SC_104 LN:6050867
@SQ SN:SC_123 LN:3002583
@SQ SN:SC_136 LN:3924724
@SQ SN:SC_140 LN:977997
@SQ SN:SC_156 LN:947802
@SQ SN:SC_170 LN:3205475

But I still get this error:

Traceback (most recent call last):
  File "/mnt/apps/funannotate/bin/funannotate-predict.py", line 223, in <module>
    if not lib.BamHeaderTest(args.input, args.rna_bam):
  File "/mnt/apps/funannotate/lib/library.py", line 457, in BamHeaderTest
    bam_file = pybam.bgunzip(bamin)
  File "/mnt/apps/funannotate/lib/pybam.py", line 88, in __init__
    self.header_text = struct.unpack(str(length_of_header)+'s',first_chunk[8:8+length_of_header])[0]
struct.error: unpack requires a string argument of length 4906118

nextgenusfs commented 7 years ago

Interesting, there must be something else wrong with the BAM file. I will have to ask the pyBAM developer what it could be. Open issue here: https://github.com/JohnLonginotto/pybam/issues/2

nextgenusfs commented 7 years ago

How many scaffolds/contigs do you have in your assembly? Follow link above to see why I'm asking

el42008 commented 7 years ago

Hi Jon,

Thank you very much for having a look at this.

grep ">" illumina_new_headers.fasta | wc -l
224092

el42008 commented 7 years ago

Hi Jon,

If this helps:

samtools view -H ALL_bam_new_headers_final.bam | wc -c
4906118

nextgenusfs commented 7 years ago

Wow! That is a lot of scaffolds! Must be a really large genome? Otherwise I would consider dropping contigs shorter than 1 kb. You could probably extend that up to 10 kb if you wanted, as you won't be able to predict any coding regions on small contigs anyway.
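
Dropping short contigs could look roughly like this. It is a minimal Python 3 sketch rather than anything shipped with funannotate; read_fasta and drop_short_contigs are hypothetical helper names, and the 1000 bp default simply mirrors the 1 kb suggestion above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def drop_short_contigs(lines, min_len=1000):
    """Keep only records whose sequence is at least `min_len` bases long."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_len]


if __name__ == "__main__":
    # Illustrative records: 'a' is 1200 bp and kept, 'b' is 240 bp and dropped.
    records = [">a", "ACGT" * 300, ">b", "ACGT" * 50, "ACGT" * 10]
    for header, seq in drop_short_contigs(records):
        print(">" + header, len(seq))
```

Any RNA-seq BAM aligned against the unfiltered assembly would still list the removed contigs in its header, so it would probably need to be regenerated or re-headered afterwards.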

So what is likely happening is that pybam does a "quick" initial scan of the BAM file to parse the headers; however, because of the large number of headers that you have, it is not reading enough data to hold them all. So the quick fix is to increase the amount that pybam reads. I can help you do that manually in the pybam.py file if you want to try that first, since it will take a few days before I push another update.

You need to navigate to the funannotate directory and then to the 'lib' folder. You will then need to edit line 142 of the pybam.py file:

142: data = self.file_handle.read(655360)

#change to
142: data = self.file_handle.read(5000000)

You should just be able to save your changes and then give funannotate another try with the same command.
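
As a rough illustration of why the read size matters: the struct.error in the traceback is what struct.unpack raises when it is handed fewer bytes than the format string asks for. The sketch below is Python 3 and works on a synthetic, uncompressed header (real BAM files are BGZF-compressed, and this is not pybam's actual code); build_fake_header, parse_header_text and try_parse are hypothetical names, and the two read sizes are the values quoted above.

```python
import struct

READ_SIZE_ORIGINAL = 655360   # read size quoted for line 142
READ_SIZE_PATCHED = 5000000   # suggested replacement value


def build_fake_header(text_len):
    """Build magic + 32-bit length + header text, uncompressed (illustration only)."""
    return struct.pack("<4si", b"BAM\x01", text_len) + b"@" * text_len


def parse_header_text(first_chunk):
    """Unpack the header text; fails if the chunk is shorter than the declared length."""
    _magic, length_of_header = struct.unpack("<4si", first_chunk[:8])
    fmt = str(length_of_header) + "s"
    return struct.unpack(fmt, first_chunk[8:8 + length_of_header])[0]


def try_parse(data, read_size):
    """Return the parsed header length, or None if the chunk was too short."""
    try:
        return len(parse_header_text(data[:read_size]))
    except struct.error:
        return None


if __name__ == "__main__":
    data = build_fake_header(4906118)  # header size reported by wc -c in the thread
    print(try_parse(data, READ_SIZE_ORIGINAL))
    print(try_parse(data, READ_SIZE_PATCHED))
```

With a header of about 4.9 MB, the first chunk has to cover at least that many bytes plus the 8-byte prefix, which is why a value of 5000000 was suggested here.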

el42008 commented 7 years ago

Thanks!! I will try that, not a problem!

I will let you know how it goes. Yes, it is a big genome (potato), around 850 Mb, but there are larger ones ;-). And you are right, I have many small scaffolds of around 200 bp that I should remove.

Cheers

nextgenusfs commented 7 years ago

Okay. Just remember that headers in BAM must match the genome FASTA headers you pass to funannotate - this is so that BRAKER1 can properly train Augustus/GeneMark. Good luck!
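
A quick way to compare the two sets of names outside the pipeline might look like this. It is a minimal Python 3 sketch under stated assumptions: it takes the text printed by samtools view -H (tab-separated @SQ lines) and the FASTA lines, and it is not funannotate's BamHeaderTest; sq_names, fasta_names and headers_match are hypothetical names.

```python
def sq_names(sam_header_text):
    """Extract reference names from the SN: field of each @SQ header line."""
    names = []
    for line in sam_header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def fasta_names(fasta_lines):
    """Extract the first word of every FASTA header line."""
    return [line[1:].split()[0] for line in fasta_lines
            if line.startswith(">") and line[1:].split()]


def headers_match(sam_header_text, fasta_lines):
    """True when the BAM references and the FASTA records have the same names."""
    return set(sq_names(sam_header_text)) == set(fasta_names(fasta_lines))


if __name__ == "__main__":
    header = "@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:SC_53\tLN:6369819\n"
    print(headers_match(header, [">SC_53", "ACGT"]))
```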

el42008 commented 7 years ago

Yes, that's a good point 😊 I took that into account as well: I changed the headers in both the FASTA file and the BAM file, and I still have the previous BAM file from before the header change, so I can check whether the length is an issue or not.

el42008 commented 7 years ago

Hi Jon,

Sorry for contacting you again! I have changed the setting you told me about in pybam.py, but now I get this error:

Traceback (most recent call last):
  File "/mnt/apps/funannotate/bin/funannotate-predict.py", line 223, in <module>
    if not lib.BamHeaderTest(args.input, args.rna_bam):
  File "/mnt/apps/funannotate/lib/library.py", line 457, in BamHeaderTest
    bam_file = pybam.bgunzip(bamin)
  File "/mnt/apps/funannotate/lib/pybam.py", line 92, in __init__
    l_name = struct.unpack('

nextgenusfs commented 7 years ago

Hi Elena, Sorry about this. I've not seen this error before either, but the pybam developer was going to do an update shortly to get this sorted out. You can bypass this check completely by commenting out the following lines of the code in funannotate-predict.py. Move to the funannotate install folder, then to the bin directory. You will need to then edit the script called funannotate-predict.py. If you comment out lines 222 - 225 (which is the bam header check), then the script will just bypass this check and move on. I will have this fixed in the next release.

el42008 commented 7 years ago

It seems it moved on to masking repeats!!

[03:55:27 PM]: OS: linux2, 32 cores, ~ 264 GB RAM. Python: 2.7.13
[03:55:27 PM]: Running funannotate v0.6.0
[03:55:31 PM]: AUGUSTUS (3.2.3) detected, version seems to be compatible with BRAKER1 and BUSCO
[03:55:37 PM]: Loading sequences and soft-masking genome
[03:55:37 PM]: Soft-masking: building RepeatModeler database
[03:56:14 PM]: Soft-masking: generating repeat library using RepeatModeler

Thanks, I will keep you posted on how it performs for my assembly!

nextgenusfs commented 7 years ago

Great, yes, I'm interested to see how it does as well. As the name and docs suggest, I work primarily with fungi, so a large potato genome will be a good test of whether the scripts are compatible with any eukaryote. You may also want to adjust a few other settings if you have not done so already: I would set --max_intronlen to something higher (perhaps 10000 is sufficient) and also switch the --organism flag to other, i.e. --max_intronlen 10000 --organism other

el42008 commented 7 years ago

I have covered that, cool!

funannotate.py predict -i illumina_new_headers.fasta -s potato -o fun_out --rna_bam ALL_bam_new_headers_final.bam --transcript_evidence mikado.loci.final.fasta --protein_evidence Mikado/potato_tomato_prots.fasta --ploidy 2 --optimize_augustus --cpus 1 --organism other --max_intronlen 10000

Thank you

nextgenusfs commented 7 years ago

only 1 cpu? Will take a long time. Seems like you have 32 cores?

el42008 commented 7 years ago

I changed that! Sorry, I did that when it was crashing 😊

el42008 commented 7 years ago

Hi Jon,

I had another error during repeat masking:

[07:20:16 PM]: OS: linux2, 32 cores, ~ 264 GB RAM. Python: 2.7.13
[07:20:16 PM]: Running funannotate v0.6.0
[07:20:21 PM]: AUGUSTUS (3.2.3) detected, version seems to be compatible with BRAKER1 and BUSCO
[07:20:30 PM]: Loading sequences and soft-masking genome
[07:24:25 PM]: Soft-masking: building RepeatModeler database
[07:25:06 PM]: Soft-masking: generating repeat library using RepeatModeler
[10:38:30 AM]: Soft-masking: running RepeatMasker with custom library
Traceback (most recent call last):
  File "/mnt/apps/funannotate/bin/funannotate-predict.py", line 258, in <module>
    lib.RepeatModelMask(Genome, args.cpus, os.path.join(args.out, 'predict_misc'), MaskGenome, debug)
  File "/mnt/apps/funannotate/lib/library.py", line 1018, in RepeatModelMask
    runSubprocess2(cmd, outdir2, log, rm_gff3)
  File "/mnt/apps/funannotate/lib/library.py", line 83, in runSubprocess2
    proc = subprocess.Popen(cmd, cwd=dir, stdout=out, stderr=subprocess.PIPE)
  File "/mnt/apps/python/2.7/lib/python2.7/subprocess.py", line 390, in __init__
    errread, errwrite)
  File "/mnt/apps/python/2.7/lib/python2.7/subprocess.py", line 1024, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied

Any idea?

Thanks a lot

Elena

nextgenusfs commented 7 years ago

There is a log file in the logfiles folder for repeat masking. Any clues in there as to when it crashed? It looks like RepeatModeler finished, but RepeatMasker did not. Although I've not seen a permissions error here before, seems strange. My guess is that it may not be a permission problem. Can you see a file in predict_misc/repeatmodeler.lib?

nextgenusfs commented 7 years ago

Actually it looks like it may be the last script of that function which converts the RepeatMasker result into GFF3 format. Find the path of the RepeatMasker script.

which rmOutToGFF3.pl

What I think could be happening is that this RepeatMasker script is not executable. So you could now check by looking in that directory from above via:

ls -l /path/to/RepeatMasker/util/

If it is executable, the permission string in column 1 should contain x (for example -rwxr-xr-x), i.e.

 ls -l /usr/local/Cellar/repeatmasker/4.0.5/libexec/util/
total 384
-rwxr-xr-x  1 jon  admin   4.6K Nov  6  2013 buildRMLibFromEMBL.pl*
-rwxr-xr-x  1 jon  admin    22K Jan 31  2014 buildSummary.pl*
-rwxr-xr-x  1 jon  admin    12K Feb 19  2015 calcDivergenceFromAlign.pl*
-rwxr-xr-x  1 jon  admin    14K Feb 19  2015 createRepeatLandscape.pl*
-rwxr-xr-x  1 jon  admin    38K Nov  6  2013 dupliconToSVG.pl*
-rwxr-xr-x  1 jon  admin   8.6K Nov  6  2013 getRepeatMaskerBatch.pl*
-rwxr-xr-x  1 jon  admin    15K Mar 10  2016 queryRepeatDatabase.pl*
-rwxr-xr-x  1 jon  admin   4.4K Mar 10  2016 queryTaxonomyDatabase.pl*
-rwxr-xr-x  1 jon  admin   4.3K Feb 19  2015 rmOut2Fasta.pl*
-rwxr-xr-x  2 jon  admin   3.7K Mar 10  2016 rmOutToGFF3.pl*
-rwxr-xr-x  1 jon  admin    19K Mar 10  2016 rmToUCSCTables.pl*
-rwxr-xr-x  1 jon  admin    11K Feb 19  2015 trfMask*
-rwxr-xr-x  1 jon  admin   7.7K Feb 19  2015 wublastToCrossmatch.pl*

If it is not executable, you can change that like:

sudo chmod +x /path/to/RepeatMasker/util/rmOutToGFF3.pl
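
The same check can be scripted. The following is a minimal Python 3 sketch, assuming a POSIX-style PATH; tool_status is a hypothetical helper name. It walks the PATH directories itself, so that a script which exists but lacks the execute bit is reported as such instead of simply as missing.

```python
import os


def tool_status(name, search_path=None):
    """Look for `name` in each PATH directory and report whether it can be executed."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        candidate = os.path.join(directory or ".", name)
        if os.path.isfile(candidate):
            if os.access(candidate, os.X_OK):
                return candidate, "executable"
            return candidate, "found but not executable"
    return None, "not found"


if __name__ == "__main__":
    print(tool_status("rmOutToGFF3.pl"))
```
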
el42008 commented 7 years ago

Hi Jon,

I have checked that on my cluster and it seems fine:

[el42208@gruffalo util]$ ls -lht
total 96K
-rwxr-xr-x 1 root root 4.7K Jun 4 2009 buildRMLibFromEMBL.pl
-rwxr-xr-x 1 root root 5.2K Jun 4 2009 calcDivergenceFromAlign.pl
-rwxr-xr-x 1 root root 38K Jun 4 2009 dupliconToSVG.pl
-rwxr-xr-x 1 root root 15K Jun 4 2009 queryRepeatDatabase.pl
-rwxr-xr-x 1 root root 4.4K Jun 4 2009 queryTaxonomyDatabase.pl
-rwxr-xr-x 1 root root 3.8K Jun 4 2009 rmOutToGFF3.pl
-rwxr-xr-x 1 root root 9.9K May 17 2007 updateLineHash.pl

nextgenusfs commented 7 years ago

Okay. So that apparently isn't it, how about the repeatmodeler.lib file, was that created successfully?

nextgenusfs commented 7 years ago

Have a look in the logfiles and see if you can re-issue the command that failed outside of funannotate. Sometimes the error messages don't get parsed correctly into the log file.

el42008 commented 7 years ago

I do not find repeatmodeler.lib in my working directory. Should I see it in there?

nextgenusfs commented 7 years ago

It should be in output_folder/predict_misc/repeatmodeler.lib.

el42008 commented 7 years ago

Ahh okay, my fault: because the modification date of the directory had not changed, I did not check there. I have now checked ../output_folder

The GFF is empty, so that step did not work, but the assembly was masked, because I see regions that are not in capital bases.

This is what I have at the moment inside the output folder:

-rw-r--r-- 1 el42208 cms 1.5K Apr 11 20:37 gmap-map.log
-rw-r--r-- 1 el42208 cms 1.5M Apr 11 20:37 transcript_alignments.gff3
-rw-r--r-- 1 el42208 cms  32K Apr 11 20:29 gmap-build.log
drwxr-xr-x 3 el42208 cms   20 Apr 11 20:29 genome
-rw-r--r-- 1 el42208 cms 162M Apr 11 20:11 transcripts.combined.fa
-rw-r--r-- 1 el42208 cms 700M Apr 11 20:11 genome.fasta
-rw-r--r-- 1 el42208 cms 714M Apr 11 13:50 genome.softmasked.fa
-rw-r--r-- 1 el42208 cms    0 Apr 11 13:50 repeatmasker.gff3
drwxr-xr-x 2 el42208 cms    4 Apr 11 13:50 RepeatMasker
drwxr-xr-x 3 el42208 cms    9 Apr 11 10:38 RepeatModeler
-rw-r--r-- 1 el42208 cms 1.4M Apr 11 10:38 repeatmodeler.lib.fa

nextgenusfs commented 7 years ago

Okay, can you try this command then:

rmOutToGFF3.pl outputfolder/predict_misc/RepeatMasker/genome.fasta.out

el42008 commented 7 years ago

Hi Jon,

When I run

rmOutToGFF3.pl outputfolder/predict_misc/RepeatMasker/genome.fasta.out

manually it works, but when I try to run the funannotate pipeline I get an empty RepeatMasker GFF file. In addition I get the following error:

[02:43:23 PM]: OS: linux2, 32 cores, ~ 264 GB RAM. Python: 2.7.13
[02:43:23 PM]: Running funannotate v0.6.0
[02:43:40 PM]: AUGUSTUS (3.2.3) detected, version seems to be compatible with BRAKER1 and BUSCO
[02:44:04 PM]: Masked genome: 224,092 scaffolds; 730,142,347 bp
[02:44:10 PM]: Using existing transcript evidence alignments
[02:44:16 PM]: Using existing protein evidence alignments
[02:44:16 PM]: Mapping proteins to genome using tBlastn/Exonerate
Traceback (most recent call last):
  File "/mnt/apps/funannotate/bin/funannotate-p2g.py", line 46, in <module>
    diamond_version = subprocess.Popen(['diamond', '--version'], stdout=subprocess.PIPE).communicate()[0].split('\n')[0]
  File "/mnt/apps/python/2.7/lib/python2.7/subprocess.py", line 390, in __init__
    errread, errwrite)
  File "/mnt/apps/python/2.7/lib/python2.7/subprocess.py", line 1024, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied
[02:44:24 PM]: Now launching BRAKER to train GeneMark and Augustus
Traceback (most recent call last):
  File "/mnt/apps/funannotate/bin/funannotate-predict.py", line 504, in <module>
    Option2 = '--BAMTOOLS_PATH=' + BAMTOOLS_PATH
NameError: name 'BAMTOOLS_PATH' is not defined

hyphaltip commented 7 years ago

Isn't that a diamond error now? What happens when you run

which diamond

or

diamond --version
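
The same probe can be done from Python to see how a launch failure surfaces. This is a minimal Python 3 sketch (the thread's install runs Python 2.7, where both cases appear as a plain OSError with a different errno); probe_tool is a hypothetical helper name and not part of funannotate.

```python
import subprocess


def probe_tool(command):
    """Run `command` and classify the outcome without raising."""
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except FileNotFoundError:
        # Errno 2: no such executable was found.
        return "missing"
    except PermissionError:
        # Errno 13: a candidate was found but could not be executed or reached.
        return "permission denied"
    return "ran (exit code %d)" % result.returncode


if __name__ == "__main__":
    print(probe_tool(["diamond", "--version"]))
```

A "permission denied" result for a tool that is not installed at all may point at a PATH entry that cannot be searched or at a non-executable file with the same name, though that is only a guess for this particular cluster.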

el42008 commented 7 years ago

I am afraid I do not have that installed in my cluster. Do you think that is the problem then?

Cheers

el42008 commented 7 years ago

Hi Jon

What is diamond? Do I need to install it, or is blastn enough?

[03:29:43 PM]: OS: linux2, 32 cores, ~ 264 GB RAM. Python: 2.7.13
[03:29:44 PM]: Running funannotate v0.6.0
[03:29:47 PM]: AUGUSTUS (3.2.3) detected, version seems to be compatible with BRAKER1 and BUSCO
[03:30:05 PM]: Masked genome: 224,092 scaffolds; 730,142,347 bp
[03:30:12 PM]: Aligning transcript evidence to genome with GMAP
[04:18:52 PM]: Mapping proteins to genome using tBlastn/Exonerate
Traceback (most recent call last):
  File "/mnt/apps/funannotate/bin/funannotate-p2g.py", line 46, in <module>
    diamond_version = subprocess.Popen(['diamond', '--version'], stdout=subprocess.PIPE).communicate()[0].split('\n')[0]
  File "/mnt/apps/python/2.7/lib/python2.7/subprocess.py", line 390, in __init__
    errread, errwrite)
  File "/mnt/apps/python/2.7/lib/python2.7/subprocess.py", line 1024, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied
[04:18:53 PM]: Mapping proteins to genome using tBlastn/Exonerate
Traceback (most recent call last):
  File "/mnt/apps/funannotate/bin/funannotate-p2g.py", line 46, in <module>
    diamond_version = subprocess.Popen(['diamond', '--version'], stdout=subprocess.PIPE).communicate()[0].split('\n')[0]
  File "/mnt/apps/python/2.7/lib/python2.7/subprocess.py", line 390, in __init__
    errread, errwrite)
  File "/mnt/apps/python/2.7/lib/python2.7/subprocess.py", line 1024, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied
[04:18:54 PM]: Now launching BRAKER to train GeneMark and Augustus
Traceback (most recent call last):
  File "/mnt/apps/funannotate/bin/funannotate-predict.py", line 504, in <module>
    Option2 = '--BAMTOOLS_PATH=' + BAMTOOLS_PATH
NameError: name 'BAMTOOLS_PATH' is not defined

nextgenusfs commented 7 years ago

Hi Elena, sorry, I've been traveling for a few days. What version of Linux are you on? It seems to me like the Python subprocess module is not acting the way it does on Mac or Ubuntu; it almost seems like it is launching a subshell that does not have the same permissions and/or potentially the same $PATH environment. I do not see this behavior on the Docker image. There are a few things that I can think of to try, but it would be helpful to figure out whether there are some nuances in how the subprocess module behaves with your OS/Python version.

The BAMTOOLS_PATH error seems like it could be related, do you have the $BAMTOOLS_PATH environmental variable set in your ~/.bashrc? i.e. echo $BAMTOOLS_PATH

el42008 commented 7 years ago

Hi Jon,

I got much further with the pipeline, but I ran into some issues that I think are coming from Evidence Modeler.

This is braker.log:

NEXT STEP: check files and settings
NEXT STEP: check options
... options check complete.
NEXT STEP: check fasta headers
WARNING: Detected whitespace in fasta header of file /mnt/shared/users/el42208/RNA_data/funannotate/fun_out/predict_misc/genome.softmasked.fa. This may later on cause problems! The pipeline will create a new file without spaces or "|" characters and a header.map file to look up the old and new headers. This message will be suppressed from now on!
fasta headers check complete.
NEXT STEP: create SAM header file /mnt/shared/users/el42208/RNA_data/funannotate/braker/potato/ALL_bam_new_headers_final_header.sam.
SAM file /mnt/shared/users/el42208/RNA_data/funannotate/braker/potato/ALL_bam_new_headers_final_header.sam complete.
NEXT STEP: check BAM headers
headers check for BAM file /mnt/shared/users/el42208/RNA_data/funannotate/ALL_bam_new_headers_final.bam complete.
NEXT STEP: make hints from BAM file /mnt/shared/users/el42208/RNA_data/funannotate/ALL_bam_new_headers_final.bam
failed to execute: Inappropriate ioctl for device

This is the error file:

[03:50:14 PM]: OS: linux2, 32 cores, ~ 264 GB RAM. Python: 2.7.13
[03:50:14 PM]: Running funannotate v0.6.0
[03:50:16 PM]: AUGUSTUS (3.2.3) detected, version seems to be compatible with BRAKER1 and BUSCO
[03:50:36 PM]: Masked genome: 224,092 scaffolds; 730,142,347 bp
[03:50:42 PM]: Using existing transcript evidence alignments
[03:50:57 PM]: Using existing protein evidence alignments
[03:51:04 PM]: Now launching BRAKER to train GeneMark and Augustus
[03:51:04 PM]: GeneMark predictions failed, proceeding with only Augustus
[03:51:04 PM]: Augustus prediction failed, check logfiles/augustus-parallel.log

The augustus-parallel.log file is not produced, and the genemark.gff file is only 216 bytes long, so whatever produces it is failing. augustus.evm.gff3 is also empty.

I also see this message in /home/el42208/RNA_data/funannotate/fun_out/predict_misc/braker/potato/errors

/mnt/apps/augustus/3.2.3/config/../bin/bam2hints: error while loading shared libraries: libbamtools.so.2.4.0: cannot open shared object file: No such file or directory

Do you have any idea what would be the problem?

Cheers

Elena

nextgenusfs commented 7 years ago

Hi Elena, Yes this error means that bam2hints is not compiled correctly. You will need to recompile bam2hints in the augustus installation - for more detailed instructions you can check the BRAKER1 manual/Augustus manual. Augustus can be a real pain to install correctly.

How did you get the other errors fixed? What was the problem with the subprocess/permissions error you had before?
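To confirm which shared libraries the bam2hints binary cannot resolve, the output of `ldd` can be inspected. The sketch below is a hypothetical helper, not part of funannotate or Augustus, and assumes a Linux system where `ldd` is on the PATH. The binary path is the one from the error message in this thread.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # ldd lines look like: "<tab>libname.so.X => not found"
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on a binary and list its unresolved shared libraries."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)


if __name__ == "__main__":
    # Path copied from the error message above; adjust for your installation.
    print(check_binary("/mnt/apps/augustus/3.2.3/config/../bin/bam2hints"))
```

If `libbamtools.so.2.4.0` shows up in the list, the binary was linked against a bamtools library that the dynamic loader cannot find at runtime. Recompiling bam2hints against the installed bamtools, as suggested above, is one way to resolve that.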

el42008 commented 7 years ago

Hi Jon,

It seems funannotate is working properly for me at the moment. Could you explain to me how the protein evidence given to funannotate is aligned when it goes through tblastn and exonerate?

Thanks a lot

nextgenusfs commented 7 years ago

Exonerate is very slow. So the script uses tblastn to do a preliminary search to find regions of homology. Then the scripts pull out that region of the scaffolds where a potential hit is and runs exonerate protein2genome method. Only hits > 80% are retained.
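The two-stage idea described above (a fast tblastn pre-filter, then exonerate on only the region around each hit) can be sketched as follows. This is an illustration under stated assumptions, not funannotate's actual code: the `Hit` tuple, the function names, and the 3000 bp padding are all hypothetical.

```python
from collections import namedtuple

# Hypothetical record for one tblastn hit: subject coordinates on a scaffold.
Hit = namedtuple("Hit", "protein scaffold start end")


def padded_region(hit, scaffold_length, padding=3000):
    """Return a 1-based inclusive (start, end) window around a hit.

    The window is clipped to the scaffold. The padding value is an
    assumption chosen for illustration.
    """
    # tblastn reports minus-strand hits with start > end, so sort first.
    lo, hi = sorted((hit.start, hit.end))
    return max(1, lo - padding), min(scaffold_length, hi + padding)


def extract(sequence, region):
    """Slice a scaffold sequence using 1-based inclusive coordinates."""
    start, end = region
    return sequence[start - 1:end]


if __name__ == "__main__":
    hit = Hit("P12345", "scaffold_1", 5000, 4000)
    print(padded_region(hit, scaffold_length=6000))
```

Each extracted window, together with the matching protein, would then be passed to exonerate's protein2genome model, so the slow aligner only sees a small slice of the genome.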

el42008 commented 7 years ago

Hi Jon,

I noticed that you updated the script funannotate-predict.py 6 hours ago, but I am using an older version. Is there any problem if I keep using the old one? At the moment the script seems to be working perfectly.

[11:50:57 AM]: OS: linux2, 32 cores, ~ 264 GB RAM. Python: 2.7.13
[11:50:57 AM]: Running funannotate v0.6.0
[11:51:00 AM]: AUGUSTUS (3.2.3) detected, version seems to be compatible with BRAKER1 and BUSCO
[11:51:09 AM]: Loading sequences and soft-masking genome
[11:51:09 AM]: Soft-masking: building RepeatModeler database
[11:51:49 AM]: Soft-masking: generating repeat library using RepeatModeler
[09:14:52 PM]: Soft-masking: running RepeatMasker with custom library
[12:02:14 AM]: Masked genome: 224,092 scaffolds; 730,142,347 bp
[12:02:15 AM]: Aligning transcript evidence to genome with GMAP
[12:44:21 AM]: Aligning transcript evidence to genome with BLAT
[07:06:24 AM]: Mapping proteins to genome using tBlastn/Exonerate
[07:06:27 AM]: Using 540,252 proteins as queries
[07:06:27 AM]: Running pre-filter tBLASTN step
[09:03:55 PM]: Found 901,139 preliminary alignments
[03:13:02 AM]: Exonerate finished: found 33,180 alignments
[03:15:48 AM]: Now launching BRAKER to train GeneMark and Augustus
[07:07:56 AM]: 225,157 total gene models from all sources
[07:07:57 AM]: Setting up EVM partitions
[10:22:55 AM]: Generating EVM command list
[10:24:24 AM]: Running EVM commands with 29 CPUs

I hope it finishes soon. Then I will launch funannotate annotate.

Cheers

Elena

Hi Jon,

I am trying to write a summary of what the funannotate pipeline does and I came up with something like the following. I might have said something that does not make sense at all. Would you tell me if I am right about the steps in funannotate?

Protein-coding sequence predictions were generated with the funannotate pipeline for eukaryotes, which first aligned the S. verrucosum Mikado transcripts and reference potato and tomato transcripts to the assembly using GMAP. Then, for protein evidence, it aligned UniProt plant proteins using tblastn and exonerate (ref) (a splice-site-aware alignment algorithm). In this way, tblastn reports multiple matches across the genome for each exon, especially for short ones, and exonerate then segments the genome into matching regions based on the occurrence of such exon matches. Augustus (ref) and GeneMark-ET are then trained by BRAKER1, taking as input the RNA-seq BAM files from the different tissues plus the protein and transcript evidence, and once trained they generate gene predictions. Finally, it combines all predictions and lines of evidence into gene models using Evidence Modeler, predicts tRNAs using tRNAscan-SE, filters out transposons (a gene model is removed if it is 90% contained within a repeat region, or if a blastp search matches the repeat database from TransposonPSI and RepBase) and bad gene models (internal stops, etc.) using tbl2asn, renames the gene models, and finally converts them to GenBank format.

Functional annotation is then assigned to the gene models using PFAM, InterPro, UniProtKB, MEROPS proteases, CAZymes, Gene Ontology and BUSCO models (??). Then GAG brings all annotations together and annotates the genome. Duplicate annotations are removed and GenBank submission files are generated using tbl2asn. (I haven't run this last part yet, so I wonder whether it will use all those protein databases to do the actual annotation.)

Thanks a lot for all your help and patience!😊
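The repeat filter mentioned in the summary above (a gene model is dropped when it is 90% contained within repeat regions) amounts to an interval-overlap calculation. The sketch below illustrates that rule as worded in the summary. It is not confirmed to match funannotate's implementation, and all function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def repeat_fraction(gene, repeats):
    """Fraction of a gene interval (1-based inclusive) covered by repeats."""
    g_start, g_end = gene
    covered = 0
    # Merging first avoids double-counting bases shared by two repeats.
    for r_start, r_end in merge_intervals(repeats):
        overlap = min(g_end, r_end) - max(g_start, r_start) + 1
        if overlap > 0:
            covered += overlap
    return covered / (g_end - g_start + 1)


def is_repeat_gene(gene, repeats, threshold=0.9):
    """True if the gene meets the containment threshold (0.9 per the summary)."""
    return repeat_fraction(gene, repeats) >= threshold


if __name__ == "__main__":
    print(repeat_fraction((101, 200), [(50, 150), (140, 190)]))
```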

el42008 commented 7 years ago

Hi Jon,

How can I track what happened? It took several days to get to the EVM step. What do you think failed?

[11:50:57 AM]: OS: linux2, 32 cores, ~ 264 GB RAM. Python: 2.7.13
[11:50:57 AM]: Running funannotate v0.6.0
[11:51:00 AM]: AUGUSTUS (3.2.3) detected, version seems to be compatible with BRAKER1 and BUSCO
[11:51:09 AM]: Loading sequences and soft-masking genome
[11:51:09 AM]: Soft-masking: building RepeatModeler database
[11:51:49 AM]: Soft-masking: generating repeat library using RepeatModeler
[09:14:52 PM]: Soft-masking: running RepeatMasker with custom library
[12:02:14 AM]: Masked genome: 224,092 scaffolds; 730,142,347 bp
[12:02:15 AM]: Aligning transcript evidence to genome with GMAP
[12:44:21 AM]: Aligning transcript evidence to genome with BLAT
[07:06:24 AM]: Mapping proteins to genome using tBlastn/Exonerate
[07:06:27 AM]: Using 540,252 proteins as queries
[07:06:27 AM]: Running pre-filter tBLASTN step
[09:03:55 PM]: Found 901,139 preliminary alignments
[03:13:02 AM]: Exonerate finished: found 33,180 alignments
[03:15:48 AM]: Now launching BRAKER to train GeneMark and Augustus
[07:07:56 AM]: 225,157 total gene models from all sources
[07:07:57 AM]: Setting up EVM partitions
[10:22:55 AM]: Generating EVM command list
[10:24:24 AM]: Running EVM commands with 29 CPUs
[07:38:48 PM]: Combining partitioned EVM outputs
[07:42:34 PM]: Converting EVM output to GFF3
[07:42:35 PM]: Collecting all EVM results
[07:54:07 PM]: Evidence modeler has failed, exiting

Cheers

Elena

el42008 commented 7 years ago

Hi Jon,

This is what I got in the logfile:

2017-04-24 07:06:24,231: Mapping proteins to genome using tBlastn/Exonerate
2017-04-25 03:15:45,719: /mnt/apps/augustus/3.2.3/scripts/exonerate2hints.pl --in=fun_out/predict_misc/exonerate.out --out=fun_out/predict_misc/hints.P.gff --minintronlen=10 --maxintronlen=10000
2017-04-25 03:15:48,323: Now launching BRAKER to train GeneMark and Augustus
2017-04-26 07:06:51,297: perl /mnt/apps/EVidenceModeler/1.1.1/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl fun_out/predict_misc/braker/potato1/augustus.gff3
2017-04-26 07:07:27,720: /mnt/apps/funannotate/util/genemark_gtf2gff3.pl fun_out/predict_misc/braker/potato1/GeneMark-ET/genemark.gtf
2017-04-26 07:07:35,421: perl /mnt/apps/EVidenceModeler/1.1.1/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl fun_out/predict_misc/genemark.gff
2017-04-26 07:07:56,958: 225,157 total gene models from all sources
2017-04-26 19:54:07,204: Evidence modeler has failed, exiting

nextgenusfs commented 7 years ago

There is another folder containing EVM intermediate files and then in the log files folder is a log file for EVM, so check that for any errors.
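One way to follow that advice is to scan every log in the log files folder for lines that look like failures. This is a minimal sketch, not a funannotate feature: the function names and the keyword list are assumptions, and the `fun_out/logfiles` directory is taken from the paths shown elsewhere in this thread.

```python
import re
from pathlib import Path

# Keywords chosen for illustration; extend as needed.
PATTERN = re.compile(r"error|failed|traceback|no such file", re.IGNORECASE)


def suspicious_lines(text):
    """Return (line_number, line) pairs that match a failure keyword."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if PATTERN.search(line)
    ]


def scan_logs(log_dir):
    """Scan all *.log files in a directory; return {filename: matches}."""
    report = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        hits = suspicious_lines(path.read_text(errors="replace"))
        if hits:
            report[path.name] = hits
    return report


if __name__ == "__main__":
    for name, hits in scan_logs("fun_out/logfiles").items():
        for number, line in hits:
            print(name, number, line, sep=": ")
```

An empty result does not prove the step succeeded. A log that contains only the launched command, with no output after it, is itself a sign that the step died early.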

el42008 commented 7 years ago

2017-04-26 23:26:05,308: /mnt/apps/funannotate/bin/funannotate-runEVM.py fun_out/logfiles/funannotate-EVM.log 30 --genome /mnt/shared/users/el42208/RNA_data/funannotate/fun_out/predict_misc/genome.softmasked.fa --gene_predictions /mnt/shared/users/el42208/RNA_data/funannotate/fun_out/predict_misc/gene_predictions.gff3 --protein_alignments /mnt/shared/users/el42208/RNA_data/funannotate/fun_out/predict_misc/protein_alignments.gff3 --transcript_alignments /mnt/shared/users/el42208/RNA_data/funannotate/fun_out/predict_misc/transcript_alignments.gff3 --weights /mnt/shared/users/el42208/RNA_data/funannotate/fun_out/predict_misc/weights.evm.txt --min_intron_length 10 /mnt/shared/users/el42208/RNA_data/funannotate/fun_out/predict_misc/evm.round1.gff3

funannotate-EVM.log (END)

nextgenusfs commented 7 years ago

That's all the logfiles/funannotate-EVM.py file says???

nextgenusfs commented 7 years ago

Elena, can you start a new thread with the EVM question? The heading here has to do with BAM file headers, so it will be easier for others to search the issues for troubleshooting if there is one bug/problem per thread.

Unalibun commented 4 years ago

That's all the logfiles/funannotate-EVM.py file says???

Elena, can you start a new thread with the EVM question. As the heading on here has to do with BAM file headers. So will be easier for others to search for troubleshooting in the issues if there is 1 bug/problem per thread.

Hi Jon, I am having the same error as Elena in my EVM log. Do you know if a new issue was opened to resolve this problem?

nextgenusfs commented 4 years ago

@unalibun please open a new issue and describe your bug (version of software, command, specific error, etc).

MesYosra commented 1 year ago

Hello @nextgenusfs, I have the same error: Genome assembly error: headers contain more characters than the maximum (16), reformat headers to continue. I looked at the code here: https://github.com/nextgenusfs/funannotate/blob/master/funannotate/annotate.py There is no literal 16 there, and I don't understand why (the maximum number of characters is 16, so in the code it seems to depend on header_length).

nextgenusfs commented 1 year ago

That is the max length of characters in the FASTA headers supported by NCBI.
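Before running the pipeline, a FASTA file can be checked against that 16-character limit with a few lines of Python. This is a sketch under stated assumptions, not funannotate's own check: the function names and the `scaffold_` prefix are hypothetical, and it measures the full header line after the `>`.

```python
MAX_HEADER_LENGTH = 16  # limit quoted in this thread


def long_headers(fasta_lines, max_length=MAX_HEADER_LENGTH):
    """Return FASTA headers (without '>') longer than max_length characters."""
    too_long = []
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if len(header) > max_length:
                too_long.append(header)
    return too_long


def rename_map(headers, prefix="scaffold_"):
    """Map each old header to a short sequential name (prefix is an assumption)."""
    return {old: f"{prefix}{i}" for i, old in enumerate(headers, 1)}


if __name__ == "__main__":
    example = [">short", "ACGT", ">this_header_is_far_too_long", "ACGT"]
    print(long_headers(example))
```

Note that renaming headers after RNA-seq reads have been aligned means the BAM file's reference names no longer match the genome, which is the kind of mismatch the BAM header test discussed at the top of this thread is meant to catch.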