nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Funannotate update - Did not recognise the LOCUS line layout #353

Closed: olekto closed this issue 4 years ago

olekto commented 4 years ago

Are you using the latest release? Using 1.7.0

Describe the bug When running funannotate update, I get this error:

[01:18 AM]: OS: linux2, 80 cores, ~ 198 GB RAM. Python: 2.7.15
[01:18 AM]: Running 1.7.0
[01:18 AM]: No NCBI SBT file given, will use default, for NCBI submissions pass one here '--sbt'
[01:18 AM]: Found relevant files in fish/training, will re-use them:
        Forward reads: fish/training/left.fq.gz
        Reverse reads: fish/training/right.fq.gz
        Forward Q-trimmed reads: fish/training/trimmomatic/trimmed_left.fastq.gz
        Reverse Q-trimmed reads: fish/training/trimmomatic/trimmed_right.fastq.gz
        Forward normalized reads: fish/training/normalize/left.norm.fq
        Reverse normalized reads: fish/training/normalize/right.norm.fq
        Trinity results: fish/training/funannotate_train.trinity-GG.fasta
        PASA config file: fish/training/pasa/alignAssembly.txt
        BAM alignments: fish/training/funannotate_train.coordSorted.bam
        StringTie GTF: fish/training/funannotate_train.stringtie.gtf
Traceback (most recent call last):
  File "/progs/miniconda3/envs/funannotate/bin/funannotate", line 657, in <module>
    main()
  File "/progs/miniconda3/envs/funannotate/bin/funannotate", line 647, in main
    mod.main(arguments)
  File "/progs/miniconda3/envs/funannotate/lib/python2.7/site-packages/funannotate/update.py", line 1755, in main
    elif lib.checkRefSeq(GBK):
  File "/progs/miniconda3/envs/funannotate/lib/python2.7/site-packages/funannotate/library.py", line 2652, in checkRefSeq
    for record in SeqIO.parse(infile, 'genbank'):
  File "/progs/miniconda3/envs/funannotate/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 661, in parse
    for r in i:
  File "/progs/miniconda3/envs/funannotate/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 493, in parse_records
    record = self.parse(handle, do_features)
  File "/progs/miniconda3/envs/funannotate/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 477, in parse
    if self.feed(handle, consumer, do_features):
  File "/progs/miniconda3/envs/funannotate/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 444, in feed
    self._feed_first_line(consumer, self.line)
  File "/progs/miniconda3/envs/funannotate/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 1461, in _feed_first_line
    raise ValueError('Did not recognise the LOCUS line layout:\n' + line)
ValueError: Did not recognise the LOCUS line layout:
LOCUS       LoL_20191026_scaffold_14671229 bp   DNA    linear       04-DEC-2019

Where is this input file made? And can I change how it is made? It seems that the SeqID field and the base-pair count overlap. Are there limits on these fields? It seems the SeqID can be at least 30 characters, and the bp field should be able to handle several hundred million bases (I hope). Is there an easy way to fix this?
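The overlap can be illustrated with a short sketch (the column positions are assumptions based on the classic GenBank flat-file layout, not funannotate's actual writer): the sequence length is right-justified so it ends at a fixed column, so a SeqID long enough to reach that column fuses with the digits.

```python
# Hypothetical sketch of fixed-column LOCUS formatting. Field positions
# are assumptions from the classic GenBank flat-file layout; the real
# writer inside funannotate may differ.
def make_locus_line(seqid, length_bp):
    line = "LOCUS       " + seqid      # SeqID starts at column 13
    digits = str(length_bp)
    # The length is right-justified to end at column 40, regardless of
    # how long the SeqID is -- a long SeqID eats the padding entirely.
    pad = max(0, 40 - len(digits) - len(line))
    return line + " " * pad + digits + " bp   DNA    linear       04-DEC-2019"

print(make_locus_line("scaffold_1", 1229))
print(make_locus_line("LoL_20191026_scaffold_1467", 1229))
```

With a short SeqID the two fields are separated by padding; with a 26-character SeqID the padding drops to zero and the name runs straight into the digits, producing exactly the kind of LOCUS line Biopython rejects.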

I just want a GFF and some predicted transcripts and proteins, and am not that interested in Genbank files. I do understand that it is needed as an intermediate format inside funannotate.

Thank you.

Ole

What command did you issue? funannotate update -i fish --cpus 20 > update.out 2> update.err

nextgenusfs commented 4 years ago

Fasta headers have a 16-character limit for NCBI; you should have gotten a warning when you ran funannotate predict. I think this is because there isn't enough space in the Genbank flat-file format. The Genbank format is integrated into funannotate because the original goal of the pipeline was to create NCBI-ready submissions, so this won't be changed. The simplest fix is to rename your fasta headers and run the pipeline again.
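The suggested renaming can be sketched as a small pre-processing step (a hypothetical helper, not a funannotate command): shorten each header to an ID within the 16-character limit, and keep a mapping file so the original names can be restored in downstream GFF output.

```python
# Hypothetical helper: rewrite fasta headers to short IDs and record an
# old-name -> new-name mapping in a TSV file. Not part of funannotate.
def shorten_headers(in_fa, out_fa, map_tsv, prefix="scaffold_"):
    mapping = {}
    with open(in_fa) as src, open(out_fa, "w") as dst:
        n = 0
        for line in src:
            if line.startswith(">"):
                n += 1
                old = line[1:].split()[0]        # first word of the header
                new = "%s%d" % (prefix, n)       # short, NCBI-safe ID
                mapping[old] = new
                dst.write(">" + new + "\n")
            else:
                dst.write(line)
    with open(map_tsv, "w") as fh:
        for old, new in mapping.items():
            fh.write(old + "\t" + new + "\n")
    return mapping
```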

nextgenusfs commented 4 years ago

Not sure, but a possible workaround would be to pass the fasta and GFF files to funannotate update instead of the funannotate directory.

olekto commented 4 years ago

I have adjusted the accepted header length to 40, since I don't plan to submit to NCBI. I was not aware that it could hinder parts of funannotate itself.

I guess I might do 'funannotate annotate' first, and then maybe try passing the resulting fasta and GFF files to funannotate update. Then I will need to input the RNA-seq data again, but that is fine.

nextgenusfs commented 4 years ago

The update script reads the Genbank file. The header-length requirement exists because there isn't enough space in the Genbank format for long header names; they run into the next field. So funannotate generates Genbank files with malformed LOCUS lines, which is what is causing the error.

olekto commented 4 years ago

Thank you. I find it a bit unfortunate that I am restricted by the genbank format, when I am not planning to submit to genbank.

Will I get the same issue with functional annotation? I would rather not have to rename my fasta headers for the different species I work with, and then also rename all the different GFF files and such.

nextgenusfs commented 4 years ago

You can pass the scripts a GFF and FASTA file -- however, the Genbank flat files it writes won't be readable. Augustus training also uses an intermediate Genbank flat-file format, so long header names will cause the training to fail (at least it used to; I'm not sure whether this has been corrected).

Sorry for the inconvenience, but Genbank was an early design choice -- GFF is not a great format due to its lack of standardization: most functional information is crammed into a single column. This code base contains numerous GFF parsers because every tool outputs a slightly different (in)compatible flavor. The benefit (and at the same time drawback) of Genbank is that it is highly structured, meaning it's much easier to reliably add functional annotation, etc. And Biopython has a decent parser.
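The column-9 problem described above can be seen in a toy attribute parser (the attribute strings are made up but representative of GFF3 versus GTF conventions): even this basic case needs per-flavor branching.

```python
# Toy illustration of GFF attribute-column variability; real tools emit
# many more flavors than the two handled here.
def parse_attributes(col9):
    attrs = {}
    for field in col9.strip().rstrip(";").split(";"):
        field = field.strip()
        if "=" in field:                   # GFF3 style: key=value
            key, _, val = field.partition("=")
        elif " " in field:                 # GTF style: key "value"
            key, _, val = field.partition(" ")
            val = val.strip('"')
        else:
            key, val = field, ""
        attrs[key] = val
    return attrs

print(parse_attributes("ID=gene1;Name=abc"))                  # GFF3 flavor
print(parse_attributes('gene_id "g1"; transcript_id "t1";'))  # GTF flavor
```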

I originally wrote this tool because Maker-based GFF3 output was nearly impossible to reformat for NCBI submission, resulting in hundreds of errors that needed to be fixed manually.