Problem with Evidencemodeler

LemoAlex commented 3 years ago

Hello funannotate users,

I am currently using funannotate v1.8.4, installed through Docker; funannotate check and the tests run without issues.

I am trying to run funannotate predict on a fish genome assembly.

So, when I run:

funannotate-docker predict -i ~softmasked.genome.fasta -o ./output1 -s "Species name" --transcript_evidence Transcriptome.fasta --optimize_augustus --other_gff /home/alexandre/funannotate/Species.transdecoder.gff3 --protein_evidence uniprot.reviewed.fasta uniprot-reviewed.fasta --organism other --rna_bam ~/funannotate/alignment.bam --weights codingquarry:1 --cpus 4

Everything runs smoothly until the EvidenceModeler step. Then I get this message:

funannotate-EVM.log

EVM: partitioning input to ~ 35 genes per partition
Traceback (most recent call last):
  File "/venv/lib/python3.7/site-packages/funannotate/aux_scripts/funannotate-runEVM.py", line 433, in <module>
    partitions=args.no_partitions)
  File "/venv/lib/python3.7/site-packages/funannotate/aux_scripts/funannotate-runEVM.py", line 203, in create_partitions
    k, len(SeqRecords[k])))
  File "/venv/lib/python3.7/site-packages/Bio/File.py", line 248, in __getitem__
    record = self._proxy.get(self._offsets[key])
KeyError: 'scaffold_1'
[Jan 11 08:24 AM]: Evidence modeler has failed, exiting
Traceback (most recent call last):
  File "/venv/bin/funannotate", line 713, in <module>
    main()
  File "/venv/bin/funannotate", line 703, in main
    mod.main(arguments)
  File "/venv/lib/python3.7/site-packages/funannotate/predict.py", line 1730, in main
    os.remove(EVM_out)
FileNotFoundError: [Errno 2] No such file or directory: '~/output1/predict_misc/evm.round1.gff3'

The EVM logfile (attached) does not show any error, so I am a bit confused about what's going on here.

Thanks for the help,

Best,
Alexandre

funannotate-EVM.log

nextgenusfs commented 3 years ago

This line from the logfile seems odd:

[01/11/21 08:23:44]: 9,557 total contigs; skipping -51,760 contigs with no genes

Do you have the predict logfile that I could look at as well?

LemoAlex commented 3 years ago

Yes, here it is attached. funannotate-predict.log

Thanks for your help.

Alexandre

nextgenusfs commented 3 years ago

Hmm, okay, thanks. I can't quite tell, but it looks like the command line around the --species argument perhaps isn't getting passed properly, i.e. if you look at the command printed in the log file:

/venv/bin/funannotate predict -i /home/alexandre/funannotate/fish.masked.fa -o ./output1 -s Species name--transcript_evidence /home/alexandre/funannotate/Alignment/Tran.fa --optimize_augustus --other_gff /home/alexandre/funannotate/Tran.fa.transdecoder.gff3 --protein_evidence uniprot-catfish-reviewed.fasta uniprot-zebrafish-reviewed.fasta --organism other --rna_bam /home/alexandre/funannotate/sorted.bam --weights codingquarry:1 --cpus 4

I don't know how that would necessarily cause problems with EVM per se... but maybe it's just a typo? In your initial command above there is clearly a space.

-s Species name--transcript_evidence /home/alexandre/funannotate/Alignment/Tran.fa
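As a side note on the quoting, here is a minimal Python sketch that uses the standard-library shlex module to mimic POSIX shell word splitting. The command fragments are shortened, made-up versions of the ones above, and this only illustrates shell splitting in general; it is not how the funannotate-docker wrapper or funannotate itself handles arguments.

```python
import shlex

# Quoted: the shell hands the species name to the program as one argument.
quoted = shlex.split('predict -s "Species name" --cpus 4')

# Unquoted: the name is split into two separate arguments.
unquoted = shlex.split('predict -s Species name --cpus 4')

print(quoted)
print(unquoted)
```

If the quotes are lost somewhere between the wrapper and the container, the second case is what the program would see.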

So, assuming the above is not related to the error, you can try to run the EVM command from that same directory, and maybe that will yield more info on stdout, i.e.:

funannotate-docker /venv/bin/python /venv/lib/python3.7/site-packages/funannotate/aux_scripts/funannotate-runEVM.py -w /home/alexandre/funannotate/output1/predict_misc/weights.evm.txt -c 4 -g /home/alexandre/funannotate/output1/predict_misc/gene_predictions.gff3 -d /home/alexandre/funannotate/output1/predict_misc/EVM -f /home/alexandre/funannotate/output1/predict_misc/genome.softmasked.fa -l ./output1/logfiles/funannotate-EVM.log -m 10 -o /home/alexandre/funannotate/output1/predict_misc/evm.round1.gff3 --EVM_HOME /venv/opt/evidencemodeler-1.1.1 -p /home/alexandre/funannotate/output1/predict_misc/protein_alignments.gff3 -t /home/alexandre/funannotate/output1/predict_misc/transcript_alignments.gff3

nextgenusfs commented 3 years ago

Actually, that will probably fail because of what I have in the bash script. You can create a new bash wrapper like this that will just run the image (it is the same, it just doesn't include the call to funannotate):

#!/usr/bin/env bash

realpath() {
  OURPWD=$PWD
  cd "$(dirname "$1")"
  LINK=$(readlink "$(basename "$1")")
  while [ "$LINK" ]; do
    cd "$(dirname "$LINK")"
    LINK=$(readlink "$(basename "$1")")
  done
  REALPATH="$PWD/$(basename "$1")"
  cd "$OURPWD"
  echo "$REALPATH"
}

timezone() {
    if [ "$(uname)" == "Darwin" ]; then
        TZ=$(readlink /etc/localtime | sed 's#/var/db/timezone/zoneinfo/##')
    else
        TZ=$(readlink /etc/timezone)
    fi
    echo $TZ
}

# Only allocate tty if one is detected. See - https://stackoverflow.com/questions/911168
if [[ -t 0 ]]; then IT+=(-i); fi
if [[ -t 1 ]]; then IT+=(-t); fi

USER="$(id -u $(logname)):$(id -g $(logname))"
WORKDIR="$(realpath .)"
MOUNT="type=bind,source=${WORKDIR},target=${WORKDIR}"
TZ="$(timezone)"

exec docker run --rm "${IT[@]}" --user "${USER}" -e TZ="${TZ}" --workdir "${WORKDIR}" --mount "${MOUNT}" nextgenusfs/funannotate:latest "$@"

nextgenusfs commented 3 years ago

Here is a generalized version of this bash script -- you could run with any docker container: https://github.com/nextgenusfs/dw/

LemoAlex commented 3 years ago

Hello again,

Thanks for the answers. I tried removing the spaces in the species name, but I still get the same error.

I also tried running the EVM step using the bash script through dw, but again I get the exact same output as I did when running the whole pipeline. I also get (as I did before) a single file called genes.1.bed in the predict_misc/EVM folder. It feels like EVM can't get past the first scaffold; could this be possible?

Thanks, Alexandre

nextgenusfs commented 3 years ago

~~I suppose it could be running out of RAM. Can you increase the RAM allocated to docker?~~

Never mind, I saw your log file and it is already 264 GB.

When you call this, are all of the files you are passing to the Docker container located in the same run directory?

The other thing to try would be to move into the Docker image interactively and then run the EVM workflow, i.e. docker run -it -v {need to mount filesystem folders} nextgenusfs/funannotate /bin/bash

And then lastly, I assume the test dataset runs on your system?

funannotate-docker test -t rna-seq --cpus XX

nextgenusfs commented 3 years ago

One other thing to try would be to delete all of the EVM temp files and then add --no-evm-partitions to your predict command (I just realized it's not in the help menu) -- this will run the partitioning differently, in case that is what is causing EVM to die.

nextgenusfs commented 3 years ago

But going back to my original thought about the EVM log file, this line seems strange:

[01/11/21 08:23:44]: 9,557 total contigs; skipping -51,760 contigs with no genes

What is happening in the code is this:

    # sort the results by contig and position
    ChrGeneCounts = {}
    sortedResults = natsorted(Results, key=lambda x: (x[0], x[1]))
    with open(bedGenes, 'w') as outfile:
        for x in sortedResults:
            outfile.write('{}\t{}\t{}\t{}\t{}\t{}\n'.format(x[0], x[1], x[2],
                                                            x[3], x[4], x[5]))
            if not x[0] in ChrGeneCounts:
                ChrGeneCounts[x[0]] = 1
            else:
                ChrGeneCounts[x[0]] += 1
    ChrNoGenes = len(SeqRecords) - len(ChrGeneCounts)
    lib.log.debug('{:,} total contigs; skipping {:,} contigs with no genes'.format(len(SeqRecords), ChrNoGenes))

This suggests something is wrong with the input files (something I've not seen before): the count is negative, so the gene models reference more sequence names (9,557 + 51,760 = 61,317) than there are contigs in the genome.
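To see how that count can go negative, here is a small sketch with made-up data. The variable names mirror the snippet above, but the sequence names and coordinates are hypothetical: when gene models are keyed by names that are not in the genome, ChrGeneCounts can end up with more keys than SeqRecords.

```python
# Hypothetical genome index with two contigs.
SeqRecords = {"scaffold_1": "ACGT", "scaffold_2": "GGCC"}

# Hypothetical gene models whose first field is a transcript ID
# rather than a genome contig name.
Results = [
    ("TRINITY_DN1_c0_g1_i1", 10, 200),
    ("TRINITY_DN2_c0_g1_i1", 5, 150),
    ("TRINITY_DN3_c0_g1_i1", 1, 90),
]

# Count gene models per sequence name, as the snippet above does.
ChrGeneCounts = {}
for contig, start, end in Results:
    ChrGeneCounts[contig] = ChrGeneCounts.get(contig, 0) + 1

ChrNoGenes = len(SeqRecords) - len(ChrGeneCounts)
print(ChrNoGenes)  # 2 - 3 = -1

# Names the genome index cannot resolve; indexing SeqRecords with
# any of these would raise a KeyError.
missing = [name for name in ChrGeneCounts if name not in SeqRecords]
print(missing)
```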

Most likely something is wrong with the headers on one of these input files -- can you validate that the input files have appropriate FASTA/sequence headers? For example, do the sequence names in the custom GFF that you are passing match the genome FASTA headers? And do the headers in the BAM file match as well?
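One way to check the GFF against the genome is sketched below in plain Python. The function names are made up for illustration and are not part of funannotate; the sketch assumes the sequence ID is the first whitespace-delimited word of each FASTA header and column 1 of each GFF3 feature line. It does not cover the BAM file.

```python
def fasta_ids(lines):
    """Return the set of sequence IDs (first word after '>') in FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gff_seqids(lines):
    """Return the set of column-1 values (seqid) of GFF3 feature lines."""
    ids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section, no more features follow
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        ids.add(line.split("\t", 1)[0])
    return ids


def unmatched_seqids(fasta_lines, gff_lines):
    """Return GFF3 seqids that are absent from the FASTA headers, sorted."""
    return sorted(gff_seqids(gff_lines) - fasta_ids(fasta_lines))


# Tiny made-up example; with real files, pass open(path) for each argument.
genome = [">scaffold_1 length=8", "ACGTACGT", ">scaffold_2", "GGCCGGCC"]
gff = [
    "##gff-version 3",
    "scaffold_1\tsrc\tgene\t1\t8\t.\t+\t.\tID=g1",
    "TRINITY_DN1_c0_g1_i1\ttransdecoder\tgene\t1\t300\t.\t+\t.\tID=g2",
]
print(unmatched_seqids(genome, gff))
```

An empty list means every sequence name in the GFF exists in the genome; anything else lists the names that the genome index would fail to find.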

LemoAlex commented 3 years ago

> For example, the custom GFF that you are passing do they match the genome FASTA headers?

Ok, maybe the problem is there! My GFF file comes from Transdecoder, but I used the transcriptome as an input. So obviously, the transcriptome and the genome don't have the same headers. Could the problem come from there? What could I use as an alternative then?

Thanks,

Alexandre

nextgenusfs commented 3 years ago

So if the transcripts aren't aligned to the genome reference, then the GFF shouldn't be passed to --other_gff. If you have transcripts from Transdecoder that you want to align, you can pass those in FASTA format to --transcript_evidence -- this option takes multiple space-delimited inputs.

Maybe it's not obvious -- but the pipeline might work a lot better if you let funannotate train run Trinity/PASA/Transdecoder. That way those tools get run in a way that produces formats funannotate knows....

LemoAlex commented 3 years ago

Hi,

Sorry for the long delay. Just to let you know that I ran it as you suggested and I was able to finish the whole pipeline successfully, so thank you!

Best, Alexandre

nextgenusfs / funannotate

Problem with Evidencemodeler #528