Issue funannotate train

BenjaminSchwessinger commented 3 years ago

Hi, thanks for improving funannotate so much. I am back after 3 years and it seems a huge step forward.

I am using funannotate v 1.8.1 on Linux and trying to run the following command.

funannotate train --cpus 20 --species "Awesome" --strain SOMETHING --trinity Trinity-GG.combined.fasta -i Awesome.masked.fasta --out train

The PASA step runs for a LOOONG time and at one point raises an error: CMD ERROR: /home/benjamin/anaconda3/envs/funannotate/opt/pasa-2.4.1/Launch_PASA_pipeline.pl xxxxx.

The pipeline continues and then raises the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 435572: ordinal not in range(128)
Logged from file library.py, line 745

Logfiles

The funannotate log file ends w/

CMD: touch /home/benjamin/genome_assembly/Pst198E16_v1/funannotate/predict/train/training/pasa/trinity.fasta.clean.transdecoder_dir.__checkpoints_longorfs/TD.longorfs.ok
loading trinity.fasta.clean.transdecoder.gff3.fl_accs

It also shows the PASA CMD ERROR but not the Python Unicode error.

The pasa-assembly.log is empty.

I am not sure whether this all ran to completion or not. Any pointers would be greatly appreciated.

nextgenusfs commented 3 years ago

Hi @BenjaminSchwessinger -- hope you are well and welcome back! Can you confirm that the rna-seq tests complete without error? That should allow us to see if the install is okay, i.e.:

funannotate test -t rna-seq --cpus 20

If the PASA log is empty, that suggests it got stuck somewhere. The unicode/decode error is a classic py2/3 issue. Are you on py2 or py3?
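For context on the error quoted above: byte 0xe2 is the lead byte of many three-byte UTF-8 characters (curly quotes and dashes, for example), so decoding with the ASCII codec fails as soon as one appears. A minimal sketch, not funannotate's own code, that reproduces the failure and reports where non-ASCII bytes sit; the helper names `find_non_ascii` and `safe_decode` and the sample bytes are hypothetical:

```python
def find_non_ascii(data: bytes, limit: int = 10):
    """Return (offset, byte value) pairs for the first `limit` non-ASCII bytes."""
    hits = []
    for offset, value in enumerate(data):
        if value > 0x7F:
            hits.append((offset, value))
            if len(hits) >= limit:
                break
    return hits


def safe_decode(data: bytes) -> str:
    """Decode as UTF-8, replacing undecodable bytes instead of raising."""
    return data.decode("utf-8", errors="replace")


# Hypothetical sample: a line containing a UTF-8 right single quote (e2 80 99).
sample = b"can\xe2\x80\x99t open file\n"

try:
    sample.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as err:
    # Same class of error as in the report; err.start is the failing offset.
    ascii_failed = True
    error_start = err.start

print(find_non_ascii(sample))
print(safe_decode(sample))
```

To inspect a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to `find_non_ascii`.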

If possible to share the log file, that will have some more info that might be useful.

nextgenusfs commented 3 years ago

I don't see many changes to train in master since the last release (https://github.com/nextgenusfs/funannotate/compare/v1.8.1...master). But I'd still recommend updating the code to master so you don't run into other things I've fixed but haven't yet had time to put into a new release. You can install over the top of your existing conda environment by running:

python -m pip install --no-deps --force git+https://github.com/nextgenusfs/funannotate.git

BenjaminSchwessinger commented 3 years ago

Hi Jon, thanks for the pointers. I updated everything and funannotate test -t rna-seq --cpus 20 ran to completion w/o any issues.

I will try to re-run with my own dataset.

BenjaminSchwessinger commented 3 years ago

Started my run again. Question: do I need to provide RNA-seq data for training when I have a precomputed Trinity assembly? It says -l/r/s is required but doesn't enforce it at the start. When I ran your test dataset, I saw that the reads are used by kallisto to pick the best-supported gene model at each locus. Is this correct?

BenjaminSchwessinger commented 3 years ago

Also the PASA step in train is pretty slow. Is there any way to speed this up a bit?

nextgenusfs commented 3 years ago

Per providing RNA-seq data -- it is preferred, as it will use kallisto to choose the best transcript at each locus. It used to be required, but I think I relaxed that requirement. The "smoothest" way to run it is just to give it your raw RNA-seq reads and let it do its thing (trimmomatic, normalization, trinity assembly, pasa, choose best transcripts for training). The alternative transcripts get added back at the funannotate update step.

PASA is slow because the default is to use SQLite, which isn't multi-thread capable. If you set up a MySQL database with PASA it should be faster; you have to do this manually with your PASA install (however, it only needs to be done once), see https://github.com/PASApipeline/PASApipeline/wiki/Pasa_installation_instructions for details. If you have PASA set up with MySQL, then you can pass --pasa_db mysql. If it's not properly set up, it will likely die.

nextgenusfs commented 3 years ago

And the default is sqlite because many shared compute facilities will not let users run mysql out of security concerns....
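A minimal sketch of how the two database choices might be scripted. It only builds and prints the command line and does not run anything. The flags are the ones quoted in this thread (`--pasa_db mysql` plus the options from the original command); the function name `build_train_command` is hypothetical:

```python
import shlex


def build_train_command(genome, trinity, out, species, strain, cpus=20, pasa_db=None):
    """Build a `funannotate train` argument list from flags quoted in this thread."""
    cmd = [
        "funannotate", "train",
        "-i", genome,
        "--trinity", trinity,
        "--out", out,
        "--species", species,
        "--strain", strain,
        "--cpus", str(cpus),
    ]
    # Leaving pasa_db unset keeps the SQLite default described above.
    if pasa_db is not None:
        cmd += ["--pasa_db", pasa_db]
    return cmd


cmd = build_train_command(
    genome="Awesome.masked.fasta",
    trinity="Trinity-GG.combined.fasta",
    out="train",
    species="Awesome",
    strain="SOMETHING",
    pasa_db="mysql",  # only if PASA has been configured for MySQL
)
print(shlex.join(cmd))
```

On a machine without MySQL, omitting `pasa_db` produces the same command as in the original report.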

BenjaminSchwessinger commented 3 years ago

Thanks for the explanation and quick reply. I have a machine with mysql and an HPC without. Hence splitting out some of the tasks makes sense computationally. I will set up mysql as you suggested on the smaller machine.

For multi-sample RNA-seq, do you suggest simply concatenating all the cleaned reads from all conditions and providing them to funannotate?

Thanks for all the help.

nextgenusfs commented 3 years ago

You can just pass the RNA-seq data to train and it will combine the reads and run normalization prior to Trinity assembly. So reads from as many different conditions as you have can be helpful. Trinity works best with stranded PE; if you have a mixture of data types then it will run in non-stranded mode, which usually doesn't produce the best assembly.

But basically my recommendation is to try the defaults first and see how the results look.
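If reads from several conditions are passed directly, as suggested above, the forward and reverse files need to be listed in matching order. A minimal sketch of that bookkeeping, assuming a hypothetical `_R1`/`_R2` naming convention and hypothetical file names. That `-l` and `-r` each accept several files is an assumption here; confirm it with `funannotate train --help`:

```python
def pair_reads(filenames, r1_tag="_R1", r2_tag="_R2"):
    """Split paired-end FASTQ names into left and right lists in matching order."""
    left = sorted(name for name in filenames if r1_tag in name)
    right = []
    for name in left:
        mate = name.replace(r1_tag, r2_tag, 1)
        if mate not in filenames:
            raise ValueError(f"missing mate for {name}")
        right.append(mate)
    return left, right


# Hypothetical file names for two conditions.
files = [
    "infected_R2.fastq.gz",
    "control_R1.fastq.gz",
    "infected_R1.fastq.gz",
    "control_R2.fastq.gz",
]
left, right = pair_reads(files)

# ASSUMPTION: -l/-r take several files each; check `funannotate train --help`.
args = ["-l", *left, "-r", *right]
print(" ".join(args))
```

Raising on a missing mate is a deliberate choice: a silently unpaired file would shift every later pair out of order.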

hyphaltip commented 3 years ago

Sorry, late to this thread. My recollection is that there were some problems running train with assembled Trinity transcripts only, unless raw reads were also provided. MySQL is much, much faster, so it's worth the effort to set up for a separate run if you can.

BenjaminSchwessinger commented 3 years ago

Thanks all. I have now completed the setup and could train with reads and precalculated Trinity FASTA files. Off to the predict step for now. Thanks for all the help. Much appreciated.

nextgenusfs / funannotate

Issue funannotate train #516