linsalrob / PhiSpy

Prediction of prophages from bacterial genomes
MIT License

No bases were counted for orf #57

Open simone-pignotti opened 2 years ago

simone-pignotti commented 2 years ago

Hello, I occasionally run into this issue when running PhiSpy:

2022-03-28 12:14:53 INFO     Welcome to PhiSpy.py version 4.2.21
2022-03-28 12:14:53 INFO     Starting PhiSpy.py with the following arguments
Namespace(infile='input/genomic.gbff', output_dir='/home/ec2-user/physpy', make_training_data=None, training_set='data/trainSet_genericAll.txt', list=False, file_prefix='test', evaluate=False, number=5, min_contig_size=5000, window_size=30, nonprophage_genegaps=10, phage_genes=1, metrics=['orf_length_med', 'shannon_slope', 'at_skew', 'gc_skew', 'max_direction'], randomforest_trees=500, expand_slope=False, kmers_type='all', phmms='/home/ec2-user/VOGs.hmms', include_annotations=True, ignore_annotations=False, color=True, threads=4, output_choice=512, include_all_repeats=False, keep_dropped_predictions=False, extra_dna=2000, min_repeat_len=10, log='/home/ec2-user/physpy/test_phispy.log', quiet=False, keep=False, logger=<Logger PhiSpy (Level 5)>)
2022-03-28 12:14:54 INFO     Processing 14 contigs
2022-03-28 12:14:54 INFO     Running HMM profiles against /home/ec2-user/VOGs.hmms
2022-03-28 12:14:54 INFO     hmmsearch: writing the amino acids to temporary file /home/ec2-user/physpy/tmpio1svuew
2022-03-28 12:14:54 INFO     Searching 2613 proteins with hmmsearch.
2022-03-28 12:18:15 INFO     Completed running HMM profiles against /home/ec2-user/VOGs.hmms
2022-03-28 12:18:15 INFO     Making Testing Set...
2022-03-28 12:18:17 INFO     a total of zero total_at*total_gc
No bases were counted for orf {'start': 507191, 'stop': 508927, 'phmm': 0.18568636235841013, 'peg': 'peg', 'is_phage': 0} from 507191 to 508927
This error is usually thrown with an exceptionally short ORF that is only a  few bases. You should check this ORF and confirm it is real!

I can't find anything unusual in the ORF that throws the exception. The error makes the entire run fail, which is not what I would expect, given that other problematic ORFs are simply ignored with a warning (e.g. when multiple ORFs share the same ID, all but the first are discarded). Would it be possible to get more details about what may be triggering the error, and possibly to convert it to a warning in future versions of PhiSpy? Unfortunately, I cannot share the input file, and I haven't managed to reproduce the error on a small example, but running PhiSpy on many random genomes downloaded from NCBI should let you reproduce it. Let me know if I can help in any other way. Thanks for maintaining this great tool!
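To illustrate the behaviour requested here, this is a minimal sketch of a skew calculation that warns and skips an ORF instead of aborting when no bases are counted. It is not PhiSpy's actual code: the function names `skews` and `orf_skews` are hypothetical, and it assumes that ORF coordinates are 1-based and inclusive and that the "zero total_at*total_gc" message in the log refers to the A+T and G+C counts of the ORF sequence.

```python
import warnings
from collections import Counter


def skews(seq):
    """Return (at_skew, gc_skew) for a DNA string, or None if A+T or G+C is zero."""
    counts = Counter(seq.upper())
    total_at = counts["A"] + counts["T"]
    total_gc = counts["G"] + counts["C"]
    # Mirrors the "zero total_at*total_gc" condition from the log (assumption).
    if total_at * total_gc == 0:
        return None
    at_skew = (counts["A"] - counts["T"]) / total_at
    gc_skew = (counts["G"] - counts["C"]) / total_gc
    return at_skew, gc_skew


def orf_skews(contig_seq, orfs):
    """Compute skews per ORF; warn and skip ORFs with no countable bases."""
    results = []
    for orf in orfs:
        # Assumed 1-based inclusive coordinates; sorted() tolerates reverse-strand ORFs.
        start, stop = sorted((orf["start"], orf["stop"]))
        result = skews(contig_seq[start - 1:stop])
        if result is None:
            warnings.warn(
                f"No bases were counted for orf from {orf['start']} to "
                f"{orf['stop']}; skipping it"
            )
            continue
        results.append((orf, result))
    return results
```

With this shape, an ORF whose sequence is empty or consists only of ambiguous bases (for example a run of `N`) would be reported and dropped, while the remaining ORFs on the contig are still processed. Whether that is what actually triggers the error in this run is not known.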

simone-pignotti commented 2 years ago

PS: this has already been described in #54, but since the main topic of that issue was different, I figured this deserved its own. Feel free to merge them if you disagree!