mtisza1 / Cenote-Taker2

Cenote-Taker2: Discover and Annotate Divergent Viral Contigs (Please use Cenote-Taker 3 instead)
MIT License

Hangs with no feedback #6

Closed DarrenObbard closed 3 years ago

DarrenObbard commented 3 years ago

Hi,

I'm playing with Cenote-Taker2 for the first time, and (as far as I can tell) it keeps hanging: i.e. simply stopping execution with no feedback or continued output or execution. There are a couple of errors thrown, but no indication as to what might cause them or what the solution might be.

The command looks like this

python ~/apps/CenoteTaker2/run_cenote-taker2.py -c LongWebster.fasta --known_strains blast_knowns --blastn_db /data/BLAST_databases/nt -r WebsterMelRebuild -m 150 -t 40 -p False /data/home/dobbard/apps/CenoteTaker2

and things start well

######################################################################


prodigal found
BWA found
samtools found
mummer found
circlator found
blastp found
blastn found
blastx found
rpsblast found
bioawk found
efetch found
ktClassifyBLAST found
hmmscan found
bowtie2 found
tRNAscan-SE found
pileup.sh found
tbl2asn found
getorf found
transeq found
@@@@@@@@@@@@@@@@@@@@@@@@@
Your specified arguments:
original contigs:                  LongWebster.fasta
forward reads:                     /data/home/dobbard/scratch/test_cenote/no_reads
reverse reads:                     /data/home/dobbard/scratch/test_cenote/no_reads
title of this run:                 WebsterMelRebuild
Isolate source:                    unknown
collection date:                   unknown
metagenome_type:                   unknown
SRA run number:                    unknown
SRA experiment number:             unknown
SRA sample number:                 unknown
Bioproject number:                 unknown
template file:                     /data/home/dobbard/apps/CenoteTaker2/dummy_template.sbt
minimum circular contig length:    1000
minimum linear contig length:      1000
virus domain database:             standard
min. viral hallmarks for linear:   1
min. viral hallmarks for circular: 1
handle known seqs:                 blast_knowns
contig assembler:                  unknown_assembler
DNA or RNA:                        DNA
HHsuite tool:                      hhblits
original or TPA:                   original
Do BLASTP?:                        no_blastp
Do Prophage Pruning?:              False
Filter out plasmids?:              True
Run BLASTN against nt?             /data/BLAST_databases/nt
Location of Cenote scripts:        /data/home/dobbard/apps/CenoteTaker2
Location of scratch directory:     none
GB of memory:                      150
number of CPUs available for run:  40
Annotation mode?                   False
@@@@@@@@@@@@@@@@@@@@@@@@@
scratch space will not be used in this run
HHsuite database locations:
/data/home/dobbard/apps/CenoteTaker2/NCBI_CD/NCBI_CD
/data/home/dobbard/apps/CenoteTaker2/pfam_32_db/pfam
/data/home/dobbard/apps/CenoteTaker2/pdb70/pdb70
/data/home/dobbard/scratch/test_cenote/LongWebster.fasta
time update: locating inputs:  03-11-21---09:01:43
/data/home/dobbard/scratch/test_cenote/LongWebster.fasta
File with .fasta extension detected, attempting to keep contigs over 1000 nt and find circular sequences with apc.pl
WebsterMelRebuild121.fasta has DTRs/circularity
WebsterMelRebuild189.fasta has DTRs/circularity
WebsterMelRebuild249.fasta has DTRs/circularity
WebsterMelRebuild643.fasta has DTRs/circularity
WebsterMelRebuild652.fasta has DTRs/circularity
WebsterMelRebuild88.fasta has DTRs/circularity
no reads provided or reads not found
Circular fasta file(s) detected

Putting non-circular contigs in a separate directory
time update: running IRF for ITRs in non-circular contigs 03-11-21---09:02:22
time update: running prodigal on linear contigs  03-11-21---09:02:24
time update: running linear contigs with hmmscan against virus hallmark gene database: standard  03-11-21---09:02:39
time update: Calling ORFs for circular/DTR sequences with prodigal  03-11-21---09:02:49
time update: running hmmscan on circular/DTR contigs  03-11-21---09:02:50
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
 Grabbing ORFs wihout RPS-BLAST hits and separating them into individual files for HHsearch
time update: running HHsearch or HHblits  03-11-21---09:02:50
 Combining tbl files from all search results AND fix overlapping ORF module
No ITR contigs with minimum hallmark genes found.
Annotating linear contigs
time update: running BLASTX, annotate linear contigs  03-11-21---09:02:50
time update: running Prodigal, annotate linear contigs  03-11-21---09:04:32
time update: running hmmscan1, annotating linear contigs  03-11-21---09:04:34
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: expression for `>>' redirection has null string value
time update: running hmmscan2, annotating linear contigs  03-11-21---09:04:35
cat: SPLIT_DTR_HMM2_GENOME_AA_*AA.hmmscan2.out: No such file or directory

###################################################################################

But the failed awk and the failed cat suggest something is going wrong. At this point it appears nothing is running, so I am suspicious that cat is attempting to read from stdin because there was no file?
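A quick way to test the stdin suspicion, as a sketch only. An unmatched glob is passed to cat literally and fails at once (which is what the cat error above shows), but a command whose file argument expands to nothing will sit waiting on the terminal. Taking stdin from /dev/null turns that silent wait into an immediate end-of-file. The names `count_lines` and `MISSING` are hypothetical and do not come from the Cenote-Taker 2 script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the lines in the files given as arguments.
count_lines() {
  # With an empty argument list, cat would read stdin and wait for input;
  # taking stdin from /dev/null makes it see end-of-file immediately.
  cat "$@" < /dev/null | wc -l | tr -d '[:space:]'
}

MISSING=""   # stands in for a file variable that an earlier step left empty
# Unquoted on purpose: the empty value expands to no argument at all.
n=$(count_lines $MISSING)
echo "lines read: $n"
```

The same redirection can be applied to a whole run by appending `< /dev/null` to the command line, so that any step falling back to stdin ends instead of hanging.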

also, the missing file requested in line 547

sed 's/ /#/g' $REMAINDER | bioawk -c fastx '{print ">"$name"#DTRs" ; print $seq}' | 's/#/ /g' >> other_contigs/non_viral_domains_contigs.fna

doesn't bode well.
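The message is consistent with the last pipeline stage having no command name, so the shell tries to execute the string `s/#/ /g` itself rather than failing to find a data file. A minimal sketch of what was presumably intended, assuming the missing word is `sed`. Plain awk stands in for bioawk so the snippet runs without it, the function name `tag_dtr` is hypothetical, and only single-line FASTA records are handled.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for line 547: protect spaces in headers with '#',
# append a '#DTRs' tag to each header, then turn every '#' back into a space.
tag_dtr() {
  sed 's/ /#/g' "$1" \
    | awk '/^>/ { print $0 "#DTRs"; next } { print }' \
    | sed 's/#/ /g'   # the quoted line has this script but no "sed" before it
}

tmp=$(mktemp)
printf '>contig1 len=10\nACGTACGTAC\n' > "$tmp"
out=$(tag_dtr "$tmp")
rm -f "$tmp"
echo "$out"
```

One caveat of this round trip: any '#' already present in a header is also converted to a space.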

DarrenObbard commented 3 years ago

Tried it again with a slightly different (larger) test file. Again it seems to hang, but at a different point. I say 'hang' because although it hasn't thrown an error, there appears to be nothing happening (no tasks running, no memory being allocated)

File with .fasta extension detected, attempting to keep contigs over 1000 nt and find circular sequences with apc.pl
WebsterMelRebuild1021.fasta has DTRs/circularity
WebsterMelRebuild1022.fasta has DTRs/circularity
WebsterMelRebuild1066.fasta has DTRs/circularity
WebsterMelRebuild1502.fasta has DTRs/circularity
WebsterMelRebuild1591.fasta has DTRs/circularity
WebsterMelRebuild1757.fasta has DTRs/circularity
WebsterMelRebuild1758.fasta has DTRs/circularity
WebsterMelRebuild1964.fasta has DTRs/circularity
WebsterMelRebuild2062.fasta has DTRs/circularity
WebsterMelRebuild2440.fasta has DTRs/circularity
WebsterMelRebuild2522.fasta has DTRs/circularity
WebsterMelRebuild2523.fasta has DTRs/circularity
WebsterMelRebuild2524.fasta has DTRs/circularity
WebsterMelRebuild2525.fasta has DTRs/circularity
WebsterMelRebuild2526.fasta has DTRs/circularity
WebsterMelRebuild2742.fasta has DTRs/circularity
WebsterMelRebuild3594.fasta has DTRs/circularity
WebsterMelRebuild3595.fasta has DTRs/circularity
WebsterMelRebuild3596.fasta has DTRs/circularity
WebsterMelRebuild3671.fasta has DTRs/circularity
WebsterMelRebuild378.fasta has DTRs/circularity
WebsterMelRebuild4581.fasta has DTRs/circularity
WebsterMelRebuild4643.fasta has DTRs/circularity
WebsterMelRebuild4835.fasta has DTRs/circularity
WebsterMelRebuild4861.fasta has DTRs/circularity
WebsterMelRebuild649.fasta has DTRs/circularity
WebsterMelRebuild651.fasta has DTRs/circularity
WebsterMelRebuild885.fasta has DTRs/circularity
no reads provided or reads not found
Circular fasta file(s) detected

Putting non-circular contigs in a separate directory
time update: running IRF for ITRs in non-circular contigs 03-11-21---09:25:34
time update: running prodigal on linear contigs  03-11-21---09:25:42
time update: running linear contigs with hmmscan against virus hallmark gene database: standard  03-11-21---09:27:10
time update: Calling ORFs for circular/DTR sequences with prodigal  03-11-21---09:27:55
time update: running hmmscan on circular/DTR contigs  03-11-21---09:27:56
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory
 Grabbing ORFs wihout RPS-BLAST hits and separating them into individual files for HHsearch
time update: running HHsearch or HHblits  03-11-21---09:27:57
 Combining tbl files from all search results AND fix overlapping ORF module
No ITR contigs with minimum hallmark genes found.
Annotating linear contigs
time update: running BLASTX, annotate linear contigs  03-11-21---09:27:57
time update: running Prodigal, annotate linear contigs  03-11-21---09:31:18
time update: running hmmscan1, annotating linear contigs  03-11-21---09:31:20
time update: running hmmscan2, annotating linear contigs  03-11-21---09:31:22

Been sitting at this point for 2 hours, with no tasks being executed (as far as I can guess, from htop)

mtisza1 commented 3 years ago

Hi Darren,

Thanks for reaching out, and I'm sorry that it's hanging on you. I'm working to figure out what's happening. Just to be sure that it's not an issue with your input fasta files (weird headers?), can you run the test contigs that are provided with the repo (e.g. testcontigs_DNA_ct2.fasta)? In the meantime I'll try to replicate this error.

Mike

DarrenObbard commented 3 years ago

Hi! Thanks for getting back to me so fast!

I'm hoping that cenote-taker2 will revolutionize my workflow (or perhaps just replace a post-doc)

My input is Trinity output from a few years ago ... [my understanding is that fasta makes no stipulation except that names start with a ">" followed by any characters at all, then a newline before sequence, and sequence continues until the next '>' ]

The test file turns up a new error, suggesting a library problem. I'm using the supplied conda environment on a pretty clean new Linux install (Scientific Linux, a Red Hat derivative).

I recently ran into this in another context - https://github.com/merenlab/anvio/issues/1479

when I was trying to set up a conda environment for the newest Trinity and Samtools, and it took an age to resolve - possibly because of a version conflict?

time update: running IRF for ITRs in non-circular contigs 03-11-21---14:07:15
time update: running prodigal on linear contigs  03-11-21---14:07:15
time update: running linear contigs with hmmscan against virus hallmark gene database: standard  03-11-21---14:07:17
time update: Calling ORFs for circular/DTR sequences with prodigal  03-11-21---14:07:20
time update: running hmmscan on circular/DTR contigs  03-11-21---14:07:20
Annotating DTR contigs
Traceback (most recent call last):
  File "/data/home/dobbard/miniconda3/envs/cenote-taker2_env/bin/circlator", line 57, in <module>
    exec('import circlator.tasks.' + task)
  File "<string>", line 1, in <module>
  File "/data/home/dobbard/miniconda3/envs/cenote-taker2_env/lib/python3.6/site-packages/circlator/__init__.py", line 26, in <module>
    from circlator import *
  File "/data/home/dobbard/miniconda3/envs/cenote-taker2_env/lib/python3.6/site-packages/circlator/bamfilter.py", line 2, in <module>
    import pysam
  File "/data/home/dobbard/miniconda3/envs/cenote-taker2_env/lib/python3.6/site-packages/pysam/__init__.py", line 5, in <module>
    from pysam.libchtslib import *
ImportError: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/data/home/dobbard/miniconda3/envs/cenote-taker2_env/bin/circlator", line 57, in <module>
    exec('import circlator.tasks.' + task)
  File "<string>", line 1, in <module>
  File "/data/home/dobbard/miniconda3/envs/cenote-taker2_env/lib/python3.6/site-packages/circlator/__init__.py", line 26, in <module>
    from circlator import *
  File "/data/home/dobbard/miniconda3/envs/cenote-taker2_env/lib/python3.6/site-packages/circlator/bamfilter.py", line 2, in <module>
    import pysam
  File "/data/home/dobbard/miniconda3/envs/cenote-taker2_env/lib/python3.6/site-packages/pysam/__init__.py", line 5, in <module>
    from pysam.libchtslib import *
ImportError: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
 Grabbing ORFs wihout RPS-BLAST hits and separating them into individual files for HHsearch
time update: running HHsearch or HHblits  03-11-21---14:07:24
 Combining tbl files from all search results AND fix overlapping ORF module
No ITR contigs with minimum hallmark genes found.
Annotating linear contigs
time update: running BLASTX, annotate linear contigs  03-11-21---14:07:24
time update: running PHANOTATE, annotate linear contigs  03-11-21---14:07:52
time update: running Prodigal, annotate linear contigs  03-11-21---14:07:56
time update: running hmmscan1, annotating linear contigs  03-11-21---14:07:57
time update: running hmmscan2, annotating linear contigs  03-11-21---14:07:57
time update: running BLASTN, linear contigs  03-11-21---14:08:00
Internal3.blastn.out not found
Internal4.fna  is closely related to a virus that has already been deposited in GenBank nt.
time update: running RPSBLAST, annotating linear contigs  03-11-21---14:11:47
/data/home/dobbard/scratch/test_cenote/Internal/no_end_contigs_with_viral_domain/COMBINED_RESULTS.rotate.AA.rpsblast.out
time update: running tRNAscan-SE  03-11-21---14:12:04
 Grabbing ORFs wihout RPS-BLAST hits and separating them into individual files for HHsearch
time update: running HHsearch or HHblits  03-11-21---14:12:05
/data/home/dobbard/scratch/test_cenote/Internal/no_end_contigs_with_viral_domain/Internal.rotate.out_all.hhr
 Combining tbl files from all search results AND fix overlapping ORF module, linear contigs
finalizing taxonomy for linear contigs
time update: finished annotating linear contigs  03-11-21---14:12:27
time update: running tbl2asn  03-11-21---14:12:28
[tbl2asn] This copy of tbl2asn is more than a year old.  Please download the current version.
[tbl2asn] Flatfile Internal3

[tbl2asn] Validating Internal3

[tbl2asn] Flatfile Internal4

[tbl2asn] Validating Internal4

Making gtf tables from final feature tables
removing ancillary files

time update: Finishing  03-11-21---14:12:28
Virus prediction summary:
4 virus contigs were detected/predicted. 2 contigs had DTRs/circularity. 0 contigs had ITRs. 2 were linear/had no end features
grep: DTR_contigs_with_viral_domain/DTR_seqs_for_phanotate.txt: No such file or directory
grep: DTR_contigs_with_viral_domain/DTR_seqs_for_phanotate.txt: No such file or directory
output directory: Internal
 >>>>>>CENOTE-TAKER 2 HAS FINISHED TAKING CENOTES<<<<<<

mtisza1 commented 3 years ago

Hmmm. OK, based on the Anvio issue you referenced, maybe something is bugging out with circlator, and you can try to reinstall it like this?

conda install -c bioconda circlator=1.5.5 --force-reinstall

I sometimes regret having so many packages installed with Cenote-Taker 2 because if one of them breaks, the whole thing breaks. But I also didn't want to reinvent the wheel...

On the other hand, it seems like the error regarding "line 547: s/#/ /g" is no longer occurring with the provided test contigs, making me believe that Cenote-Taker 2 was mishandling the fasta header from your original runs. Could you do me a big favor and send some of the fasta headers from these files:

grep ">" LongWebster.fasta | head

grep ">" LongWebster.over_1000nt.fasta | head

DarrenObbard commented 3 years ago

Without fixing the libcrypto.so.1.0.0 problem, I have cleaned up my sequence titles (no funny characters at all!) and it hangs in the same place as before.

It seems to die during

time update: running hmmscan2, annotating linear contigs  03-11-21---14:30:43

And this seems to be the last sequence it was looking at when it stops:

>CleanWebster15 TR29739c0_g2_i1_len5765
CAGAGCTAGATTTTATTGCGGTACAATATTATTATCACGAATGTTTAAACAAGATTTACAATTTGAAGAAAATGGAATTAAACCCTATGTGTGTGTTAGAAATGGAGGAAATCGAGCATGTTGCCGATTTTACCCCTTTTCATATAGAGAATACGCCACATTTGAGTATAAATTCCCTTTTGACCCGGAGGTTGAAGAGGAAAATGAAGAAGTGGTATCAACGAACTATTTTGCCATGTTGGCAGAATTTGTCTTGGGTATCAGTTATTTTTGTGTGCTTTATCAGCTACTTACTATATCGTCGTGGAAGAAGATTT
ATAGTGGGTATGCCCAGTGCGCAAGCAAAGAGAGATGTTACAAACTCGTGGTCGCATCTGAGAAAGGAGTTTTAGAAGTAGAAGTGAAGGGATATCATAAGCGTACTATTAAGCAATTTGCTAATTATATGGCTGTGGTAATTTTGAAGGAATATTTGACTAAAGAACAGGTACAGCAAATGTTATTTTATTATTCTAATATATTTGCATATGATGATGATATTTGTGAAGTGCAGGCAGAAAATTCTCACCCGAAAGAATCGGTTCAGGGTGAGGAAGTTTTGACAGGTACAAAACATAGTAATACTATTTTAACT
AATAGTACAGGAGATACAGAGAGTATACCTCTAGCAATTAGAGATGATACTTTGAATTACGCCTCGAGCGAAGCCTTACATCAATTTGATAGTTTAACTGATAGATGGATGCCGTTAGAAACAATAACAGTTACTACATCACAGATTTCTGGTACACTATTAAAGGAATGGTATTTACCATATGATTTGTTGCAATCTCATATTATAAATCCGAGTTTAGCTCCATTTATGCTATTTCGCTACGGTGCTTTATCAATAGAGATGAAATTTGTAGTGAACGCTCACAAATTTCAATCCGGTAAAGCCTTAGCGAGCAT
TAAGTATGATCCAGTCGGTTTAACAGATTTTGGTGATTCATTACCTACATGTTTGCAACGAGAGCACGTGATGTTAGACTTATCTACTAATAATCAAGGAACATTGCAAATTCCTTTTATTTACCATCGTTCGTTCTTGCATTTAAATTTGCAGCAAGGTACAGATCAAACCATGGTACCATCCACATATGCTAGAGTACAGTTACACATCCTGGCCAATTTATTAACAGGAACTAATCAAGCAGTTAGCATGAACATCCGTCCTTATTATCGCTTCTCGAAAGCTTCATTTGCTGGAATGGAAGCAGTTCATACTG
TCCAGATGGATGTGGATGCAGTTGTAAAGGGATTAATACCAACAAAATCATTGAAAGCGGTGTTAGTTGGCGCAGAGGCTCTTATAGATCAATTAGGGAAGACTTGCAACCAGGACAAGCCTACAATTACTTCTTCCACTCAAATTGTTCCGAAACCCCGCAGTCAGTTTGCATCAGGAAAGGGGATTTTCGGAGGAACAGTTCTGAGATTAAATCCGCAGGTAATCACGTCTGCAGTTGAAGTGAAACAATCATCACGTACCCCTAGAACTGTACTGGATATAGCTAGAGTATGGGGATTGAAGAAAATTATGACG
TGGACTACGAATGCTAAACCAGATGAGCACCTTGATGATATTGTGGTTGATTTGCACCATAATTTTAAAGGGGGTAATGATCGTATTGAAGCAAATATATTGACTCCAGTTGAATATATAGCGTCTTTATATGGATTTTGGTCAGGGACATTAGAATGTAGGTTGGACTTTATATCCAATCAATTTCACACTGGTGCTATTATGATCAGTATACAAGTATCAAATCAAGAGACAAAATTTCAAAAGGCGGCTTGTGTATATACTAAAATTTTCCATTTGGGGGGTCAGAAAAGCGTCACATTCACCATTCCTTATAT
ATACGATACTATATGGCGTCGTAACACAGCTCAAATATTTACACCTTACACGTTTGAGCAAGATAATAAACTCCCTGTAGATCATATATTTACACTCGGTACGAATGATTTTATGAGAATCCAATTTTATGTTGTTAATGAATTACGAGCTCCAGATACAGTAGCGAATGTAGTTCAAATATTAGCTTATGTACGTGCGGGGACTAGTTTTATGTTACATTCTTTAAAACCGTCGCATTTGGAAGTTATACAGGACATAGCTCTTTTTAGAGACATACCTATGTTTAATGTACCTCATTTGGCACCTAAATCTTATA
TAACTAAGTCTGAGGAAAAACACATCAAGTTAACGAAAGAACTAACACTGGAGTATAAAGAAATCAAGTTTCAGATGGAAGGCTCCTTAGCTGAGAATCCAGATGAAACTCCTGATTTTAGTGCGGGTTTGAATGCTTTGCATATACAAACTTTAGATTCTCAAGTTAATATAAAGGATATTTTAAGGCGTCCTATACAGTTAACAAAAGCTATATCTTTTAGTAATACTGAAATAAAGAATCATGTATCTCTTTTTATCCCTTTAATGGTCCCATCTCATAATATGGTATATTCGGATAGTTATGAAACCATATAT
GCGGATGGAGTTTCCCTTACACCAACCGCTATGCTAATGAATTTATTTCGTTTTTGGCGAGGTAGTATGCGTTTTACCTTTGTTGTAAACGATAATGTATCCAAGAATTGTACACATTGGATAACTCACATGCCCCATTCGGGAGTTCGGAAAATTGGAAAGATTGAATTTCCAAAAGGTCCGAGTTTAGTTGGATCATCATTTGCTAGTGTCCCACTAGTCGCCAACATCAACGCGACGGAATGTGTCGAGGTACCCTATGATACGGAATTAAACTGGACGCTGTGTCATTCAGCTCGAAATAACCAAATCTTATC
AGTAAGAGATCAAACAGATACTAATGCAGGACATATAGTATTTACACCATCTGGTACATGTGATGTTACAGTGTGGTGGGAAGCTGGGGACGATTTTGAATATGAGAATTTCTTAGGAGTTCCGGCTACCATCACACGGGATCGTTTGCACGGTGTATACGAAACGGAAATTAAATTCCAAGCAGAAACATCAATGTATTCCAAAACCCTTGCGAAAGTGAATACTATAATAAATTTGCCAGAGCAGATAGCAGATACATTAACGAATGCTAATAATGTTGGTGACGCTATTATAGCGAGTTCTACGAAAGCAGAAA
AATTATTAGTCAAAGGGTTAGAAGTGTGCGAGAATGCATCAGCTATGTTAGATAATATTTCTCCTTTGATGGAATCTTTAGAGGAAAAAATTCGGGAATCCTTAAAATCATTTCCTGGAAGTATTTATAATTCTACAATGTTTATTCAAAATGGGGTTGAAATTATAATGGATTTAGTTGTCGCTTGGTTATCTGAATCGTGGGCCGTACTTGGTAATATTTTCGTCAAAGCTATAGCACGGTTGCTGGGATTTAGTGCCATACAAACTATTTTGAAGTACGGTTCCCAAATAGCCGCTGCTATTCGTAATCTGGTG
AACCCACAAATAGTAGTTCAGGCTCCATCGCAAAATGTCACATTATTGGGAGTATTATGTGGTTTAGTAGGTACAGTAGTGGGTGTATCTCTGGAAACCCAAAATTATTCTAAGTTTATTTATAAATTGTCTGAAAGATTTGTGACAACTGGGGGTATAGCTTATCTTAATCAAGTCTTACGGTTTGTGCAGAGTACCTTTGAAGTTATTCGTGACTTGGTGATGGATGCCCTTGGTTACGCTGATCCTAATGTAAAGGCTTTACAGATGCTCAGTAAAGATACAGGTGTAATTAGCACATTTGTAAAGGAGGCTAA
TGTCATATTAAGTGAAGCGAACGCCTCATTATTGTCAGATCCCGGTTTTCGTAAACGTTTTTGGTACACTGTGTCTCAGGCATACCAAATTCAATCAATTCTAGCCGTGAGTCCTGCGAATGTAGTTTCACCCATTGTGACTCGTTTATGTACCGATGTCATAAAAGCATCGAGTGAAAAGTTCATGGACTTATCGTGTAGTCCTTGTCGCTACGAACCATTTGTGATTTGTATAGAGGGTGAACCTGGTATAGGAAAATCTTTTATGACAGAGACCATGGTTTCCGAATTGCTTGGATCAATTGGTTTCGATCGTC
CATCCAGTGGCTTAATTTACACTCGGCCTCCTGGAGCACGATTCTGGTCAGGATATAAAAATCAGCCTGTAGTTGTTTATGATGATTGGATGAATTTGAACGATTCAGACCAAATACTGAGTCAGTTAAGTGAATTGTACCAGATGAAATCAACTAGTGATTTCATTCCAGAAATGGCTCACTTAGAAGAAAAGAAAATCAAAGCGAACCCTTTAATTGTCGTGCTATTGTGTAATGGTGCATTCCCCTCGTGTATAGGTCAAAAAGCGATTTATCCTGATGCTATTTTCAGACGTCGAGACTTAGTTTTGCGAGCC
TCTCTGAAGGAAGAATGGGTAGGAAAAGATTTACGCGACCTAACTGATAGTGAATCAGCTGAGTGTGGACATCTATTGTTTCAACGATATACTAGTGCGAAAATTGAGAATAGTTTAACCACAGCTCAAAAGACCTGGTCTGAAGTAAAACCTTGGTTGTGTGCCACATATAAACGCTACCACCAACAAGAAACACTTTTAGTACGTAAAAGAATTAAAAAGTTTCAAACTCAGATGCGTTTAAATAGTGAGAATTATCTAGACTATTCAGATCCTTTTTCTCTATTCTACACTAGCACCATTGATGTTATGGAAGA
CTCTGAGTGTAATCCTAATGGGTGGTTACCTAGTGAACAATTGGAGGCAGCTGTGTTGAGAGTTGTTGATATAATAAAGGAGAAGAAGGACGAAGTATTGGAATTTCATATAGATTCTAAACCTGAAAACGTCTTTCAGGGCTTTCCGGTGGGATGGGAAGATCTATCAATGAGCTTAACTAGTGGTATACTTTTTAGTGGAGGTGTTATGGCGCAAGTTTTAGACTGGACCGCTCAGGGTATAGGAGCTTTCATGAAACCACTATTAGAAAGTACGGGTCAGAGTATAGAACACGAGTGTATGACATGTCTTGAGC
AAATGCCCTGTTACTACGTATGTGGAGGTGTGCGTTCCCACTCTAACCCCAAAGCTCATCATTACATGTGCATGGATTGTATGATTCGCATGAAGCGAGCTAATATGGGTTCTCACTGTCCCATGTGTCGTGTAGAGCCTATGCTAGCTTGTTTACCTAAACATCTAACTCGCTTGTATATAGTGTTACGTTGGGCGTTGGTTAATGTTAGTGATAGATTAGTATGGATTTTTGCATTCTTTAGGGATTTTCTCCGTTCAAGGTCTATGGTAAATTCACGCTTATTATTATCTACCCTGGCATCATTAACTGCATTC
TTACAGGGCGATGGTATTACAACTACCATTGCTGCTTCATATGTAGGGGCAAGTGTGGTAGATGCTATATATGATCCAGAATTATTTACTAATGTAGCACAATCCTGGATATTTAACCCCTTGGATATGTTAGTTCCTTCAGAAGAATATTACACGCCTCCTTCGGAAATAATAAACGCTAGCGTGCAATGCATGCAGTTTGAAAGTCTTGGGCAGAGAGAGGTTGGTTGTAGCAACCTTGAGCCGGAGAAAGATTCATGGGATGTACTTACTCCTAAAGAAGAGGCTATACTTCGTTGTGAACGCAATAAGAACAA
AATGGATACTGCCTTAGTTATAAACAAAGCAGAACTCGAAAATATTCGAAAGAAGCGGG

after successfully writing a blank file called "CleanWebster15.all_called_hmmscans.txt"

I'm trying one on this sequence alone ....

DarrenObbard commented 3 years ago

Part I - sequence names

The old-style Trinity headers had a nasty '|' , but also '=' and '[' and ']' and ' '

TR29739|c0_g2_i1 len=5765 path=[11551:0-1439 11555:1440-3480 11548:3481-5764] [-1, 11551, 11555, 11548, -2]

but I've cleaned this to

TR29739c0_g2_i1_len5765
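For anyone wanting to do the same clean-up, here is a sketch with sed that reproduces the transformation shown above for this one header style. The function name `clean_headers` is hypothetical, and the rules (drop everything from " path=", remove '|' and '=', turn spaces into '_') are inferred from the single before/after pair, so other Trinity versions may need different patterns.

```shell
#!/usr/bin/env bash
# Hypothetical helper: simplify old-style Trinity headers in a FASTA file.
# Only lines starting with '>' are edited; sequence lines pass through as-is.
clean_headers() {
  sed -E '/^>/ {
    s/ path=.*$//
    s/[|=]//g
    s/ /_/g
  }' "$1"
}
```

Usage would be along the lines of `clean_headers LongWebster.fasta > CleanWebster.fasta`, where the output file name is just an example.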

Run on its own, the sequence above is OK, so maybe that wasn't the cause ...

Part II - circlator

looks promising:

Solving environment: done

## Package Plan ##

  environment location: /data/home/dobbard/miniconda3/envs/cenote-taker2_env

  added / updated specs:
    - circlator=1.5.5

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2020.12.5          |   py36h5fab9bb_1         143 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         143 KB

The following packages will be UPDATED:

  certifi            pkgs/main::certifi-2020.12.5-py36h06a~ --> conda-forge::certifi-2020.12.5-py36h5fab9bb_1

The following packages will be SUPERSEDED by a higher-priority channel:

  ca-certificates    pkgs/main::ca-certificates-2021.1.19-~ --> conda-forge::ca-certificates-2020.12.5-ha878542_0

but no,

ImportError: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory

mtisza1 commented 3 years ago

OK, I believe I've figured out at least one issue. Thank you for bearing with me here.

The circlator issue may actually be a pysam issue, per this issue.

Can you check your pysam version (it should be 0.15.3) and update it if necessary?

$ conda list | grep "pysam"
pysam                     0.15.3           py36hda2845c_1    bioconda

conda install -c conda-forge -c bioconda pysam==0.15.3
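After reinstalling, a direct import test inside the activated environment shows whether the shared-library problem is gone, without running the whole pipeline. This is only a sketch: `check_import` is a hypothetical helper, and it assumes `python3` on the PATH is the environment's interpreter.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a Python module imports cleanly in the
# active environment. A missing shared library such as libcrypto.so.1.0.0
# shows up here as an ImportError.
check_import() {
  local err
  if err=$(python3 -c "import $1" 2>&1); then
    echo "$1: ok"
  else
    # Show only the final traceback line, which names the actual problem.
    echo "$1: import failed: $(printf '%s\n' "$err" | tail -n 1)"
    return 1
  fi
}
```

Running `check_import pysam` and then `check_import circlator` should separate a pysam problem from a circlator one.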

The other issue may have to do with a problem on my end that I've possibly fixed. The Trinity headers were not the issue. You've got RNA virus contig(s) where the whole contig is covered by an ORF that may not have a start and stop codon. I had incorrectly coded prodigal to use -c (closed ends) for this step, which requires start/stop codons. The program is expecting at least 1 ORF, and it's not there due to this setting. I should have tested these types of contigs before releasing the update! If you do cd Cenote-Taker2 then git pull, I think this should fix it. If you forgo the blastn step when you test this, you should get quicker results.
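To illustrate the failure mode described above: when the gene caller emits no proteins for a contig, later steps that assume at least one ORF have nothing to work on. A defensive check might look like the sketch below. It is not the fix that went into the repository, and `count_records` and `has_orfs` are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count FASTA records (header lines) in a file.
count_records() {
  # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
  # keeps that from aborting a script that runs under "set -e".
  grep -c '^>' "$1" || true
}

# Hypothetical helper: succeed only if the protein file has at least one ORF.
has_orfs() {
  [ "$(count_records "$1")" -gt 0 ]
}
```

A caller could then write `has_orfs contig.AA.fasta || echo "no ORFs called, skipping contig"` (the file name is an example) and move on rather than wait on an input that never arrives.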

Let me know if this helps.

DarrenObbard commented 3 years ago

Hi! Fantastic, thank you.

The pysam was indeed the issue, and the test file now runs happily!

My own trial dataset (with the long ORF that lacks a start or stop, and the nasty headers) now runs to completion!

But there are still some things that worry me ...:

This still happens:

/data/home/dobbard/apps/CenoteTaker2/cenote-taker2.1.1.sh: line 547: s/#/ /g: No such file or directory

And when running blastn, what do lines like this imply?

MediumWebster1462.blastn.out not found

Is it just a virus / phage not in nt?

Then I get some hits that report like this:

cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura; Cyclorrhapha; Schizophora; Acalyptratae; Ephydroi$
ea; Drosophilidae; Drosophilinae; Drosophilini; Drosophila; Sophophora; melanogaster group
; PREDICTED: Drosophila bipectinata twitchin (LOC108134366), transcript variant X2, mRNA
; PREDICTED: Drosophila bipectinata twitchin (LOC108134366), transcript variant X1, mRNA
; PREDICTED: Drosophila ananassae twitchin (LOC6501771), transcript variant X6, mRNA

What's the cause of this?

Then at the end I get a lot of this:

Virus prediction summary:
50 virus contigs were detected/predicted. 0 contigs had DTRs/circularity. 0 contigs had ITRs. 50 were linear/had no end features
grep: no_end_contigs_with_viral_domain/LIN_seqs_for_phanotate.txt: No such file or directory
grep: no_end_contigs_with_viral_domain/LIN_seqs_for_phanotate.txt: No such file or directory
grep: no_end_contigs_with_viral_domain/LIN_seqs_for_phanotate.txt: No such file or directory
grep: no_end_contigs_with_viral_domain/LIN_seqs_for_phanotate.txt: No such file or directory

What does this indicate?

Thanks!

Darren

mtisza1 commented 3 years ago

Darren, I again thank you for raising these issues, and I apologize that my testing wasn't as thorough as I thought. Please do git pull again. Everything should be fixed, and I have 2 questions for you.

I fixed the error with this s/#/ /g

As you thought, MediumWebster1462.blastn.out not found implies that it doesn't have a strong BLASTN hit in your database. I changed the message to say sequence.blastn.out not found, no close BLASTN hits for this sequence.
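The check behind that message can be pictured roughly as follows. This is a sketch, not the code in the pipeline: `report_blastn` is a hypothetical name, and it treats a missing file and an empty file the same way.

```shell
#!/usr/bin/env bash
# Hypothetical helper: say whether a BLASTN result file holds any hits.
report_blastn() {
  local out="$1"
  if [ -s "$out" ]; then
    # -s is true only for a file that exists and is not empty.
    echo "$out: $(wc -l < "$out" | tr -d '[:space:]') line(s)"
  else
    echo "$out not found, no close BLASTN hits for this sequence"
  fi
}
```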

Regarding the BLAST reports, you have the phylogeny of the top hit on the first line, then the description of the top 3 hits. The description of the top hit is also in the note in the ".gbf" and ".fsa" files in the sequin_and_genome_maps directory. I don't really know exactly what users want to do with BLASTN info. What are your thoughts? Should it inform taxonomy in the output?

I also fixed the error with grep: no_end_contigs_with_viral_domain/LIN_seqs_for_phanotate.txt

My other question is, I know your lab has found some interesting segmented RNA viruses. You could of course use Cenote-Taker 2 with -am True on a multifasta of segments from the same virus, but it might be confusing to have a separate ".gbf" for each output. I haven't looked into generating combined outputs for segmented viruses. I could possibly add this feature if you have some insight into the formatting, etc.

DarrenObbard commented 3 years ago

Hi!

Thank you for the pipeline! I have played around with several virus finders, and I have never previously found one that I thought worked well enough to use. I'm thinking we might start to use this routinely - so you're going to have to keep maintaining it!

> Regarding the BLAST reports, you have the phylogeny of the top hit on the first line, then the description of the top 3 hits. The description of the top hit is also in the note in the ".gbf" and ".fsa" files in the sequin_and_genome_maps directory. I don't really know exactly what users want to do with BLASTN info. What are your thoughts? Should it inform taxonomy in the output?

So, as you might imagine, I have some opinions to share! I think this blastn screen (I'm using nt at the moment) is really useful, but I think you should make more use of it for the taxonomy. It looks like your taxonomy might be based on RefSeq? For viruses, RefSeq is always so out of date as to be of relatively little use for spotting 'known' viruses.

I think that, where the blastn is currently reported, it could be done more cleanly, purely as taxonomic information. So, leave out the gene/segment etc. and just report the top hit with "Sequence identity 98% to ". This would be a really clear sign that the user might consider it a previously reported virus, or not (they can choose the threshold). I think this should be in all the outputs it can be, including the overall summary table. In fact, if you have a 90%-plus blastn hit over the whole length, I would replace any proposed taxonomy based on more sophisticated approaches.

Even better than the HSP identity would be a quick pairwise alignment between the new contig and its top blastn hit, and report the overall sequence identity for the shared length.
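A rough sketch of how such a report could be derived from tabular BLASTN output, under several assumptions: the search was run with `-outfmt "6 qseqid sseqid pident length qlen"`, the file holds hits for one query with the best subject listed first, and the 90% identity and 90% coverage cut-offs are illustrative defaults. It approximates identity over the shared length by weighting each HSP by its alignment length, so it is not a true pairwise alignment, and overlapping HSPs can inflate coverage (capped here at 100). The function name `summarise_top_hit` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise the top BLASTN subject for one query.
# Input columns (tab-separated): qseqid sseqid pident length qlen
# Arguments: file [min_identity] [min_coverage]
summarise_top_hit() {
  awk -F'\t' -v min_id="${2:-90}" -v min_cov="${3:-90}" '
    NR == 1 { q = $1; s = $2; qlen = $5 }                # first line = top hit
    $1 == q && $2 == s { aln += $4; ident += $3 * $4 }   # HSPs of that pair
    END {
      if (aln == 0) { print "no hits"; exit }
      id  = ident / aln                 # length-weighted percent identity
      cov = 100 * aln / qlen            # percent of the query aligned
      if (cov > 100) cov = 100
      verdict = (id >= min_id && cov >= min_cov) ? "likely known" : "possibly novel"
      printf "%s\t%s\t%.1f\t%.1f\t%s\n", q, s, id, cov, verdict
    }' "$1"
}
```

The output columns are query, top subject, identity, coverage and a verdict, which could feed a summary table; the user-chosen threshold maps onto the two optional arguments.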

> My other question is, I know your lab has found some interesting segmented RNA viruses. You could of course use Cenote-Taker 2 with -am True on a multifasta of segments from the same virus, but it might be confusing to have a separate ".gbf" for each output. I haven't looked into generating combined outputs for segmented viruses. I could possibly add this feature if you have some insight into the formatting, etc.

I think this would be great! I think the GenBank files could literally just be concatenated, as could the gtf files to go with the fsa files. I don't know if it's too ugly, but folders could be created to hold the un-concatenated versions; then the concatenated file names could match the folders.
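As a sketch of that layout (not an existing Cenote-Taker 2 feature): per-segment files are concatenated into one file per type, and the originals are moved into a folder with the same base name. The function name `combine_segments` and the example names are hypothetical; the extensions are the ones mentioned in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: combine_segments NAME SEG1 SEG2 ...
# Builds NAME.gbf / NAME.gtf / NAME.fsa from SEGn.<ext> and moves the
# per-segment originals into a folder called NAME.
combine_segments() {
  local name="$1" ext f
  shift
  mkdir -p "$name"
  for ext in gbf gtf fsa; do
    : > "$name.$ext"                      # start from an empty combined file
    for f in "$@"; do
      if [ -f "$f.$ext" ]; then
        cat "$f.$ext" >> "$name.$ext"     # append in the order given
        mv "$f.$ext" "$name/"
      fi
    done
    # Drop the combined file again if no segment had this file type.
    [ -s "$name.$ext" ] || rm -f "$name.$ext"
  done
}
```

For example, `combine_segments MyVirus seg1 seg2` would leave MyVirus.gbf next to a MyVirus/ folder holding seg1.gbf and seg2.gbf. Plain concatenation works for GenBank flat files because each record is terminated by its own `//` line.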

I have a number of other questions / suggestions. Would you like them here, or by email?

mtisza1 commented 3 years ago

Thanks for the feedback. Let's discuss further by email, and I'll make sure to include any changes that get made into the change log for the next update. michael.tisza@gmail.com