mtisza1 / Cenote-Taker2

Cenote-Taker2: Discover and Annotate Divergent Viral Contigs (Please use Cenote-Taker 3 instead)
MIT License
56 stars 7 forks source link

Circularity detection and BlastN #15

Closed KailunZM closed 2 years ago

KailunZM commented 2 years ago

Hello Mike,

I'm using Cenote-Taker2 to discover viral elements in my samples. I'm really glad that there are lots of viral genes are identified but I have two questions and hope to have you suggestions.

  1. The end_feature of some metagenomic plasmids are labeled as NONE. Screen Shot 2021-10-01 at 9 58 28 AM

I got different end_feature results of analyzing the same assembly data file when I run it separately (s00.1003single.txt) or with other samples (s00.1003list.txt). The latter one means that I have a list of assemblies (scaffold_list.txt) from SPAdes and I run the program follow the list. Below are the scripts for both runs. The first sample in the list is 1003-01.fasta. What I found was that, when it was analyzed separately, the end features of some viral sequences could be annotated as DTR (as the picture shown above). However, when I run it in a list, all of the metagenomic plasmids were labeled as NONE, though other info were the same such as LENGTH.

Also, different outputs were observed. In separated run, I got AA.fasta files (e.g., d00_ct1003single1033.AA.fasta.txt) but when run in list, I got permuted.fa files (e.g., permuted.1033.fa.txt) instead. I know it's really confused and I'm not sure if these problems are connected to each other. I attached the log files for both runs and please let me know if you need more information to figure the problem.

s00.1003single.txt s00.1003list.txt

Log file for separated running: x_dna_test_33051552.txt y_dna_test_33051552.txt

Log file for running in the list: x_dna_test_1.txt y_dna_test_1.txt

  1. I turned on blast function as followed script and expected to identify some highly conserved sequences. But all samples showed "no high coverage hits." I want to know if I make any mistakes. Thank you!
--known_strains blast_knowns \
--blastn_db ftp://ftp.ncbi.nlm.nih.gov/ \

Best, Kailun

mtisza1 commented 2 years ago

Kailun,

Thanks for using the tool and for putting in an issue. I think I have answers to both of your questions. I do want to apologize for the slow reply, but I haven't been at work most of this week, so I'm just now getting a chance to get to your question.

Regarding the differences with circularity detection:

The problem is that your -r RUN_TITLE option in the "list" script ("1003-01") is formatted in a problematic way. Cenote-Taker 2 run titles need to be "less than 18 characters, using ONLY letters, numbers and underscores (_)". I know this is probably frustrating, and, honestly, I should have a step to check that the format is correct (I'll put it on the to-do list). Sorry!

Regarding the BLASTN step:

The blastn DB need to be downloaded locally. If the entirety of nt is too large, you could get the virus subset of nt.

Please let me know if this helps.

Mike

KailunZM commented 2 years ago

Hello Mike,

I really appreciate your response and it's helped me fix the circularity detection issue. I basically just replaced "-" with "_". I guess some contigs annotated as "metagenomic plasmid" would just not be labeled as DTR? Also, I found that some of the circular contigs included in "all_circular_contigs.fna" file would not be summarized into the "CONTIG_SUMMARY.tsv" file, such as all_circular_contigs_1003_01.fna.txt and 1003_01_CONTIG_SUMMARY.tsv.txt (the output file looks like this Screen Shot 2021-10-11 at 10 18 49 AM). I hope this is correct.

For the BlastN, I downloaded nt files into our virtual system but it still did not detect any high coverage hits. Here are my steps using Linux:

  1. Download nt files using perl ../bin/update_blastdb.pl --passive --decompress nt into directory /scratch/ref/gdlab/blast_db/nt_2021_10_07
  2. Then run a bash job:
    module load miniconda3
    CONDA_BASE=$(conda info --base)
    source $CONDA_BASE/etc/profile.d/conda.sh

activate env

conda activate /opt/apps/labs/gdlab/envs/cenote-taker2/2.1.3/cenote-taker2_env CENOTE_BASE="/opt/apps/labs/gdlab/envs/cenote-taker2/2.1.3/Cenote-Taker2"

DATA_BASE="/scratch/gdlab/kailun/Phageome/NEC/metaspades"

sample=sed -n ${SLURM_ARRAY_TASK_ID}p ${DATA_BASE}/scaffold_list.txt sample_out=sed -n ${SLURM_ARRAY_TASK_ID}p ${DATA_BASE}/scaffold_list_new.txt

set -x time ${CENOTE_BASE}/run_cenote-taker2.py \ -c ${DATA_BASE}/scaffolds/${sample}.fasta \ -r ${sample_out} \ -p True \ -m 30 \ --orf-within-orf True \ --known_strains blast_knowns \ --blastn_db /scratch/ref/gdlab/blast_db/nt_2021_10_07 \ -t ${SLURM_CPUS_PER_TASK} RC=$? set +x

if [ $RC -eq 0 ] then echo "Job completed successfully" else echo "Error Occured!" exit $RC fi

The log file: x_dna_test_33239020_1.out.txt y_dna_test_33239020_1.out.txt

I'm not sure if it really doesn't have any conserved sequence or I did something wrong. It seems not related with format. What would be the rate of coverage hits detection from your experience?

Another questions of mine is if the bacterial false positives are removed during Cenote-Taker2 analysis? Thank you!

Best, Kailun

mtisza1 commented 2 years ago

Kailun,

I'm glad we could fix some of your issues. Yes, plasmids are not identified by being circular contigs, but rather by their genetic content.

Regarding circular sequences that are not in the contig summary table: These contigs didn't have virus hallmark genes, and are probably plasmids or other sequences that assembled as circles for one reason or another. While Cenote-Taker 2 does spit out some "metagenomic plasmids", these are sequences that have genes that were preliminarily marked as potentially viral, but the secondary filter (using BLAST) showed that they are likely plasmids instead. Same with the "Conjugative Transposon" sequences. There are almost always several of these sequences when you use -db standard with input of metagenomic contigs that are not derived from virus particle preparations.

I actually suggest using -db virion with these data, but it's up to the user in the end. If you want to know what all the circular sequences are, you could use the "all_circular_contigs_1003_01.fna" file as an input for a new Cenote-Taker 2 run using options --circ_minimum_hallmark_genes 0 or simply '-am True`.

A final note on this, you could use the option --filter_out_plasmids False on your runs if you want more plasmids. This uses an expanded HMM database with plasmid replication proteins. Honestly, there are probably better programs to use if your goal is to systematically identify plasmids (though I've noticed that many "plasmid" databases contain circular phage genomes).

Regarding BLASTN:

I think you are specifying the BLASTN directory instead of the the BLASTN database name. For example I have a directory /fdb/blastdb/ with a database called nt

$ ls /fdb/blastdb/nt* | head
/fdb/blastdb/nt.00.nhd
/fdb/blastdb/nt.00.nhi
/fdb/blastdb/nt.00.nhr
/fdb/blastdb/nt.00.nin
/fdb/blastdb/nt.00.nnd
/fdb/blastdb/nt.00.nni
/fdb/blastdb/nt.00.nog
/fdb/blastdb/nt.00.nsq
/fdb/blastdb/nt.01.nhd
/fdb/blastdb/nt.01.nhi

I would use --blastn_db /fdb/blastdb/nt with Cenote-Taker 2. It's hard to say how many close hits you'd expect to nt with your input contigs. If these are contigs from some weird hot spring or something, there could be no hits.

Regarding bacterial false positives:

I'm not exactly sure what you mean, but I can say that Cenote-Taker 2 is not perfect, and has some false positive rate. Using -db virion will give you fewer false positives, but might miss some subgenomic virus fragments compared with using -db standard. Every virus discovery program has false positives and false negatives, but I hope Cenote-Taker 2 helps you sort these out with inspection of genome maps. If the genome map looks "weird", I would discard the contig. Any feedback on false positives is always welcome as I polish the hallmark HMM database.

On the other hand, if you are asking if Cenote-Taker 2 removes flanking bacterial genes from contigs containing an integrated prophage, then, yes it does if you use -p True. Determining the bacteria/phage border is also an approximate exercise. You can check out the supplementary figures from the Cenote-Taker 2 manuscript to get a sense of its accuracy.

Let me know if this helps or if you have any other questions.

Mike

KailunZM commented 2 years ago

Hello Mike,

BLASTN has been working for me. Thanks a lot for your detailed suggestions! Though it doubled the running time than without Blast function, I'm really glad that to get this informartion.

For the false negatives, you've suggested to inspect genome maps. It makes a lot of sense when I look into the map of each sequence. However, I generated about 25k sequences and I'm not sure if I can look through all of them. Do you have any experience of doing it more effectively?

For figuring bacterial false positives, I'm using -db standard to identify all the potential viral genomes and trying to use BUSCO to determine whether the genome could be a bacterial false positive, following the procedure in https://www.sciencedirect.com/science/article/pii/S193131282030456X. It basically used bacterial universal single-copy orthologs (i.e., BUSCO) to assess the level of bacterial gene enrichment of viral contigs generated from VirSorter and VirFinder, and compared with Viral RefSeq genome, as I understand. I'll update you how the result looks like later, hopefully ^-^

Thank you for you help again! Please let me know if you have any good methods of map inspection.

Best, Kailun

mtisza1 commented 2 years ago

Kailun,

Great to hear! Yes, the BLASTN step does take a while (when using all of nt as a database) and I haven't figured out how to speed it up any further.

25k sequences is certainly a lot to look through. (Also, please note that metagenomic plasmids and conjugative transposons should NOT be considered viruses). My unsolicited advice in this project is to use caution with such large data. Personally, I would discard "linear" (i.e. non-DTR/circular and non-ITR) contigs under 5kb, and would again urge you to use -db virion for your data type. To cluster similar sequences and get a representative of each taxon, you can use checkv scripts (anicalc.py and aniclust.py) https://bitbucket.org/berkeleylab/checkv/src/master/

I've played around a little with the BUSCO aproach described in the excellent study above. It's OK, but (in my opinion) there are 2 points that it fails to account for: 1) There are a few "single-copy marker gene" HMMs (in the set that I believe the authors of this study used) that are actually annotated as being related to prophage/mobile genetic element replication genes. 2) many/most false positives will be other mobile genetic elements which would explicitly be missed by the BUSCO approach (though see point 1).

Best,

Mike

KailunZM commented 2 years ago

Hi Mike,

Really good points for BUSCO. Just to make it clear, because of point 1, I probably would remove some prophages, right?

I'll definitely learn deeply about CheckV. From my superficial understanding, the sequences with higher level of completeness or quality have higher potential to be real viral genomes and I need to pay more attention to judge the sequences with lower completeness.

Thanks!

Best, Kailun

mtisza1 commented 2 years ago

I think the BUSCO approach could remove a small number of prophages, but you could use the approach and just filter the prophage/mobile genetic element replication gene HMMs from your "hits". I don't recall which HMMs these were, however. I think there are several HMM sets you choose from when using the BUSCO tool.

Yes, the main function of checkv is to assign completeness to putative viral sequences, the second function is to prune prophage regions, and the third thing (anicalc/aniclust) is to cluster similar sequences. With a large scale project like yours, you might consider making a checkv cutoff (e.g. 10% completeness).

Good luck!

Mike

KailunZM commented 2 years ago

Got it. Thank you again for your answers and suggestions! They are really helpful.

Best, Kailun