mtisza1 / Cenote-Taker2

Cenote-Taker2: Discover and Annotate Divergent Viral Contigs (Please use Cenote-Taker 3 instead)
MIT License

Catalog content no_end_contigs_with_viral_domain #31

Closed by SergeyBaikal 2 years ago

SergeyBaikal commented 2 years ago

Good afternoon! Could you please tell me which of these files contains all of the annotated proteins, and which one can be used, for example, for vConTACT analysis?

all_LIN_HMM2_proteins.AA.fasta
all_LIN_rps_proteins.AA.fasta
all_LIN_sort_genome_proteins.AA.fasta
all_prunable_rps_proteins.AA.fasta
all_prunable_seq_proteins.AA.fasta

mtisza1 commented 2 years ago

Hi Sergey,

To get the files for vConTACT2 from a single Cenote-Taker 2 run, go to the base directory of your run and run the commands below. It should be possible to combine these files from several runs; just make sure the combined genes-to-genomes file has only one header line.

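# specify summary file:
SUMMARY="test_ssd0_4ct_CONTIG_SUMMARY.tsv"
# make files for vConTACT2: write the header line once
echo "protein_id,contig_id,keywords" > vcontact2_gene_to_genome1.csv
# columns 2 and 4 of the summary are read into VIRUS and END
tail -n+2 "$SUMMARY" | cut -f2,4 | while read -r VIRUS END ; do
  # contigs whose end feature is DTR use the rotated protein file
  if [[ "$END" == "DTR" ]] ; then
    AA=$( find . -type f -name "${VIRUS}.rotate.AA.sorted.fasta" )
  else
    AA=$( find . -type f -name "${VIRUS}.AA.sorted.fasta" )
  fi
  # skip contigs with no protein file, so grep and cat never fall back to reading stdin
  if [[ -z "$AA" ]] ; then echo "no protein file found for ${VIRUS}" >&2 ; continue ; fi
  # write one "protein,contig" row per FASTA header
  grep -F ">" $AA | cut -d " " -f1 | sed 's/>//g' | while read -r LINE ; do
    echo "${LINE},${VIRUS}"
  done >> vcontact2_gene_to_genome1.csv
  # append the protein sequences to the combined FASTA
  cat $AA >> vcontact2_all_proteins.faa
done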

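This creates two files: vcontact2_all_proteins.faa and vcontact2_gene_to_genome1.csv. The protein file is appended to (>>) rather than overwritten, so delete any existing vcontact2_all_proteins.faa before rerunning the commands.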
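
For combining several runs, a minimal sketch might look like the following. It assumes each run directory already contains its own vcontact2_gene_to_genome1.csv and vcontact2_all_proteins.faa; the function name combine_vcontact2_inputs, the output file names, and the run directory arguments are hypothetical and not part of Cenote-Taker 2.

```shell
# Sketch only: combine per-run vConTACT2 inputs, keeping a single header line.
# combine_vcontact2_inputs and the combined_* file names are hypothetical.
combine_vcontact2_inputs() {
  out_csv="combined_gene_to_genome.csv"
  out_faa="combined_all_proteins.faa"
  # write the header exactly once and start from an empty protein file
  echo "protein_id,contig_id,keywords" > "$out_csv"
  : > "$out_faa"
  for run in "$@" ; do
    # drop each run's own header line before appending its rows
    tail -n +2 "$run/vcontact2_gene_to_genome1.csv" >> "$out_csv"
    cat "$run/vcontact2_all_proteins.faa" >> "$out_faa"
  done
}
```

It could be called as combine_vcontact2_inputs run1 run2, where run1 and run2 stand in for the base directories of your runs. The sketch does not check for protein or contig names that repeat across runs, so that is worth checking separately.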

I hope this helps!

SergeyBaikal commented 2 years ago

Thank you very much for your prompt response, Mike! Yes, it helped me.