Qingxiang (Allen) Guo
Postdoctoral Fellow
Northwestern University, Feinberg School of Medicine
Here we propose an efficient framework for constructing a comprehensive protein reference database, the "customized comprehensive proteomic reference database" (CCPRD), which incorporates draft genomes and transcriptomes. Compared with previous protocols, our method performs better in the numbers of identified peptides and proteins, the number of database entries, database size, and completeness.
Perl version >= 5.10, Python version = 2.7, acc2tax version 0.4, Anvi’o version 2.0.2, Augustus version 3.2.2, BLAST version 2.2.25, Blobtools version 0.9.19, Bowtie2 version 2.2.9, CD-HIT version 4.6.6, Diamond version 0.8.31.93, EMBOSS version 6.6.0, Ete 3, EVidenceModeler version 1.1.1, GeneMarkS-T version 5.1, GeneMark-ES-ET version 2.5m, PASA version 2.2.0, RepeatMasker version 4.0.5, RepeatModeler version 1.0.4, Samtools version 1.3.1, SNAP version 2006-07-28, SOAPdenovo version 2.04, TGICL version 2.1, Tophat version 2.1.1, TransDecoder version 3.0.0, Trinity version 2.2.0, Wise version 2.4.1
1. Assemble the transcriptomes and genomes
1.1 Assemble the transcriptomes from clean data
Trinity --seqType fq --left M.wulii_1_fixed.fastq --right M.wulii_2_fixed.fastq --max_memory 200G --min_contig_length 200 --min_glue 3 --group_pairs_distance 250 --path_reinforcement_distance 85 --min_kmer_cov 2 --jaccard_clip --normalize_reads --CPU 32
mv Trinity.fasta wulii.fasta
1.2 Remove redundancy in transcriptomes with CD-HIT
cd-hit-est -i wulii.fasta -o wulii.fasta_cdhit -c 0.95 -n 10 -d 0 -M 16000 -T 8
1.3 Further cluster the transcriptomes with TGICL
tgicl -l 40 -c 10 -v 25 -O '-repeat_stringency 0.95 -minmatch 35 -minscore 35' -F wulii.fasta_cdhit
1.4 Process the CD-HIT and TGICL results to get unigenes
fast_extract_seq_from_fasta.pl wulii.fasta_cdhit wulii.fasta_cdhit.singletons > extracted.fasta
cat asm_1/contigs asm_2/contigs asm_3/contigs asm_4/contigs > all.contigs
Unigene_generator.pl -s extracted.fasta -c all.contigs -t WL
1.5 Assemble the genomes from clean data
SOAPdenovo all -s lib.cfg -K 51 -D 1 -o WL >> soap.log
mv WL.scafSeq genome.fasta
GapCloser -a genome.fasta -b config.txt -o gapcloser.fasta -t 32
mv gapcloser.fasta WL_genome.fasta
2. Remove potential host and bacterial contamination in transcriptome data with the "conservative reciprocal best blast hit" method
2.1 Collect all available protein and nucleotide sequences to construct non-redundant databases for the host and for closely related species
2.2 Remove redundancy in each database
cd-hit-est -i nucl_host.fasta -o nucl_host_cdhit.fasta -c 0.9 -n 8 -T 8
cd-hit-est -i nucl_myxo.fasta -o nucl_myxo_cdhit.fasta -c 0.9 -n 8 -T 8
cd-hit -i prot_host.fasta -o prot_host_cdhit.fasta -c 0.9
cd-hit -i prot_myxo.fasta -o prot_myxo_cdhit.fasta -c 0.9
2.3 Add tags in databases
replace_header_for_cdhit.pl -c nucl_host_cdhit.fasta -t HN
replace_header_for_cdhit.pl -c prot_host_cdhit.fasta -t HP
replace_header_for_cdhit.pl -c nucl_myxo_cdhit.fasta -t MN
replace_header_for_cdhit.pl -c prot_myxo_cdhit.fasta -t MP
2.4 Start hybridization
tblastn -query HP_cdhit.fasta -db /home/gqx/transcriptome/assembly/M.wulii/clustering/MWdb -max_hsps 1 -out result_HP -evalue 1e-5 -outfmt 6 -num_threads 8
tblastn -query MP_cdhit.fasta -db /home/gqx/transcriptome/assembly/M.wulii/clustering/MWdb -max_hsps 1 -out result_MP -evalue 1e-5 -outfmt 6 -num_threads 8
tblastx -query MN_cdhit.fasta -db /home/gqx/transcriptome/assembly/M.wulii/clustering/HHdb -max_hsps 1 -out result_MN -evalue 1e-5 -outfmt 6 -num_threads 8
tblastx -query HN_cdhit.fasta -db /home/gqx/transcriptome/assembly/M.wulii/clustering/HHdb -max_hsps 1 -out result_HN -evalue 1e-5 -outfmt 6 -num_threads 8
2.5 Process above results and remove the transcripts that only matched to host databases
cut -f 2 result_MP > 1 && remove_duplicate.pl 1 && mv duplicate_remove MP
cut -f 2 result_HP > 1 && remove_duplicate.pl 1 && mv duplicate_remove HP
cut -f 2 result_MN > 1 && remove_duplicate.pl 1 && mv duplicate_remove MN
cut -f 2 result_HN > 1 && remove_duplicate.pl 1 && mv duplicate_remove HN
mkdir combo && cd combo/
cp ../1_host/nucl_host/hybrid_HN/HN ./
cp ../1_host/prot_host/hybrid_HP/HP ./
cp ../2_myxo/nucl_myxo/hybrid_MN/MN ./
cp ../2_myxo/prot_myxo/hybrid_MP/MP ./
cat HN HP > H_all
cat MN MP > M_all
remove_duplicate.pl H_all && mv duplicate_remove H
remove_duplicate.pl M_all && mv duplicate_remove M
list_compare.pl M H
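The helper scripts remove_duplicate.pl and list_compare.pl are not reproduced in this protocol. As a rough sketch, and assuming they simply deduplicate an ID list and split two lists into host-only, myxozoan-only, shared, and combined sets, the same partition could be obtained with sort and comm. The function name compare_lists and the toy IDs are illustrative only; the output file names follow those used in the next steps.

```shell
# Sketch only: approximates remove_duplicate.pl + list_compare.pl with
# standard tools. The behaviour of the original Perl helpers is assumed.
set -eu
export LC_ALL=C   # comm requires both inputs sorted in the same collation

compare_lists() {
    # $1 = IDs hitting myxozoan databases, $2 = IDs hitting host databases,
    # $3 = output directory
    sort -u "$1" > "$3/M.sorted"
    sort -u "$2" > "$3/H.sorted"
    comm -23 "$3/M.sorted" "$3/H.sorted" > "$3/M_only"            # myxozoan only
    comm -13 "$3/M.sorted" "$3/H.sorted" > "$3/H_only"            # host only
    comm -12 "$3/M.sorted" "$3/H.sorted" > "$3/inter_of_M_and_H"  # hit both
    sort -u "$3/M.sorted" "$3/H.sorted"  > "$3/union_of_M_and_H"  # hit either
}

# Toy demonstration in a scratch directory
workdir=$(mktemp -d)
printf 'WL_1\nWL_2\nWL_2\nWL_3\n' > "$workdir/M"
printf 'WL_3\nWL_4\n'             > "$workdir/H"
compare_lists "$workdir/M" "$workdir/H" "$workdir"
cat "$workdir/H_only"
```

Only the IDs in H_only are candidates for removal as host contamination; transcripts that also match the myxozoan databases are kept.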
mkdir 1_only_to_host 2_only_to_myxo 3_both_match 4_neither_match 5_delete_host
cd 1_only_to_host && cp ../../3_combo/H_only ./Host_only_list
cd ../2_only_to_myxo/ && cp ../../3_combo/M_only ./Myxo_only_list
cd ../3_both_match/ && cp ../../3_combo/inter_of_M_and_H ./both_match_list
cd ../4_neither_match && cp ../../3_combo/union_of_M_and_H ./
ln -s ~/transcriptome/3_1_assembly/M.wulii/WL_Unigene.fasta ./
remove_contaminant_by_ID.pl WL_Unigene.fasta union_of_M_and_H
mv survive.fasta M.wulii_neither_match.fasta
extract_fasta_header.pl M.wulii_neither_match.fasta
mv header neither_match_list
cd ../5_delete_host
remove_contaminant_by_ID.pl /home/gqx/transcriptome/assembly/WL_Unigene.fasta ../1_only_to_host/Host_only_list
mv survive.fasta WL_host_delete_Unigene.fasta
2.6 Collect proteins to construct a non-redundant bacterial database
2.7 Remove redundancy in bacterial database and blast the host-seq-removed transcriptomes
cd-hit -i bac_protein.fasta -o bac_protein_cdhit.fasta -c 0.95 -n 5 -T 8
makeblastdb -in bac_protein_cdhit.fasta -out BAC -dbtype prot -parse_seqids -hash_index
blastx -query WL_host_delete_Unigene.fasta -db BAC -out result_bac_e10 -evalue 1e-10 -outfmt 6 -num_threads 32
2.8 First round of bacterial contamination removal and confirmation
cat result_bac_e10 | cut -f1 > 1
remove_duplicate.pl 1
mv duplicate_remove bacteria_contam_list
extract_seq_from_fasta.pl WL_host_delete_Unigene.fasta bacteria_contam_list
mv extracted.fasta bacteria_contam_first.fasta
blastx -query bacteria_contam_first.fasta -db nr -out bacteria_contam_first_nr_result -evalue 1e-5 -max_target_seqs 1 -num_threads 56 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle frames sscinames sskingdoms"
grep "Bacteria" bacteria_contam_first_nr_result | cut -f 1 > 2
remove_duplicate.pl 2 && rm 2
mv duplicate_remove true_bacteria_contam.list
remove_contaminant_by_ID.pl WL_host_delete_Unigene.fasta true_bacteria_contam.list
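remove_contaminant_by_ID.pl is used throughout this protocol to drop flagged sequences and write the survivors to survive.fasta, but the script itself is not shown. The following is a minimal awk sketch of that kind of filter, assuming the ID list holds one sequence ID per line and that IDs are matched against the first word of each FASTA header. The function name remove_by_id and the toy records are hypothetical.

```shell
# Sketch only: drop FASTA records whose ID appears in a list.
# The matching rule (first word of the header) is an assumption.
set -eu

remove_by_id() {
    # $1 = FASTA file, $2 = file with one ID per line
    awk -v list="$2" '
        BEGIN {
            # Load the IDs to discard into a lookup table
            while ((getline line < list) > 0) {
                split(line, f, " ")
                bad[f[1]] = 1
            }
            close(list)
        }
        /^>/ { keep = !(substr($1, 2) in bad) }  # decide once per record
        keep                                      # print header and sequence lines
    ' "$1"
}

# Toy demonstration
workdir=$(mktemp -d)
printf '>a desc\nACGT\n>b\nGGGG\nCCCC\n>c\nTTTT\n' > "$workdir/in.fasta"
printf 'b\n' > "$workdir/bad_list"
remove_by_id "$workdir/in.fasta" "$workdir/bad_list" > "$workdir/survive.fasta"
cat "$workdir/survive.fasta"
```

Multi-line sequences are handled because the keep flag persists until the next header line.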
2.9 Second round of bacterial contamination removal
diamond makedb --in uniref90.fasta --db uniref90
diamond blastx -q WL_host_delete_Unigene.fasta --sensitive -k 20 -c 1 --threads 32 --db uniref90 --out diamond_result
perl -lne 'BEGIN{open UT, "<uniref90.taxlist" or die $!; while (
blastn -task megablast -query WL_host_delete_Unigene.fasta -db nt -culling_limit 5 -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle staxids sscinames sskingdoms' -num_threads 48 -evalue 1e-25 -out assembly_megablast_25.out
perl -lne 'BEGIN{open UT, "<acc2tax_nuclall.txt" or die $!; while (
cat ../diamond/assembly_diamond_10.out_taxid ../megablast/assembly_megablast_25.out_taxid > all
export LANG=C; export LC_ALL=C; sort -k1,1 -k12,12gr -k11,11g -k3,3gr all | sort -u -k1,1 --merge > bestHits
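The two-step sort above keeps one best hit per query: rows are grouped by query ID (column 1) and ordered by bit score descending (column 12), then E-value ascending (column 11), then percent identity descending (column 3), and the second sort keeps the first row of each query. The sketch below shows the same idea on toy data, replacing the second sort with an awk filter so that the choice of the first row does not depend on how sort -u treats ties. The function name best_hits and the toy hits are illustrative.

```shell
# Sketch only: pick the top-ranked hit per query from tabular BLAST output.
set -eu
export LC_ALL=C
TAB=$(printf '\t')

best_hits() {
    # $1 = tabular BLAST/DIAMOND output (outfmt 6, at least 12 columns)
    sort -t "$TAB" -k1,1 -k12,12gr -k11,11g -k3,3gr "$1" |
        awk -F'\t' '!seen[$1]++'   # first row per query = best-ranked hit
}

# Toy demonstration: q1 has two hits, q2 has one
workdir=$(mktemp -d)
{
    printf 'q1\ts1\t90.0\t100\t0\t0\t1\t100\t1\t100\t1e-30\t200\n'
    printf 'q1\ts2\t95.0\t100\t0\t0\t1\t100\t1\t100\t1e-50\t300\n'
    printf 'q2\ts3\t80.0\t100\t0\t0\t1\t100\t1\t100\t1e-10\t100\n'
} > "$workdir/hits.tsv"
best_hits "$workdir/hits.tsv" | cut -f1,2,12
```

For q1 the hit to s2 is retained because its bit score (300) is higher than that of s1 (200).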
cat bestHits | rev | cut -f 1 | rev > 2
tax2name.py > result_1
get_species_name_from_ete3.pl
species_distribution.pl names
cat ./* > fish_protein.fasta
cd-hit -i fish_protein.fasta -o fish_protein_cdhit.fasta -c 0.95 -n 5 -T 8
makeblastdb -in fish_protein_cdhit.fasta -out FISH -dbtype prot -parse_seqids -hash_index
cat ./* > bac_protein.fasta
cd-hit -i bac_protein.fasta -o bac_protein_cdhit.fasta -c 0.95 -n 5 -T 8
makeblastdb -in bac_protein_cdhit.fasta -out BAC -dbtype prot -parse_seqids -hash_index
blastx -query WL_all_filter_Unigene.fasta -db BAC -out result_bac_1 -evalue 1 -outfmt 6 -num_threads 48
blastx -query WL_all_filter_Unigene.fasta -db FISH -out result_fish_1 -evalue 1 -outfmt 6 -num_threads 48
cat result_bac_1 | cut -f1 > 1
remove_duplicate.pl 1
mv duplicate_remove bacteria_contam_list
cat result_fish_1 | cut -f1 > 1
remove_duplicate.pl 1
mv duplicate_remove fish_contam_list
extract_seq_from_fasta.pl WL_all_filter_Unigene.fasta bacteria_contam_list
mv extracted.fasta bacteria_contam_first.fasta
extract_seq_from_fasta.pl WL_all_filter_Unigene.fasta fish_contam_list
mv extracted.fasta fish_contam_first.fasta
diamond makedb --in nr -d nr -p 24
diamond blastx -q fish_contam_first.fasta --sensitive -k 20 -c 1 --evalue 1e-5 --threads 48 --db nr.dmnd --out fish_diamond_5.out
diamond blastx -q bacteria_contam_first.fasta --sensitive -k 20 -c 1 --evalue 1e-5 --threads 48 --db nr.dmnd --out bacteria_diamond_5.out
export LANG=C; export LC_ALL=C; sort -k1,1 -k12,12gr -k11,11g -k3,3gr bacteria_diamond_5.out | sort -u -k1,1 --merge >> diamond_bestHits
export LANG=C; export LC_ALL=C; sort -k1,1 -k12,12gr -k11,11g -k3,3gr fish_diamond_5.out | sort -u -k1,1 --merge >> diamond_bestHits
give_tax_2_diamond_blastx.pl diamond_bestHits
grep "Bacteria" diamond_blastx_with_tax > list_2
grep " Teleostomi" diamond_blastx_with_tax > list_2
cat bacteria/list_2 fish/list_2 > bad_list
cat bad_list | cut -f1 > 1
remove_duplicate.pl 1
mv duplicate_remove true_bad_list
remove_contaminant_by_ID.pl WL_all_filter_Unigene.fasta true_bad_list
3. Remove contamination from genomes with the TAGC method
3.1 Install Blobtools; for details, see https://blobtools.readme.io/docs
3.2 Process the genomes and get the mapping results
filter_fasta_by_length.pl WL_genome.fasta 200 200000 WL_genome_200.fasta
bowtie2-build WL_genome_200.fasta index --threads 8
bowtie2 -p 24 -x index -1 M.wulii_1.fq -2 M.wulii_2.fq -k 1 --very-fast-local -S out.sam
samtools view -bS out.sam > out.bam
3.3 Blast against NCBI Nucleotide database using megablast and against UniRef90 using diamond BLASTX
blastn -task megablast -query WL_genome_200.fasta -db nt -culling_limit 5 -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle staxids sscinames sskingdoms' -num_threads 48 -evalue 1e-25 -out assembly_megablast_25.out
perl -lne 'BEGIN{open UT, "<acc2tax_nuclall.txt" or die $!; while (
awk -v OFS="\t" -F"\t" '{print $1,$17,$12}' assembly_megablast_25.out_taxid > mega.out
diamond makedb --in uniref90.aa -d uniref90
diamond blastx -q WL_genome_200.fasta --sensitive -k 20 -c 1 --evalue 1e-10 --threads 48 --db uniref90.dmnd --out assembly_diamond_10.out
perl -lne 'BEGIN{open UT, "<uniref90.taxlist" or die $!; while (
awk -v OFS="\t" -F"\t" '{print $1,$13,$12}' assembly_diamond_10.out_taxid > diamond.out
3.4 Process above results by the blobtools script to annotate each scaffold
cat mega.out diamond.out > blast.out
sort_blast_by_query_name.pl blast.out
mv sorted_output blast.out
python2.7 blobtools create -i WL_genome_200.fasta -b out.bam -t blast.out -o M.wulii_1_blob --names names.dmp --nodes nodes.dmp
python2.7 ../blobtools view -i M.wulii_1_blob.blobDB.json -o ./
3.5 Visualize the annotation results as a Taxon-Annotated GC-Coverage (TAGC) plot
python2.7 blobtools blobplot -i M.wulii_1_blob.blobDB.json -o ./ --format pdf --colours colours.txt
3.6 Potential contamination is inspected manually and compared against NCBI Nucleotide database
format.sh > result
blob_result_seq_extract.pl M.wulii_1_blob.blobDB.table.txt
extract_seq_from_fasta.pl WL_genome_200.fasta seq_for_blast
mv extracted.fasta contam_candidate.fa
blastn -query contam_candidate.fa -db nt -evalue 1e-5 -max_target_seqs 20 -num_threads 24 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle frames sscinames sskingdoms" -out nt_result
export LANG=C; export LC_ALL=C; sort -k1,1 -k12,12gr -k11,11g -k3,3gr nt_result | sort -u -k1,1 --merge > bestHits
cat bestHits | cut -f2 > acc
perl -p -i -e 's/.(\d)//g' acc
acc2tax -i acc -o result -d accession2taxid
Teleostomi_Bacteria_extract.pl result
cat bestHits | cut -f1,2 > header
perl -p -i -e 's/.(\d)//g' header
get_contam_from_accesion.pl contam_accession header
another_blob_result_seq_extract.pl M.wulii_1_blob.blobDB.table.txt
cat seq_remove_by_bam0 final_contam_header > true_bad_list
remove_duplicate.pl true_bad_list
mv duplicate_remove true_bad_list
3.7 Remove contamination in genomes
remove_contaminant_by_ID.pl WL_genome_200.fasta true_bad_list
mv survive.fasta genome.fasta
4. Genome gene prediction
4.1 GeneMark-ET provides species HMM file to Augustus for training
tophat -o TophatOutput -p 4 --no-novel-juncs ~/transcriptome/myxobolus/carp_remov/index/genome ~/transcriptome/myxobolus/processing/combinedF.fastq ~/transcriptome/myxobolus/processing/combinedR.fastq
bet_to_gff.pl --bed junctions.bed --gff introns.gff --label tophat2 --seq genome.fasta
gmes_petap.pl --sequence genome.fasta --ET introns.gff --et_score 10 --cores 4 --min_contig 2000
/opt/biosoft/PASApipeline-2.0.2/misc_utilities/gtf_to_gff3_format.pl genemark.gtf genome.fasta > genemark.gff3
filterGenemark.pl genemark.gtf introns.gff
/opt/biosoft/PASApipeline-2.0.2/misc_utilities/gtf_to_gff3_format.pl genemark.f.good.gtf genome.fasta > genemark.f.good.gff3
mv genemark.f.good.gff3 best_candidates.gff3
/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/gff3_file_to_proteins.pl best_candidates.gff3 genome.fasta prot > best_candidates.fasta
remove_redundant_high_identity_genes.pl best_candidates.gff3 best_candidates.fasta 4 0.70 > best_candidates.lowIdentity.gff3 2> remove_redundant_high_identity_genes.log
4.2 Run Augustus gene prediction
gff2gbSmallDNA.pl best_candidates.lowIdentity.gff3 genome.fasta 800 genes.raw.gb
new_species.pl --species=for_bad_genes_removing
etraining --species=for_bad_genes_removing --stopCodonExcludedFromCDS=false genes.raw.gb 2> train.err
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
randomSplit.pl genes.gb 100
new_species.pl --species=myxobolus_wulii
etraining --species=myxobolus_wulii genes.gb.train > train.out
augustus --species=myxobolus_wulii genes.gb.test | tee firsttest.out
optimize_augustus.pl --species=myxobolus_wulii --cpus=8 genes.gb.train
etraining --species=myxobolus_wulii genes.gb.train
augustus --species=myxobolus_wulii genes.gb.test | tee secondtest.out
mkdir repeatMasker
cd repeatMasker
fasta_no_blank.pl genome.fasta > genome2.fasta
rm genome.fasta
mv genome2.fasta genome.fasta
RepeatMasker -pa 24 -e ncbi -species cnidaria -gff -dir repeatMasker genome.fasta
cd ..
mkdir repeatModeler
cd repeatModeler
/opt/biosoft/RepeatModeler/BuildDatabase -name wulii -engine ncbi genome.fasta
/opt/biosoft/RepeatModeler/RepeatModeler -database wulii -pa 8
/opt/biosoft/RepeatMasker/RepeatMasker -pa 4 -e ncbi -lib RM_37443.MonJan231456212017/consensi.fa.classified -dir ./ -gff genome.fasta
cd ..
merge_repeatMasker_out.pl repeatMasker/genome.fasta.out repeatModeler/genome.fasta.out > genome.repeat.stats
maskedByGff.pl genome.repeat.gff3 genome.fasta hardmaskN > genome.hardmaskN.fasta
mv genome.hardmaskN.fasta genome_db.fa
bowtie2-build genome_db.fa genome_db --threads 8
tophat2 -N 3 --read-edit-dist 3 -p 32 -i 20 -I 4000 --min-segment-intron 20 --max-segment-intron 4000 --min-coverage-intron 20 --max-coverage-intron 4000 --coverage-search --microexon-search -o result genome_db M.wulii_1_fixed.fastq M.wulii_2_fixed.fastq
bam2hints --intronsonly --in=result/accepted_hits.bam --out=hints.gff
augustus --species=myxobolus_wulii_1 --extrinsicCfgFile=extrinsic.cfg --alternatives-from-evidence=true --allow_hinted_splicesites=atac --hintsfile=hints.gff --gff3=on genome.fasta > aug.gff3
perl -p -i -e 's/\ttranscript\t/\tmRNA\t/' aug.gff3
4.3 GeneMark-ET provides species HMM file to SNAP for training
maker2zff genemark.gff3
extract_header_for_snap.pl genome.dna
fastaindex genome.fasta genome.idx
fastafetch -f genome.fasta -i genome.idx -Fq <(sort -u header) > out
mv out genome.dna
fathom -categorize 1000 genome.ann genome.dna
fathom uni.ann uni.dna -export 1000 -plus
mkdir params
cd params/
forge ../export.ann ../export.dna
cd ..
hmm-assembler.pl species params/ > species.hmm
4.4 Run SNAP gene prediction
snap species.hmm genome.fasta -gff -quiet > snap.gff
snap2gff3.pl snap.gff > snap_true.gff
4.5 Homology-based gene prediction by Genewise
rename_fasta_by_numeber.pl all_cdhit.fa
/opt/biosoft/homolog_genewise/homolog_genewise.pl rename_all.fasta genome.hardmaskN.fasta 8 0.1 1e-9
/opt/biosoft/homolog_genewise/genewise_filter.pl genewise.gff genome.hardmaskN.fasta 15 90 1 1e-6 0.30 4 > genewise.filter.gff 2> genewise.filter.stats
/opt/biosoft/homolog_genewise/genewise2EVM_input.pl genewise.filter.gff genome.hardmaskN.fasta filterMiddleStopCodon=yes > evm_protein_alignment.gff3 2> genewise_gene_models_completeness_check.txt
perl -p -i -e 's/^#.*//; s/^\s*$//' genewise.gff
make_evm_recognize_gff.pl genewise.gff
/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/misc/SNAP_to_GFF3.pl evm_wise.gff3 > final_evm_wise.gff3
4.6 Gene prediction by PASA
perl -e 'while (<>) { print "$1\n" if />(\S+)/ }' /home/train/00.incipient_data/data_for_gene_prediction_and_RNA-seq/Trinity.fasta > tdn.accs
seqclean Trinity.fasta -v /opt/biosoft/PASApipeline-2.0.2/seqclean/UniVec
cp /opt/biosoft/PASApipeline-2.0.2/pasa_conf/pasa.alignAssembly.Template.txt alignAssembly.config
DATE=$(date +%Y%m%e%k%M%S | perl -pe 's/\s+//g')
echo "perl -p -i -e 's/MYSQLDB=.*/MYSQLDB=pasa_$DATE/' alignAssembly.config" | sh
/opt/biosoft/PASApipeline-2.0.2/scripts/create_mysql_cdnaassembly_db.dbi -r -c alignAssembly.config -S /opt/biosoft/PASApipeline-2.0.2/schema/cdna_alignment_mysqlschema
/opt/biosoft/PASApipeline-2.0.2/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -R -g genome.fasta -t Trinity.fasta.clean -T -u Trinity.fasta --ALIGNERS gmap,blat --CPU 8 --stringent_alignment_overlap 30.0 --TDN tdn.accs --MAX_INTRON_LENGTH 20000 --TRANSDECODER &> pasa.log
4.7 Combine above gff3 file by EVM
/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl aug.gff3 > evm_augustus.gff3
perl -p -i -e 's/^#.*//; s/^\s*$//' evm_augustus.gff3
gff3_gene_prediction_file_validator.pl evm_augustus.gff3
ln -s ../genemark-et/genemark.gff3 evm_genemark-et.gff3
gff3_gene_prediction_file_validator.pl evm_genemark-et.gff3
/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/misc/SNAP_output_to_gff3.pl snap.zff genome.fasta > snap.gff3
gff3_gene_prediction_file_validator.pl snap.gff3
perl -p -i -e 's/^(\S+)\t(.)\t/$1\tSNAP\t/g' snap.gff3
cp ../new_pasa/pasa*.pasa_assemblies.gff3 ./transcript_alignments.gff3
perl -p -i -e 's/\t\S+/\tpasa_transcript_alignments/' transcript_alignments.gff3
gff3_gene_prediction_file_validator.pl transcript_alignments.gff3
ln -s ../genewise/wise/evm_protein_alignment.gff3 ./protein_alignments.gff3
gff3_gene_prediction_file_validator.pl protein_alignments.gff3
cat evm_augustus.gff3 evm_genemark-et.gff3 snap.gff3 | perl -pe 's/^#.*//; s/^\s*$//' > gene_predictions.gff3
echo -e "ABINITIO_PREDICTION\tAugustus\t6
ABINITIO_PREDICTION\tSNAP\t2
ABINITIO_PREDICTION\tGeneMark.hmm\t1
PROTEIN\tGeneWise\t5
TRANSCRIPT\tpasa_transcript_alignments\t10" > weights.txt
partition_EVM_inputs.pl --genome genome.fasta --gene_predictions gene_predictions.gff3 --protein_alignments protein_alignments.gff3 --transcript_alignments transcript_alignments.gff3 --repeats genome.repeat.gff3 --segmentSize 500000 --overlapSize 10000 --partition_listing partitions_list.out
write_EVM_commands.pl --genome genome.fasta --gene_predictions gene_predictions.gff3 --protein_alignments protein_alignments.gff3 --transcript_alignments transcript_alignments.gff3 --repeats genome.repeat.gff3 --weights "$(pwd)/weights.txt" --partitions partitions_list.out --output_file_name evm.out > commands.list
ParaFly -c commands.list -CPU 4
recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out
convert_EVM_outputs_to_GFF3.pl --partitions partitions_list.out --output_file_name evm.out --genome genome.fasta
find . -regex ".*evm.out.gff3" -exec cat {} \; > EVM.all.gff3
/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/gff3_file_to_proteins.pl EVM.all.gff3 ../evm/genome.fasta prot > genome_protein.fasta
5. Transcriptome gene prediction
5.1 De novo prediction by TransDecoder
TransDecoder.LongOrfs -t WL_second_all_filter_Unigene.fasta -m 20
TransDecoder.Predict -t WL_second_all_filter_Unigene.fasta --cpu 12
5.2 De novo prediction by GeneMarkS-T
/opt/biosoft/GeneMarkS-T/gmst.pl --output M.wulii_gmst --fnn --faa -clean 1 WL_second_all_filter_Unigene.fasta
5.3 Homolog-based prediction by Hercules (https://github.com/qingxiangguo/hercules-v.1.0)
blastx -query WL_second_all_filter_Unigene.fasta -db nr -max_target_seqs 20 -out nr_result -evalue 1e-5 -num_threads 48 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle frames sscinames sskingdoms"
blastx -query WL_second_all_filter_Unigene.fasta -db kog -max_target_seqs 20 -out kog_result -evalue 1e-5 -num_threads 48 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle frames sscinames sskingdoms"
blastx -query WL_second_all_filter_Unigene.fasta -db swiss -max_target_seqs 20 -out swiss_result -evalue 1e-5 -num_threads 48 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle frames sscinames sskingdoms"
blastx -query WL_second_all_filter_Unigene.fasta -db eggnog -max_target_seqs 20 -out eggnog_result -evalue 1e-5 -num_threads 48 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle frames sscinames sskingdoms"
for next in $(cut -f1 nr_result | sort -u); do grep -w -m 20 "$next" nr_result; done > tmp
sort_blast_by_query_name.pl tmp
mv sorted_output nr && rm tmp
for next in $(cut -f1 eggnog_result | sort -u); do grep -w -m 20 "$next" eggnog_result; done > tmp
sort_blast_by_query_name.pl tmp
mv sorted_output eggnog && rm tmp
for next in $(cut -f1 swiss_result | sort -u); do grep -w -m 20 "$next" swiss_result; done > tmp
sort_blast_by_query_name.pl tmp
mv sorted_output swiss && rm tmp
for next in $(cut -f1 kog_result | sort -u); do grep -w -m 20 "$next" kog_result; done > tmp
sort_blast_by_query_name.pl tmp
mv sorted_output kog && rm tmp
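The loops above keep at most 20 hits per query by running grep once for every query ID, which rescans the whole result file each time and may also match an ID that appears in another column. As a hedged alternative, the same cap can be applied in a single pass with awk, assuming the hits of each query should be kept in their original file order. The function name top_n_per_query and the toy data are illustrative.

```shell
# Sketch only: keep the first N rows per query ID (column 1) in one pass.
set -eu

top_n_per_query() {
    # $1 = tabular BLAST output, $2 = maximum number of hits per query
    awk -F'\t' -v n="$2" '++count[$1] <= n' "$1"
}

# Toy demonstration: cap at 2 hits per query
workdir=$(mktemp -d)
printf 'q1\th1\nq1\th2\nq1\th3\nq2\th4\n' > "$workdir/result.tsv"
top_n_per_query "$workdir/result.tsv" 2
```

Because the counter is keyed on the exact value of column 1, a query named WL_1 is not confused with WL_10 or with a subject ID that contains the same text.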
hercules nr swiss egg kog
anvi-gen-contigs-database -f contigs.fa -o contigs.db --external-gene-calls gene_call
anvi-get-aa-sequences-for-gene-calls -c contigs.db -o homolog.fa
5.4 Six-frame translation
cp ../trans_homolog/gene_call ./
grep ">" ../trans_denovo/M.wulii_gmst.faa > gmst
grep ">" ../trans_denovo/WL_second_all_filter_Unigene.fasta.transdecoder.pep > transdecoder
cat gene_call | cut -f2 > gene_call_header
perl -p -i -e "s/\s.*//" transdecoder
perl -p -i -e "s/>Gene.\d+:://" transdecoder
perl -p -i -e "s/:.*//" transdecoder
perl -p -i -e "s/\s.*//" gmst
perl -p -i -e "s/>//" gmst
cat gene_call_header gmst transdecoder > annotation_list
remove_duplicate.pl annotation_list
rm annotation_list
mv duplicate_remove annotation_list
remove_contaminant_by_ID.pl WL_second_all_filter_Unigene.fasta annotation_list
mv survive.fasta trans_left.fasta
transeq -sequence trans_left.fasta -outseq out -frame 6
flat_the_fasta_seq.pl out
rm out
get_seq_between_asterisk.pl flated 30
mv between_asterisk trans_left.pep
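The six-frame step translates the transcripts that received no prediction and then keeps the peptide stretches between stop codons (printed as asterisks by transeq) that are at least 30 residues long. get_seq_between_asterisk.pl is not shown, so the following is only a sketch of the idea. It assumes the input FASTA has one sequence line per record (as produced by the flattening step) and keeps every stretch delimited by asterisks or sequence ends. The function name split_at_stops and the output naming scheme are hypothetical.

```shell
# Sketch only: split translated sequences at stop codons ("*") and keep
# segments of at least a minimum length. Treatment of terminal segments
# in the original helper script is an assumption.
set -eu

split_at_stops() {
    # $1 = FASTA with one sequence line per record, $2 = minimum length
    awk -v min="$2" '
        /^>/ { id = substr($1, 2); next }
        {
            n = split($0, seg, "*")
            for (i = 1; i <= n; i++)
                if (length(seg[i]) >= min)
                    printf ">%s_%d\n%s\n", id, i, seg[i]
        }
    ' "$1"
}

# Toy demonstration with a minimum length of 4
workdir=$(mktemp -d)
printf '>t1_1\nMAAA*KK*MLLLLL\n' > "$workdir/flat.fasta"
split_at_stops "$workdir/flat.fasta" 4
```

In the protocol the threshold is 30, which discards very short peptides that are unlikely to be identified reliably.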
6.1 Collect all predicted proteins
cat genome_protein.fasta WL_second_all_filter_Unigene.fasta.transdecoder.pep M.wulii_gmst.faa homolog.fa trans_left.pep > all.fasta
6.2 Remove redundancy and filter by length
cd-hit -i all.fasta -o all_cdhit.fasta -c 1 -T 4 -M 0
filter_fasta_by_length.pl all_cdhit.fasta 30 1000000 filtered.fasta
6.3 Give name tag to CCPRD
replace_fasta_header_by_number.pl filtered.fasta WL_MC
mv ordered.fasta CCPRD
7.1 Transcriptome six-frame translation
transeq -sequence WL_second_all_filter_Unigene.fasta -outseq out -frame 6 -clean
remove_no_end_asterisks.pl out
flat_the_fasta_seq.pl out
rm out
get_seq_between_asterisk.pl flated 30
cd-hit -i between_asterisk -o out -c 1 -M 160000 -T 8
replace_fasta_header_by_number.pl out WL_T6
mv ordered.fasta trans_6_frame
7.2 Genome and transcriptome six-frame translation
cat genome.fasta WL_second_all_filter_Unigene.fasta > all.fasta
transeq -sequence all.fasta -outseq out -frame 6 -clean
remove_no_end_asterisks.pl out
flat_the_fasta_seq.pl out
rm out
get_seq_between_asterisk.pl flated 30
cd-hit -i between_asterisk -o out -c 1 -M 160000 -T 8
replace_fasta_header_by_number.pl out WL_A6
7.3 CCPRD + contaminants
7.4 CCPRD + sequences removed in decontamination process
1. Download and process the transcriptomes and genomes
1.1 Download the source data
mkdir 1_source_data && cd 1_source_data
wget https://sra-download.ncbi.nlm.nih.gov/traces/wgs03/wgs_aux/GF/JR/GFJR01/GFJR01.1.fsa_nt.gz
gunzip GFJR01.1.fsa_nt.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
gunzip GCF_000146045.2_R64_genomic.fna.gz
1.2 Remove redundancy in transcriptomes with CD-HIT
cd ..
mkdir 2_cluster_mRNA && cd 2_cluster_mRNA
ln -s ../1_source_data/GFJR01_mRNA.fasta ./
cd-hit-est -i GFJR01_mRNA.fasta -o yeast.fasta_cdhit -c 0.95 -n 10 -d 0 -M 16000 -T 8
1.3 Further cluster the transcriptomes with TGICL
tgicl -l 40 -c 10 -v 25 -O '-repeat_stringency 0.95 -minmatch 35 -minscore 35' -F yeast.fasta_cdhit
1.4 Process the CD-HIT and TGICL results to get unigenes
cat asm_1/contigs asm_2/contigs asm_3/contigs asm_4/contigs asm_5/contigs asm_6/contigs asm_7/contigs asm_8/contigs asm_9/contigs asm_10/contigs > all.contigs
fast_extract_seq_from_fasta.pl yeast.fasta_cdhit yeast.fasta_cdhit.singletons > extracted.fasta
Unigene_generator.pl -s extracted.fasta -c all.contigs -t YT
2. Remove potential bacterial contamination in transcriptome data
2.1 Collect proteins to construct a non-redundant bacterial database
2.2 Remove redundancy in bacterial database and blast the transcriptomes
cd ..
mkdir 3_decontam_mRNA
cd 3_decontam_mRNA/
cd-hit -i bac_protein.fasta -o bac_protein_cdhit.fasta -c 0.95 -n 5 -T 8
makeblastdb -in bac_protein_cdhit.fasta -out BAC -dbtype prot -parse_seqids -hash_index
blastx -query ../../2_cluster_mRNA/YT_Unigene.fasta -db BAC -out result_bac_e10 -evalue 1e-10 -outfmt 6 -num_threads 32
2.3 First round of bacterial contamination removal and confirmation
cat result_bac_e10 | cut -f1 > 1
remove_duplicate.pl 1
mv duplicate_remove bacteria_contam_list
extract_seq_from_fasta.pl ../../2_cluster_mRNA/YT_Unigene.fasta bacteria_contam_list
mv extracted.fasta bacteria_contam_first.fasta
blastx -query bacteria_contam_first.fasta -db nr -out bacteria_contam_first_nr_result -evalue 1e-5 -max_target_seqs 1 -num_threads 56 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle frames sscinames sskingdoms"
grep "Bacteria" bacteria_contam_first_nr_result | cut -f 1 > 2
remove_duplicate.pl 2 && rm 2
mv duplicate_remove true_bacteria_contam.list
remove_contaminant_by_ID.pl WL_host_delete_Unigene.fasta true_bacteria_contam.list
3. Genome gene prediction
mkdir 4_MS_database
cd 4_MS_database
mkdir all_6_frame CCPRD CCPRD_contam trans_6_frame
cd 4_MS_database/CCPRD
mkdir genome_denovo genome_homolog genome_pasa trans_denovo trans_homolog trans_left_six_frame
3.1 GeneMark-ES prediction
cd 4_MS_database/CCPRD/genome_denovo/GeneMark-ES
ln -s ~/examples/1_source_data/GCF_000146045.2_R64_genomic.fna ./genome.fasta
gmes_petap.pl --sequence genome.fasta --ES --fungus --cores 4
gtf_to_gff3_format.pl genemark.gtf genome.fasta > genemark.gff3
3.2 Augustus gene prediction
cd 4_MS_database/CCPRD/genome_denovo/AUGUSTUS
ln -s ~/examples/1_source_data/GCF_000146045.2_R64_genomic.fna ./genome.fasta
augustus --species=saccharomyces_cerevisiae_S288C --gff3=on genome.fasta > aug.gff3
3.3 SNAP gene prediction
cd 4_MS_database/CCPRD/genome_denovo/SNAP
ln -s ../GeneMark-ES/genome.fasta ./genome.fasta
ln -s ../GeneMark-ES/genemark.gff3 ./
maker2zff genemark.gff3
extract_header_for_snap.pl genome.dna
fastaindex genome.fasta genome.idx
fastafetch -f genome.fasta -i genome.idx -Fq <(sort -u header) > out
mv out genome.dna
fathom -categorize 1000 genome.ann genome.dna
fathom uni.ann uni.dna -export 1000 -plus
mkdir params
cd params/
forge ../export.ann ../export.dna
cd ..
hmm-assembler.pl species params/ > species.hmm
snap species.hmm genome.fasta -gff -quiet > snap.gff
snap2gff3.pl snap.gff > snap_true.gff
3.4 Homology-based gene prediction by Genewise
/opt/biosoft/homolog_genewise/homolog_genewise.pl rename_all.fasta genome.hardmaskN.fasta 8 0.1 1e-9
/opt/biosoft/homolog_genewise/genewise_filter.pl genewise.gff genome.hardmaskN.fasta 15 90 1 1e-6 0.30 4 > genewise.filter.gff 2> genewise.filter.stats
/opt/biosoft/homolog_genewise/genewise2EVM_input.pl genewise.filter.gff genome.hardmaskN.fasta filterMiddleStopCodon=yes > evm_protein_alignment.gff3 2> genewise_gene_models_completeness_check.txt
perl -p -i -e 's/^#.*//; s/^\s*$//' genewise.gff
make_evm_recognize_gff.pl genewise.gff
/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/misc/SNAP_to_GFF3.pl evm_wise.gff3 > final_evm_wise.gff3
3.5 Gene prediction by PASA
cd 4_MS_database/CCPRD/genome_pasa
perl -e 'while (<>) { print "$1\n" if />(\S+)/ }' YT_Unigene.fasta > tdn.accs
seqclean YT_Unigene.fasta -v /opt/biosoft/PASApipeline-2.0.2/seqclean/UniVec
cp /opt/biosoft/PASApipeline-2.0.2/pasa_conf/pasa.alignAssembly.Template.txt alignAssembly.config
DATE=$(date +%Y%m%e%k%M%S | perl -pe 's/\s+//g')
echo "perl -p -i -e 's/MYSQLDB=.*/MYSQLDB=pasa_$DATE/' alignAssembly.config" | sh
/opt/biosoft/PASApipeline-2.0.2/scripts/create_mysql_cdnaassembly_db.dbi -r -c alignAssembly.config -S /opt/biosoft/PASApipeline-2.0.2/schema/cdna_alignment_mysqlschema
/opt/biosoft/PASApipeline-2.0.2/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -R -g ../genome_denovo/GeneMark-ES/genome.fasta -t YT_Unigene.fasta.clean -T -u ../../../2_cluster_mRNA/YT_Unigene.fasta --ALIGNERS gmap,blat --CPU 8 --stringent_alignment_overlap 30.0 --TDN tdn.accs --MAX_INTRON_LENGTH 20000 --TRANSDECODER &> pasa.log
3.6 Combine above gff3 file by EVM
/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl aug.gff3 > evm_augustus.gff3
perl -p -i -e 's/^#.*//; s/^\s*$//' evm_augustus.gff3
gff3_gene_prediction_file_validator.pl evm_augustus.gff3
ln -s 4_MS_database/CCPRD/genome_denovo/GeneMark-ES/genemark.gff3 evm_genemark-es.gff3
gff3_gene_prediction_file_validator.pl evm_genemark-es.gff3
/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/misc/SNAP_output_to_gff3.pl snap.zff genome.fasta > snap.gff3
gff3_gene_prediction_file_validator.pl snap.gff3
perl -p -i -e 's/^(\S+)\t(.)\t/$1\tSNAP\t/g' snap.gff3
cp ../new_pasa/pasa*.pasa_assemblies.gff3 ./transcript_alignments.gff3
perl -p -i -e 's/\t\S+/\tpasa_transcript_alignments/' transcript_alignments.gff3
gff3_gene_prediction_file_validator.pl transcript_alignments.gff3
ln -s ../genewise/wise/evm_protein_alignment.gff3 ./protein_alignments.gff3
gff3_gene_prediction_file_validator.pl protein_alignments.gff3
cat evm_augustus.gff3 evm_genemark-es.gff3 snap.gff3 | perl -pe 's/^#.*//; s/^\s*$//' > gene_predictions.gff3
echo -e "ABINITIO_PREDICTION\tAugustus\t6
ABINITIO_PREDICTION\tSNAP\t2
ABINITIO_PREDICTION\tGeneMark.hmm\t1
PROTEIN\tGeneWise\t5
TRANSCRIPT\tpasa_transcript_alignments\t10" > weights.txt
partition_EVM_inputs.pl --genome genome.fasta --gene_predictions gene_predictions.gff3 --protein_alignments protein_alignments.gff3 --transcript_alignments transcript_alignments.gff3 --repeats genome.repeat.gff3 --segmentSize 500000 --overlapSize 10000 --partition_listing partitions_list.out
write_EVM_commands.pl --genome genome.fasta --gene_predictions gene_predictions.gff3 --protein_alignments protein_alignments.gff3 --transcript_alignments transcript_alignments.gff3 --repeats genome.repeat.gff3 --weights "$(pwd)/weights.txt" --partitions partitions_list.out --output_file_name evm.out > commands.list
ParaFly -c commands.list -CPU 4
recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out
convert_EVM_outputs_to_GFF3.pl --partitions partitions_list.out --output_file_name evm.out --genome genome.fasta
find . -regex ".*evm.out.gff3" -exec cat {} \; > EVM.all.gff3
/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/gff3_file_to_proteins.pl EVM.all.gff3 ../evm/genome.fasta prot > genome_protein.fasta
4.Transcriptome gene prediction
4.1 De novo prediction by TransDecoder
cd 4_MS_database/CCPRD/trans_denovo
TransDecoder.LongOrfs -t ../../../2_cluster_mRNA/YT_Unigene.fasta -m 20
TransDecoder.Predict -t ../../../2_cluster_mRNA/YT_Unigene.fasta --cpu 12
4.2 De novo prediction by GeneMarkS-T
gmst.pl --output YT_gmst --fnn --faa -clean 1 ../../../2_cluster_mRNA/YT_Unigene.fasta
4.3 Homolog-based prediction by Hercules (https://github.com/qingxiangguo/hercules-v.1.0)
cd 4_MS_database/CCPRD/trans_homolog
blastx -query ../../../2_cluster_mRNA/YT_Unigene.fasta -db KOG -max_target_seqs 20 -out kog_result -evalue 1e-5 -num_threads 48 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle frames sscinames sskingdoms"
blastx -query ../../../2_cluster_mRNA/YT_Unigene.fasta -db swiss -max_target_seqs 20 -out swiss_result -evalue 1e-5 -num_threads 48 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle frames sscinames sskingdoms"
for next in $(cut -f1 swiss_result | sort -u); do grep -w -m 20 "$next" swiss_result; done > tmp
sort_blast_by_query_name.pl tmp
mv sorted_output swiss && rm tmp
for next in $(cut -f1 kog_result | sort -u); do grep -w -m 20 "$next" kog_result; done > tmp
sort_blast_by_query_name.pl tmp
mv sorted_output kog && rm tmp
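The for/grep loops above re-read the whole result file once per query, which can be slow for large transcriptomes. A single-pass alternative can be sketched as follows, assuming the hits for each query are listed best-first, as BLAST writes them. The function name top_hits_per_query is hypothetical. Unlike grep -w, which matches the ID as a whole word anywhere on the line, this sketch matches the query ID exactly in column 1, so the two approaches may differ on edge cases.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep at most N hits per query (column 1) in one pass
# over BLAST tabular output. Input order is preserved.
top_hits_per_query() {
    local n=$1 file=$2
    # count[$1] is incremented for every line; the line is printed while
    # the running count for that query is still <= n.
    awk -F'\t' -v n="$n" '++count[$1] <= n' "$file"
}

# Example call with the file names used in this protocol:
# top_hits_per_query 20 swiss_result > tmp
```

The output can then be passed to sort_blast_by_query_name.pl as in the commands above.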
hercules nr swiss egg kog
anvi-gen-contigs-database -f ../../../2_cluster_mRNA/YT_Unigene.fasta -o contigs.db --external-gene-calls gene_call
anvi-get-aa-sequences-for-gene-calls -c contigs.db -o homolog.fa
4.4 Six-frame translation
cd 4_MS_database/CCPRD/trans_left_six_frame
cp ../trans_homolog/gene_call ./
grep ">" ../trans_denovo/YT_gmst.faa > gmst
grep ">" ../trans_denovo/YT_Unigene.fasta.transdecoder.pep > transdecoder
cat gene_call | cut -f2 > gene_call_header
perl -p -i -e "s/\s.*//" transdecoder
perl -p -i -e "s/>Gene.\d+:://" transdecoder
perl -p -i -e "s/:.*//" transdecoder
perl -p -i -e "s/\s.*//" gmst
perl -p -i -e "s/>//" gmst
cat gene_call_header gmst transdecoder > annotation_list
remove_duplicate.pl annotation_list
rm annotation_list
mv duplicate_remove annotation_list
remove_contaminant_by_ID.pl ../../../2_cluster_mRNA/YT_Unigene.fasta annotation_list
mv survive.fasta trans_left.fasta
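The commands above build a list of transcript IDs that already have a predicted protein and then keep only the transcripts that are not on that list. The internals of remove_contaminant_by_ID.pl are not shown here, so the following is only a sketch of that kind of filter, assuming one ID per line and matching on the first word of each FASTA header. The function name drop_listed_ids is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every FASTA record whose ID (first word of the
# header, without ">") is NOT present in the ID list.
drop_listed_ids() {
    local ids=$1 fasta=$2
    awk '
        FILENAME == ARGV[1] { drop[$1] = 1; next }     # ID list: one ID per line
        /^>/ { keep = !(substr($1, 2) in drop) }       # header: decide for this record
        keep                                           # print lines of kept records
    ' "$ids" "$fasta"
}

# Example call with the file names used in this protocol:
# drop_listed_ids annotation_list ../../../2_cluster_mRNA/YT_Unigene.fasta > trans_left.fasta
```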
transeq -sequence trans_left.fasta -outseq out -frame 6
flat_the_fasta_seq.pl out
rm out
get_seq_between_asterisk.pl flated 30
mv between_asterisk trans_left.pep
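The last three commands flatten the transeq output and keep the peptide stretches delimited by stop codons (written as asterisks) that are at least 30 residues long. The sketch below illustrates the idea in awk. It is not the code of get_seq_between_asterisk.pl: the function name split_at_stops and the output header format (ID plus a running fragment number) are hypothetical, and the sketch also keeps fragments at the start and end of a frame, which the helper script may treat differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split each translated frame at stop codons ("*") and
# print fragments with at least MIN residues as separate FASTA records.
split_at_stops() {
    local min=$1 fasta=$2
    awk -v min="$min" '
        function flush(    i, n, parts, k) {
            if (id == "") return
            n = split(seq, parts, "[*]")               # fragments between stops
            k = 0
            for (i = 1; i <= n; i++)
                if (length(parts[i]) >= min)
                    printf ">%s_%d\n%s\n", id, ++k, parts[i]
        }
        /^>/ { flush(); id = substr($1, 2); seq = ""; next }   # new record
        { seq = seq $0 }                                        # join wrapped lines
        END { flush() }
    ' "$fasta"
}

# Example call on the raw transeq output of this step:
# split_at_stops 30 out > trans_left.pep
```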
5.Combine all the proteins predicted from genomes and transcriptomes
5.1 Collect all predicted proteins
cd 4_MS_database/CCPRD/final
cat genome_protein.fasta YT_Unigene.fasta.transdecoder.pep YT_gmst.faa homolog.fa trans_left.pep > all.fasta
5.2 Remove redundancy and filter by length
cd-hit -i all.fasta -o all_cdhit.fasta -c 1 -T 4 -M 0
filter_fasta_by_length.pl all_cdhit.fasta 30 1000000 filtered.fasta
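After redundancy removal and length filtering, it can be useful to confirm that the result falls inside the expected length range. The sketch below reports the number of sequences and the shortest and longest length in a FASTA file. It assumes that the two numeric arguments of filter_fasta_by_length.pl are the minimum and maximum length, which is an interpretation of the command above rather than documented behavior. The function name fasta_length_stats is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<count> TAB <min length> TAB <max length>"
# for a FASTA file (wrapped sequence lines are summed per record).
fasta_length_stats() {
    awk '
        function tally() {
            if (!have) return                  # nothing read yet
            n++
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }
        /^>/ { tally(); have = 1; len = 0; next }
        { len += length($0) }
        END { tally(); printf "%d\t%d\t%d\n", n, min, max }
    ' "$1"
}

# Example call with the file name used in this protocol:
# fasta_length_stats filtered.fasta
```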
5.3 Give name tag to CCPRD
replace_fasta_header_by_number.pl filtered.fasta YT_CC
mv ordered.fasta CCPRD
6.Create alternative databases for comparison
6.1 Transcriptome six-frame translation
cd 4_MS_database/trans_6_frame
transeq -sequence ../../2_cluster_mRNA/YT_Unigene.fasta -outseq out -frame 6
remove_no_end_asterisks.pl out
flat_the_fasta_seq.pl out
rm out
get_seq_between_asterisk.pl flated 30
cd-hit -i between_asterisk -o out -c 1 -M 160000 -T 8
replace_fasta_header_by_number.pl out YT_T6
mv ordered.fa trans_6_frame
6.2 Genome and transcriptome six-frame translation
cd 4_MS_database/all_6_frame
cat ~/examples/1_source_data/GCF_000146045.2_R64_genomic.fna ../../2_cluster_mRNA/YT_Unigene.fasta > all.fasta
transeq -sequence all.fasta -outseq out -frame 6
flat_the_fasta_seq.pl out
rm out
get_seq_between_asterisk.pl flated 30
cd-hit -i between_asterisk -o out -c 1 -M 160000 -T 8
replace_fasta_header_by_number.pl out YT_A6
mv ordered.fa all_6_frame
6.3 CCPRD + contaminants
cat CCPRD BAC.fa > CCPRD_contam
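Because the databases built in this section are meant for comparison with CCPRD, a quick count of entries per database gives a first impression of their relative sizes. The following is a minimal sketch; the function name count_entries is hypothetical, and the paths in the example call are placeholders that depend on where each database was written.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<file> TAB <number of FASTA entries>" for each
# database given on the command line.
count_entries() {
    local db
    for db in "$@"; do
        # grep -c counts header lines, i.e. lines starting with ">"
        printf '%s\t%s\n' "$db" "$(grep -c '^>' "$db")"
    done
}

# Example call (adjust the paths to your directory layout):
# count_entries CCPRD CCPRD_contam ../trans_6_frame/trans_6_frame ../all_6_frame/all_6_frame
```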
Because CCPRD is a pipeline that depends on several bioinformatics tools, publications reporting results obtained with CCPRD should cite not only CCPRD but also the tools it uses.
All source code, i.e. scripts/*.pl, scripts/*.sh or scripts/*.py, is under the MIT license.