hputnam closed 4 years ago
I need a script that grabs the first instance of each gene name and its sequence information. I would like this only for the first transcript per gene. Input file: Structural_annotation_abintio.gff
I am looking for 2 output files, one with the CDS sequences and one with the predicted protein sequences. I have attached a screenshot of the file below.
Example output desired for CDS fasta file
g1 CDS sequence
g2 CDS sequence
Example output desired for Predicted protein fasta file
g1 protein sequence
g2 protein sequence
@sr320 This is the code I suggest for extracting the CDS and protein sequences from the Augustus output:
https://hputnam.github.io/Putnam_Lab_Notebook/Fasta_from_augustus/
perl getAnnoFasta.pl Structural_annotation_abintio.gff
# Output: Structural_annotation_abintio.aa (protein sequences)
# Output: Structural_annotation_abintio.codingseq (coding sequences)
# Keep only records whose header ends in ".t1" (an end-anchored match, so ".t10", ".t11", ... are not kept)
awk '/^>/ {keep = /\.t1$/} keep' Structural_annotation_abintio.aa > Pact_T1_Structural_annotation_abintio.aa
sed 's/\..*$//' Pact_T1_Structural_annotation_abintio.aa > Pact_protein.fa
# Keep only records whose header ends in ".t1" (an end-anchored match, so ".t10", ".t11", ... are not kept)
awk '/^>/ {keep = /\.t1$/} keep' Structural_annotation_abintio.codingseq > Pact_T1_Structural_annotation_abintio.codingseq
sed '/\..*\./s/^[^.]*\./>/' Pact_T1_Structural_annotation_abintio.codingseq > Pact_CDS.fas
sed 's/\..*$//' Pact_CDS.fas > Pact_CDS.fa
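For reference, the filter and the header clean-up above can probably be folded into one awk pass. This is only a minimal sketch run on toy data, not a tested replacement for the commands above. It assumes headers look like `>g1.t1` in the .aa file and `>seqname.g1.t1` in the .codingseq file, which is what the sed expressions above imply. The function name `first_transcript_fasta` and the toy file names are made up for illustration.

```shell
# Hypothetical helper: keep only ".t1" records and reduce each header to the gene ID.
# Assumes headers end in "<gene>.t<N>", e.g. ">g1.t1" or ">scaffold1.g1.t1".
first_transcript_fasta() {
  awk '
    /^>/ {
      keep = ($0 ~ /\.t1$/)                    # exactly transcript 1, so ".t10" is skipped
      if (keep) {
        n = split(substr($0, 2), part, "[.]")  # split the header (without ">") on dots
        print ">" part[n - 1]                  # second-to-last field is the gene ID
      }
      next
    }
    keep                                       # print sequence lines of a kept record
  ' "$1"
}

# Toy input standing in for the real .aa and .codingseq files
tmp=$(mktemp -d)
printf '>g1.t1\nMKT\n>g1.t2\nMKA\n>g2.t1\nMLL\nPQ\n>g2.t10\nMXX\n' > "$tmp/toy.aa"
printf '>scaffold1.g1.t1\natgaaa\n>scaffold1.g1.t2\natgaag\n>scaffold2.g2.t1\natgctg\n' > "$tmp/toy.codingseq"

first_transcript_fasta "$tmp/toy.aa" > "$tmp/toy_protein.fa"
first_transcript_fasta "$tmp/toy.codingseq" > "$tmp/toy_CDS.fa"
cat "$tmp/toy_protein.fa" "$tmp/toy_CDS.fa"
```

On the real files the call would be something like `first_transcript_fasta Structural_annotation_abintio.aa > Pact_protein.fa`. Taking the second-to-last dot-separated field as the gene ID means a sequence name that itself contains dots should not break the header. A quick sanity check is to compare `grep -c '^>'` on the two outputs, since both should contain one record per gene.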
My PD Tejashree wrote a script that does this in a single step: https://github.com/tejashree1modak/AUGUSTUS-helpers/blob/master/get-fasta.sh
Extract the coding sequences and protein sequences into 2 separate files, to parallel the Mcap approach, and generate a predicted protein sequence file for annotation.