hputnam / Meth_Compare

5 stars 2 forks source link

Pull gene sequence and protein sequence from Pact Structural_annotation_abintio.gff #71

Closed hputnam closed 4 years ago

hputnam commented 4 years ago

Extract the coding sequence and protein sequences into 2 separate files to parallel the Mcap approach and generate predicted protein sequence file for annotation

hputnam commented 4 years ago

gff_screenshot

hputnam commented 4 years ago

need to write a script to grab the first instance of a gene name and its sequence information. would like this only for the first transcript per gene. Structural_annotation_abintio.gff

looking for 2 output files, one with CDS sequence and one with predicted protein sequence. I have attached a screenshot of the file below.

Example output desired for CDS fasta file

g1 CDS sequence g2 CDS sequence

Example output desired for Predicted protein fasta file

g1 protein sequence g2 protein sequence

shellywanamaker commented 4 years ago

https://gannet.fish.washington.edu/seashell/bu-mox/data/froger/Pact_Genome/Data_to_downoload.rar

hputnam commented 4 years ago

https://github.com/nextgenusfs/augustus/blob/master/scripts/getAnnoFasta.pl

hputnam commented 4 years ago

@sr320 This is the code I suggest for extracting the CDS and protein seqs from Augustus

https://hputnam.github.io/Putnam_Lab_Notebook/Fasta_from_augustus/

hputnam commented 4 years ago

1. Use this perl script to extract fasta seq file for AUGUSTUS predicted genes and proteins

Perl Script

perl getAnnoFasta.pl Structural_annotation_abintio.gff

this results in 2 files:

Structural_annotation_abintio.aa Structural_annotation_abintio.codingseq

2. Files then need to be parsed for the first transcript only (e.g., g1.t1 ... gn.t1)

awk '/^>/ {P=index($0,".t1")==0} {if(!P) print} ' Structural_annotation_abintio.aa > Pact_T1_Structural_annotation_abintio.aa

sed 's/\..*$//' Pact_T1_Structural_annotation_abintio.aa > Pact_protein.fa

awk '/^>/ {P=index($0,".t1")==0} {if(!P) print} ' Structural_annotation_abintio.codingseq > Pact_T1_Structural_annotation_abintio.codingseq

sed '/\..*\./s/^[^.]*\./>/' Pact_T1_Structural_annotation_abintio.codingseq > Pact_CDS.fas

sed 's/\..*$//' Pact_CDS.fas > Pact_CDS.fa
hputnam commented 4 years ago

My PD Tejashree wrote a script that is single step. https://github.com/tejashree1modak/AUGUSTUS-helpers/blob/master/get-fasta.sh