soedinglab / metaeuk

MetaEuk - sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics
GNU General Public License v3.0

Split big CDS to subCDS using gff #52

Open ucabuk opened 1 year ago

ucabuk commented 1 year ago

Hello Eli,

I want to split one large predicted protein into its exons according to the GFF file. I have four output files produced by MetaEuk: .fas, .codon.fas, .headersMap.tsv and .gff.

In the GFF file, the CDS coordinates are relative to the assembled contig, so I could not find where each exon ends within the protein (.fas) output. Basically, this is what I want to do:

This protein contains more than one exon. I want to convert

>UniRef50_A0A699GG08|k127_10391|-|2222|0|11|36114|60000|60000[60000]:59629[59629]:372....
MTNSTHFGYQTVAEEEKVHKVAEVFHSVAAKYDVMNDVMSAGLHRLWKTFTIAQAGIRPGFKVLDIAGGTGDLAKAFAKKAGPTGEVWLTDINESMLRVGRDRLLNNG......

to

>UniRef50_A0A699GG08|k127_10391_CDS0
MTNSTHFGYQTVAEEEKVHKV
>UniRef50_A0A699GG08|k127_10391_CDS1
AEEEKVHKVAEVFHSVAAKYDVM
>UniRef50_A0A699GG08|k127_10391_CDS2
YDVMNDVMSAGLHRLWKTFTIA
>UniRef50_A0A699GG08|k127_10391_CDS3
DLAKAFAKKAGPTGEVWLTDINESMLRVGRDRLLNNG
....

I could not find this information in the MetaEuk GFF file. Its coordinates are relative to the contig, so I can use them to split the sequence in the .codon.fas file, but not the protein in the .fas output.

k127_10391 MetaEuk CDS 59630 60001 186 - . ID=UniRef50_A0A699GG08;TCS_ID=UniRef50_A0A699GG08|k127_10391|-|36114_CDS_0;Parent=UniRef50_A0A699GG08|k127_10391|-|36114_exon_0
k127_10391 MetaEuk CDS 58374 59321 203 - . ID=UniRef50_A0A699GG08;TCS_ID=UniRef50_A0A699GG08|k127_10391|-|36114_CDS_1;Parent=UniRef50_A0A699GG08|k127_10391|-|36114_exon_1
k127_10391 MetaEuk CDS 56729 57589 462 - . ID=UniRef50_A0A699GG08;TCS_ID=UniRef50_A0A699GG08|k127_10391|-|36114_CDS_2;Parent=UniRef50_A0A699GG08|k127_10391|-|36114_exon_2
k127_10391 MetaEuk CDS 50451 50633 126 - . ID=UniRef50_A0A699GG08;TCS_ID=UniRef50_A0A699GG08|k127_10391|-|36114_CDS_3;Parent=UniRef50_A0A699GG08|k127_10391|-|36114_exon_3

Does MetaEuk provide any coordinate information for splitting a large coding sequence into its exons?

Thank you !

elileka commented 1 year ago

Hi,

I am not 100% sure I have understood your need, so please correct me if I am wrong. It seems you wish to split each single fasta record into multiple records, one for each exon. If so, then indeed, MetaEuk does not provide this kind of output, but it should be possible to write a script that creates this fasta from the original fasta file*. Each exon is described in the fasta header, separated by pipes from the other exons. The numbers given for each exon are the original coordinates on the contig (please note the possible short overlap between exons; there is one between the first and second in your example). Also note that, unlike the coordinates reported in the MetaEuk header, the GFF coordinates start at index 1, as is standard for that format. The header is documented here: https://github.com/soedinglab/metaeuk#the-metaeuk-header

*I could assist with this, if needed
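For anyone landing here later, the kind of script described above could be sketched roughly as follows. This is only a sketch under stated assumptions, not MetaEuk functionality: it assumes the header layout seen in this thread (target|contig|strand|score|E-value|number of exons|low|high|one field per exon), that each exon field has the form start[taken_start]:end[taken_end]:length[taken_length], and that the bracketed "taken" nucleotide lengths add up to the joined coding sequence. The function names `parse_exons` and `split_protein` and the demo header are made up for illustration.

```python
import re
from typing import List, Tuple

# Assumed exon field layout: start[taken_start]:end[taken_end]:length[taken_length].
# The bracketed parts are treated as optional and fall back to the plain values.
EXON_RE = re.compile(
    r"^(\d+)(?:\[(\d+)\])?:(\d+)(?:\[(\d+)\])?:(\d+)(?:\[(\d+)\])?$"
)


def parse_exons(header: str) -> List[Tuple[int, int, int]]:
    """Return (taken_start, taken_end, taken_nucleotide_length) for each exon."""
    fields = header.lstrip(">").split("|")
    n_exons = int(fields[5])  # assumed: the sixth field is the number of exons
    exon_fields = fields[8:8 + n_exons]
    if len(exon_fields) != n_exons:
        raise ValueError("header lists fewer exon fields than it declares")
    exons = []
    for field in exon_fields:
        match = EXON_RE.match(field.strip())
        if match is None:
            raise ValueError(f"unexpected exon field: {field!r}")
        start, t_start, end, t_end, length, t_length = match.groups()
        exons.append(
            (int(t_start or start), int(t_end or end), int(t_length or length))
        )
    return exons


def split_protein(header: str, protein: str) -> List[Tuple[str, str]]:
    """Cut the joined protein into one (name, sequence) record per exon."""
    fields = header.lstrip(">").split("|")
    target, contig = fields[0], fields[1]
    records = []
    nt_offset = 0
    for i, (_, _, nt_len) in enumerate(parse_exons(header)):
        aa_start = nt_offset // 3
        nt_offset += nt_len
        # Floor division: a codon that straddles an exon boundary is
        # assigned to the downstream exon.
        aa_end = nt_offset // 3
        records.append((f"{target}|{contig}_CDS{i}", protein[aa_start:aa_end]))
    return records


if __name__ == "__main__":
    # Made-up two-exon header and protein, for illustration only.
    demo_header = ">T1|c1|-|100|0|2|0|29|29[29]:18[18]:12[12]|11[8]:0[0]:12[9]"
    for name, seq in split_protein(demo_header, "MTNSAEV"):
        print(f">{name}\n{seq}")
```

Because the cut points come from the "taken" lengths, the resulting pieces do not overlap, unlike the hand-made example in the question. The same nucleotide offsets can be applied directly to the matching .codon.fas record, which is worth doing as a sanity check that the lengths add up for your data.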