lh3 / miniprot

Align proteins to genomes with splicing and frameshift
https://lh3.github.io/miniprot/
MIT License

Using miniprot projections as evidence class in evidencemodeler #17

Closed: yuzhenpeng closed this issue 1 year ago

yuzhenpeng commented 1 year ago

Hello, @lh3

Thank you for developing and maintaining miniprot.

I created miniprot projections for one of my newly sequenced assemblies, and I would like to use them in EVM for genome annotation. Would you be willing to share a script that converts miniprot output into a GFF3 file that EVM accepts? I am asking in case you already have a procedure for this.

Thanks in advance, Zhenpeng

lh3 commented 1 year ago

Miniprot outputs GFF3. What does EVM require?

CongLiu37 commented 1 year ago

Hello,

I have the same question. EvidenceModeler expects protein alignments in GFF3. An example file can be found here: https://evidencemodeler.github.io/

It would be great if miniprot could output both protein alignments and full-length gene structures with correct splice sites (gene, mRNA, exon, CDS, three_prime_cis_splice_site, five_prime_cis_splice_site) in two separate GFF3 files. GenomeThreader can generate such files, but it needs to be run twice with different settings (-intermedia/-skipalignments) and is slow. The protein alignments can be integrated with other evidence by EvidenceModeler, and the gene structures can be used as a training set for predictors like AUGUSTUS.

Many thanks!

Sincerely,

Cong

lh3 commented 1 year ago

The EVM example doesn't have the three_prime_cis_splice_site or the five_prime_cis_splice_site features. The only difference from the miniprot GFF seems to be the gene and exon features. Could you write a script to add these two features? If you can confirm that EVM requires them, I can add them.
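A script of the kind suggested here might look like the following sketch. It assumes that the miniprot GFF3 is tab-separated, that mRNA lines carry an ID attribute, and that CDS lines carry a Parent attribute. The function names and the ".gene" ID suffix are made up for illustration and are not part of miniprot or EVM. Since a protein alignment carries no UTR information, each exon is simply a copy of its CDS interval.

```python
def parse_attrs(field):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def add_gene_and_exon(lines):
    """Return GFF3 lines with a gene added per mRNA and an exon per CDS.

    Assumption: mRNA lines have ID=..., CDS lines have Parent=...
    Comment lines and anything that is not a 9-column record pass through.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            out.append(line)
            continue
        attrs = parse_attrs(cols[8])
        if cols[2] == "mRNA" and "ID" in attrs:
            # Hypothetical naming scheme: gene ID = mRNA ID + ".gene"
            gene_id = attrs["ID"] + ".gene"
            gene = cols[:2] + ["gene"] + cols[3:5] + [".", cols[6], ".", "ID=" + gene_id]
            out.append("\t".join(gene))
            # Link the mRNA to its new parent gene
            cols[8] = cols[8].rstrip(";") + ";Parent=" + gene_id
            out.append("\t".join(cols))
        elif cols[2] == "CDS" and "Parent" in attrs:
            # Exon mirrors the CDS interval; exons have no phase
            exon = cols[:2] + ["exon"] + cols[3:7] + [".", "Parent=" + attrs["Parent"]]
            out.append("\t".join(exon))
            out.append(line)
        else:
            out.append(line)
    return out
```

The function takes any iterable of lines, so it can be fed an open file handle directly and its result written out line by line.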

CongLiu37 commented 1 year ago

Hello,

The protein alignment GFF only has "match" features, and EVM requires ID and Target in the 9th field. It looks like this:

Contig1 nap-nr_minus_rice.fasta nucleotide_to_protein_match     8392    8470    50.00   -       .       ID=match.nap.nr_minus_rice.fasta.37;Target=RF|YP_440341.1|83716234|NC_007650 196 222
Contig1 nap-nr_minus_rice.fasta nucleotide_to_protein_match     7650    7786    26.09   -       .       ID=match.nap.nr_minus_rice.fasta.37;Target=RF|YP_440341.1|83716234|NC_007650 222 268

Contig1 nap-nr_minus_rice.fasta nucleotide_to_protein_match     8386    8509    26.83   -       .       ID=match.nap.nr_minus_rice.fasta.38;Target=RF|YP_099363.1|53713371|NC_006347 1 42
Contig1 nap-nr_minus_rice.fasta nucleotide_to_protein_match     7635    7786    24.00   -       .       ID=match.nap.nr_minus_rice.fasta.38;Target=RF|YP_099363.1|53713371|NC_006347 42 92

This is the second EVM example, the one for alignments, not the first one, which is a gene prediction GFF with gene/exon features.

So it would be great if miniprot could output two GFF files: one that contains alignments and looks like what I pasted above, and another that contains gene structures with gene, mRNA, exon, and CDS features.

Many thanks!

Sincerely,

Cong

lh3 commented 1 year ago

You can write a script to generate the second gff. This is highly evm specific and seems redundant. I don't add features for just one tool.
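As a rough illustration of such a script, the sketch below turns each CDS line of a miniprot GFF3 into a nucleotide_to_protein_match line shaped like the EVM example pasted earlier in the thread. It assumes that miniprot's CDS lines carry both a Parent and a Target attribute; lines lacking either are skipped. The function names are hypothetical. The score column is copied through unchanged, so it is miniprot's own score rather than the percent identity shown in the EVM example.

```python
def parse_attrs(field):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def to_evm_alignments(lines, source="miniprot"):
    """Convert CDS records into EVM-style protein alignment records.

    Assumption: each CDS line has Parent=<alignment id> and
    Target=<protein> <start> <end> in its 9th column.
    All segments of one alignment share the same ID, as in the EVM example.
    """
    out = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attrs(cols[8])
        if "Parent" not in attrs or "Target" not in attrs:
            continue  # cannot build ID/Target without both attributes
        new_attr = "ID={};Target={}".format(attrs["Parent"], attrs["Target"])
        out.append("\t".join([
            cols[0], source, "nucleotide_to_protein_match",
            cols[3], cols[4], cols[5], cols[6], ".", new_attr,
        ]))
    return out
```

Whether EVM accepts this output as is would need to be checked against its own validation tooling and the weights file, which must name the same source string used in the second column.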