Closed bheimbu closed 5 months ago
Hmmm I'll take a look at this today. IIRC the design was to mimic the output of Prodigal so I could merge GFFs together but I'll need to double check. I'm looking at the source code and I ended up commenting out the mRNA output line but they are parsed.
You'll have uncomment line 311 here:
https://github.com/jolespin/veba/blob/main/bin/scripts/compile_metaeuk_identifiers.py
I vaguely remember having an issue when running something downstream when including mRNA but I'll run some tests. If there's no issue then I'll restore this functionality. Apologies for the inconvenience.
Hi @jolespin,
it's not about uncommenting line 311, you'll have to include a line here actually …
# mRNA
mrna_description = "target_id={};tcs_id={};contig_id={};gene_id={};ID={};Parent={};".format(
id_target,
id_tcs,
id_contig,
id_gene,
id_gene,
id_gene,
)
mrna_fields = [id_contig, "MetaEuk", "mRNA", start_gene, end_gene, bitscore, strand, ".", mrna_description]
… like this …
# mRNA
mrna_description = "target_id={};tcs_id={};contig_id={};gene_id={};ID={};Parent={};".format(
id_target,
id_tcs,
id_contig,
id_gene,
id_gene,
id_gene,
)
mrna_fields = [id_contig, "MetaEuk", "mRNA", start_gene, end_gene, bitscore, strand, ".", mrna_description]
# gff_output.append("\t".join(map(str, mrna_fields)))
gff_output.append(mrna_fields)
So at least the output looks right …
scaffold10000_size3469 MetaEuk gene 518 2727 1039.0 + . target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);
scaffold10000_size3469 MetaEuk mRNA 518 2727 1039.0 + . target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);
scaffold10000_size3469 MetaEuk CDS 518 2727 1039.0 + . target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);
scaffold10000_size3469 MetaEuk exon 518 712 1039.0 + . target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);exon_id=scaffold10000_size3469_517:2726(+).exon_1
Cheers @bheimbu
I just added the functionality as an optional argument. I don't want to add it by default because I'll have to do a lot of testing that I don't have time for right now but just tested and it addresses the issue for your usage.
compile_metaeuk_identifiers.py -h
usage: compile_metaeuk_identifiers.py -d <output.codon.fas> -a <output.fas> -o <output_directory>
Running: compile_metaeuk_identifiers.py v2024.6.20 via Python v3.9.18 | /Users/jolespin/miniconda3/envs/soothsayer_env/bin/python
optional arguments:
-h, --help show this help message and exit
I/O arguments:
-d CDS, --cds CDS path/to/output.codons.fas [Required]
-a PROTEIN, --protein PROTEIN
path/to/output.fas [Required]
-o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
path/to/output_directory [Default: metaeuk_output]
-b BASENAME, --basename BASENAME
Base name for gene models [Default: gene_models]
--cds_extension CDS_EXTENSION
CDS fasta extension [Default: ffn]
--protein_extension PROTEIN_EXTENSION
Protein fasta extension [Default: faa]
--no_header Specify if MetaEuk header should not be in fasta record as description
GFF arguments:
-e EXON_COORDINATE_TYPE, --exon_coordinate_type EXON_COORDINATE_TYPE
Exon coordinate type {'low/high', 'taken_low/high'} [Default: low/high]
-m, --include_mrna Include mRNA feature in GFF output
Gene identifier arguments:
--strand_notation STRAND_NOTATION
Strand notation fo gene_id {'+/-','1/-1'} [Default: '+/-']
--no_strand Don't include strand in gene id
Assertion arguments:
--assert_identifiers_are_same_order
Assert that --cds and --protein are the same order
--assert_no_duplicate_headers
If there is an exact overlap of gene positions and strand, the gene with highest bitscore is kept. Choose this if you want to assert there are no duplicates.
Here's the usage:
compile_metaeuk_identifiers.py -a ./metaeuk.fas -d metaeuk.codon.fas -m -o with_mrna
I'll push to dev branch soon.
Feel free to reopen if you need anything else on this topic.
Sry,
I just realized that your attached zip file is empty, can you please re-upload?
Cheers @bheimbu
@bheimbu sorry about that. it's on the dev branch here: https://github.com/jolespin/veba/blob/devel/bin/scripts/compile_metaeuk_identifiers.py
I'll push to main once I get some more updates.
Hi @jolespin,
I've used your script
compile_metaeuk_identifiers.py
mentioned here. It works seamlessly, however I'm wondering where all the mRNAs have gone. See here:…from MetaEuk
…and from
compile_metaeuk_identifiers.py
I know both files are not in the same order. Nevertheless, I cannot find any mRNAs in the above file -- but I need them for downstream analysis?!
Cheers @bheimbu