jolespin / veba

A modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes
GNU Affero General Public License v3.0
77 stars 9 forks source link

[Question] Where have all the mRNAs gone? #110

Closed bheimbu closed 5 months ago

bheimbu commented 5 months ago

Hi @jolespin,

I've used your script compile_metaeuk_identifiers.py mentioned here. It works seamlessly, however I'm wondering where all the mRNAs have gone. See here:

…from MetaEuk

scaffold2569_size8692   MetaEuk gene    6442    7410    344 -   .   Target_ID=UniRef90_UPI0023DCB954;TCS_ID=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441
scaffold2569_size8692   MetaEuk mRNA    6442    7410    344 -   .   Target_ID=UniRef90_UPI0023DCB954;TCS_ID=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_mRNA;Parent=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441
scaffold2569_size8692   MetaEuk exon    7231    7410    64  -   .   Target_ID=UniRef90_UPI0023DCB954;TCS_ID=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_exon_0;Parent=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_mRNA
scaffold2569_size8692   MetaEuk CDS 7231    7410    64  -   .   Target_ID=UniRef90_UPI0023DCB954;TCS_ID=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_CDS_0;Parent=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_exon_0
scaffold2569_size8692   MetaEuk exon    6886    7158    149 -   .   Target_ID=UniRef90_UPI0023DCB954;TCS_ID=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_exon_1;Parent=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_mRNA
scaffold2569_size8692   MetaEuk CDS 6886    7158    149 -   .   Target_ID=UniRef90_UPI0023DCB954;TCS_ID=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_CDS_1;Parent=UniRef90_UPI0023DCB954|scaffold2569_size8692|-|6441_exon_1

…and from compile_metaeuk_identifiers.py

scaffold10000_size3469  MetaEuk gene    518 2727    1039.0  +   .   target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);
scaffold10000_size3469  MetaEuk CDS 518 2727    1039.0  +   .   target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);
scaffold10000_size3469  MetaEuk exon    518 712 1039.0  +   .   target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);exon_id=scaffold10000_size3469_517:2726(+).exon_1
scaffold10000_size3469  MetaEuk exon    760 1704    1039.0  +   .   target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);exon_id=scaffold10000_size3469_517:2726(+).exon_2
scaffold10000_size3469  MetaEuk exon    1762    2055    1039.0  +   .   target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);exon_id=scaffold10000_size3469_517:2726(+).exon_3
scaffold10000_size3469  MetaEuk exon    2109    2507    1039.0  +   .   target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);exon_id=scaffold10000_size3469_517:2726(+).exon_4

I know both files are not in the same order. Nevertheless, I cannot find any mRNAs in the above file -- but I need them for downstream analysis?!

Cheers @bheimbu

jolespin commented 5 months ago

Hmmm I'll take a look at this today. IIRC the design was to mimic the output of Prodigal so I could merge GFFs together but I'll need to double check. I'm looking at the source code and I ended up commenting out the mRNA output line but they are parsed.

You'll have uncomment line 311 here:

https://github.com/jolespin/veba/blob/main/bin/scripts/compile_metaeuk_identifiers.py

I vaguely remember having an issue when running something downstream when including mRNA but I'll run some tests. If there's no issue then I'll restore this functionality. Apologies for the inconvenience.

bheimbu commented 5 months ago

Hi @jolespin,

it's not about uncommenting line 311, you'll have to include a line here actually …

            # mRNA
            mrna_description = "target_id={};tcs_id={};contig_id={};gene_id={};ID={};Parent={};".format(
                id_target,
                id_tcs,
                id_contig,
                id_gene,
                id_gene,
                id_gene,
            )
            mrna_fields = [id_contig, "MetaEuk", "mRNA", start_gene, end_gene, bitscore, strand, ".", mrna_description]

… like this …

            # mRNA
            mrna_description = "target_id={};tcs_id={};contig_id={};gene_id={};ID={};Parent={};".format(
                id_target,
                id_tcs,
                id_contig,
                id_gene,
                id_gene,
                id_gene,
            )
            mrna_fields = [id_contig, "MetaEuk", "mRNA", start_gene, end_gene, bitscore, strand, ".", mrna_description]

            # gff_output.append("\t".join(map(str, mrna_fields)))
            gff_output.append(mrna_fields)

So at least the output looks right …

scaffold10000_size3469  MetaEuk gene    518 2727    1039.0  +   .   target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);
scaffold10000_size3469  MetaEuk mRNA    518 2727    1039.0  +   .   target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);
scaffold10000_size3469  MetaEuk CDS 518 2727    1039.0  +   .   target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);
scaffold10000_size3469  MetaEuk exon    518 712 1039.0  +   .   target_id=UniRef90_UPI0023DC3A1D;tcs_id=UniRef90_UPI0023DC3A1D|scaffold10000_size3469|+;contig_id=scaffold10000_size3469;gene_id=scaffold10000_size3469_517:2726(+);ID=scaffold10000_size3469_517:2726(+);Parent=scaffold10000_size3469_517:2726(+);exon_id=scaffold10000_size3469_517:2726(+).exon_1

Cheers @bheimbu

jolespin commented 5 months ago

I just added the functionality as an optional argument. I don't want to add it by default because I'll have to do a lot of testing that I don't have time for right now but just tested and it addresses the issue for your usage.

compile_metaeuk_identifiers.py -h
usage: compile_metaeuk_identifiers.py -d <output.codon.fas> -a <output.fas> -o <output_directory>

    Running: compile_metaeuk_identifiers.py v2024.6.20 via Python v3.9.18 | /Users/jolespin/miniconda3/envs/soothsayer_env/bin/python

optional arguments:
  -h, --help            show this help message and exit

I/O arguments:
  -d CDS, --cds CDS     path/to/output.codons.fas [Required]
  -a PROTEIN, --protein PROTEIN
                        path/to/output.fas [Required]
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        path/to/output_directory [Default: metaeuk_output]
  -b BASENAME, --basename BASENAME
                        Base name for gene models [Default: gene_models]
  --cds_extension CDS_EXTENSION
                        CDS fasta extension [Default: ffn]
  --protein_extension PROTEIN_EXTENSION
                        Protein fasta extension [Default: faa]
  --no_header           Specify if MetaEuk header should not be in fasta record as description

GFF arguments:
  -e EXON_COORDINATE_TYPE, --exon_coordinate_type EXON_COORDINATE_TYPE
                        Exon coordinate type {'low/high', 'taken_low/high'} [Default: low/high]
  -m, --include_mrna    Include mRNA feature in GFF output

Gene identifier arguments:
  --strand_notation STRAND_NOTATION
                        Strand notation fo gene_id {'+/-','1/-1'} [Default: '+/-']
  --no_strand           Don't include strand in gene id

Assertion arguments:
  --assert_identifiers_are_same_order
                        Assert that --cds and --protein are the same order
  --assert_no_duplicate_headers
                        If there is an exact overlap of gene positions and strand, the gene with highest bitscore is kept. Choose this if you want to assert there are no duplicates.

Here's the usage:

compile_metaeuk_identifiers.py -a ./metaeuk.fas  -d metaeuk.codon.fas -m -o with_mrna

I'll push to dev branch soon.

Feel free to reopen if you need anything else on this topic.

bheimbu commented 5 months ago

Sry,

I just realized that your attached zip file is empty, can you please re-upload?

Cheers @bheimbu

jolespin commented 4 months ago

@bheimbu sorry about that. it's on the dev branch here: https://github.com/jolespin/veba/blob/devel/bin/scripts/compile_metaeuk_identifiers.py

I'll push to main once I get some more updates.