Closed jolespin closed 3 years ago
Hi,
Thank you for the helpful feedback!
Concerning GFF, @Andy-B-123 has written a Python script that allows converting them to a GFF format. See here.
As for your questions:
createdb
command creates a file in the format of MMseqs2/MetaEuk by the name you give it.AYO43713.1|NODE_2835_length_4878_cov_5.258553|+
). I think it is better to take that into account when numbering.Concerning GFF, @Andy-B-123 has written a Python script that allows converting them to a GFF format. See here.
I'll definitely check this out. Do you have any plans to integrate an official script that does this? I'm hoping that people start paying more attention to euks with all these new tools coming out addressing different issues and feel like this type of output will be requested by others as well. MetaEuk
is definitely by far the most streamlined and easy to use tool (thank you again for making this!) so I'm assuming a lot of people will catch on.
MMseqs2 and MetaEuk do not require a specific file ending. The createdb command creates a file in the format of MMseqs2/MetaEuk by the name you give it.
K this good to know. I guess it was more of a "stylistic" question and asking what you as the developer generally uses. Do you just use the actual fasta name since it adds a bunch of files that suffixed with that basename? Similar to doing bowtie2-build input.fasta input.fasta
makes a bunch of input.fasta.bt2.*
index files.
You can name as you see fit, of course. Just note that the unique combination would be the T+C+S (so in the example you gave AYO43713.1|NODE_2835_length_4878_cov_5.258553|+). I think it is better to take that into account when numbering.
Apologies but can you elaborate a little more on this?
I am sorry. I think I wasn't clear enough. The TCS refers to the 'Target', 'Contig' and 'Strand' combination. We write about it here: https://github.com/soedinglab/metaeuk#terminology. As more than one prediction can be made on a contig and more than one contig can have a prediction based on the same target, the unique identifier is a combination of the three.
I now added GFF output to MetaEuk. I still need to test it a bit but maybe you'd like to have a look too? It is not part of a release so it's not available through conda yet. I followed @Andy-B-123's idea about the meaning of 'exon' and 'CDS'. I still haven't added any description of it to the README.
I just modified it a bit because I saw that start should always be smaller than end in GFF.
Please see release 5-34c21f2
Not sure if this helps but I've made script that does the following:
1) Simplifies the identifiers; 2) Provides custom extensions for cds and protein; 3) Recreates the GFF file entirely and add exon identifiers
I've done this to more easily integrate MetaEuk output into my pipeline that is dependent on prodigal.
Just in case this is useful for anyone I will post below: https://github.com/jolespin/pipelines/blob/master/scripts/compile_metaeuk_identifiers.py
EDIT:
The newest edit mimics the style of BUSCO v5. Also added support for targets with |
characters (e.g., MMETSP) and options to add strand info to handle instances where there is an overlap on gene coords but on different strands. If there are overlaps of gene coords on the same strand, then highest bitscore is taken.
gene_positive_strand = "Transcript_22646|m.42971|NODE_14273_length_2360_cov_2.257701|+|428|5.129e-119|1|98|742|98[98]:742[742]:645[645]"
gene_negative_strand= "Transcript_22646|m.42972|NODE_14273_length_2360_cov_2.257701|-|290|1.787e-77|2|98|742|742[742]:356[356]:387[387]|283[283]:98[98]:186[186]"
I'm trying to integrate this into some pipelines at JCVI and I have to make a lot of post hoc "helper scripts" to do some post-processing.
I wanted to run a few minor feature requests by you:
Questions:
nr
would you use file extensionnr.mmseqs
like w/ diamond who it would benr.dmnd
?AYO43713.1|NODE_2835_length_4878_cov_5.258553|+|1910|0|1|1386|4418|1386[1386]:4418[4418]:3033[3033]
->NODE_2835_length_4878_cov_5.258553_1