[Feature Request] Custom protein and codon file extensions (and GFF?) and few minor questions

jolespin commented 3 years ago

I'm trying to integrate this into some pipelines at JCVI and I have to make a lot of post hoc "helper scripts" to do some post-processing.

I wanted to run a few minor feature requests by you:

Custom codon (DNA) file extensions (e.g. ffn)
Custon protein file extensions (e.g. faa)
Also include a GFF file so it can be more easily integrated into pipelines that use prodigal, STAR/HISAT2, and featureCounts. It wouldn't have to replace the headerMap.tsv but it could be included in addition to that even with a --gff argument. At the very least, a test headerMap.tsv -> GFF script that gets installed with the package.

Questions:

What is the preferred file extension for. MMSEQS2 databases? For example, if you have an uncompressed protein fasta file called nr would you use file extension nr.mmseqs like w/ diamond who it would be nr.dmnd?
Does MMSEQS2 return multiple isoforms for the same gene?
Would you advise against renaming some of the proteins in a more filename friendly format like prodigal (potentially because of [2] above)? For example, AYO43713.1|NODE_2835_length_4878_cov_5.258553|+|1910|0|1|1386|4418|1386[1386]:4418[4418]:3033[3033] -> NODE_2835_length_4878_cov_5.258553_1

elileka commented 3 years ago

Hi,

Thank you for the helpful feedback!

Concerning GFF, @Andy-B-123 has written a Python script that allows converting them to a GFF format. See here.

As for your questions:

MMseqs2 and MetaEuk do not require a specific file ending. The createdb command creates a file in the format of MMseqs2/MetaEuk by the name you give it.
MetaEuk does not report isoformas of the same gene. This means that if E1+E2+E3 matched some reference protein but also E1+E3 matched another reference protein (an isoform) than only E1+E2+E3 will be reported (assuming it has a better E-value than E1+E3). This occurs at the redundancy reduction stage.
You can name as you see fit, of course. Just note that the unique combination would be the T+C+S (so in the example you gave AYO43713.1|NODE_2835_length_4878_cov_5.258553|+). I think it is better to take that into account when numbering.

jolespin commented 3 years ago

Concerning GFF, @Andy-B-123 has written a Python script that allows converting them to a GFF format. See here.

I'll definitely check this out. Do you have any plans to integrate an official script that does this? I'm hoping that people start paying more attention to euks with all these new tools coming out addressing different issues and feel like this type of output will be requested by others as well. MetaEuk is definitely by far the most streamlined and easy to use tool (thank you again for making this!) so I'm assuming a lot of people will catch on.

MMseqs2 and MetaEuk do not require a specific file ending. The createdb command creates a file in the format of MMseqs2/MetaEuk by the name you give it.

K this good to know. I guess it was more of a "stylistic" question and asking what you as the developer generally uses. Do you just use the actual fasta name since it adds a bunch of files that suffixed with that basename? Similar to doing bowtie2-build input.fasta input.fasta makes a bunch of input.fasta.bt2.* index files.

You can name as you see fit, of course. Just note that the unique combination would be the T+C+S (so in the example you gave AYO43713.1|NODE_2835_length_4878_cov_5.258553|+). I think it is better to take that into account when numbering.

Apologies but can you elaborate a little more on this?

elileka commented 3 years ago

I am sorry. I think I wasn't clear enough. The TCS refers to the 'Target', 'Contig' and 'Strand' combination. We write about it here: https://github.com/soedinglab/metaeuk#terminology. As more than one prediction can be made on a contig and more than one contig can have a prediction based on the same target, the unique identifier is a combination of the three.

I now added GFF output to MetaEuk. I still need to test it a bit but maybe you'd like to have a look too? It is not part of a release so it's not available through conda yet. I followed @Andy-B-123's idea about the meaning of 'exon' and 'CDS'. I still haven't added any description of it to the README.

I just modified it a bit because I saw that start should always be smaller than end in GFF.

elileka commented 3 years ago

Please see release 5-34c21f2

jolespin commented 2 years ago

Not sure if this helps but I've made script that does the following:

1) Simplifies the identifiers; 2) Provides custom extensions for cds and protein; 3) Recreates the GFF file entirely and add exon identifiers

I've done this to more easily integrate MetaEuk output into my pipeline that is dependent on prodigal.

Just in case this is useful for anyone I will post below: https://github.com/jolespin/pipelines/blob/master/scripts/compile_metaeuk_identifiers.py

EDIT: The newest edit mimics the style of BUSCO v5. Also added support for targets with | characters (e.g., MMETSP) and options to add strand info to handle instances where there is an overlap on gene coords but on different strands. If there are overlaps of gene coords on the same strand, then highest bitscore is taken.

gene_positive_strand = "Transcript_22646|m.42971|NODE_14273_length_2360_cov_2.257701|+|428|5.129e-119|1|98|742|98[98]:742[742]:645[645]"
gene_negative_strand= "Transcript_22646|m.42972|NODE_14273_length_2360_cov_2.257701|-|290|1.787e-77|2|98|742|742[742]:356[356]:387[387]|283[283]:98[98]:186[186]"

soedinglab / metaeuk

[Feature Request] Custom protein and codon file extensions (and GFF?) and few minor questions #32