hyattpd / Prodigal

Prodigal Gene Prediction Software
GNU General Public License v3.0

New version discussion: input/output formats #63

Open hyattpd opened 5 years ago

hyattpd commented 5 years ago

Discussion of input/output in new version.

What formats would you like to see supported?

Current proposal:

Does anyone want or need FASTQ support? No one's ever requested it. Prodigal currently has a crappy GenBank/EMBL sequence parser, but I'm not really sure this is a feature that should be supported (FASTA seems fine).

Any others? .xz?

How important is it to allow standard input? Would people be disappointed if they had to specify files?

Any other formats people would like to see supported?

Any other formats needed here?

tseemann commented 5 years ago

Support for masked FASTA would be useful. It is used by BLAST+ and many other tools. Lowercase bases are meant to be ignored. Could they be treated as N in Prodigal 3.0?
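A minimal sketch of what "treat lowercase as N" could look like at parse time. This is an illustration only, not Prodigal's parser: the function names `mask_lowercase` and `read_masked_fasta` are hypothetical, and the sketch assumes plain single-byte FASTA text.

```python
def mask_lowercase(seq: str, mask_char: str = "N") -> str:
    """Replace every lowercase (masked) base with mask_char."""
    return "".join(mask_char if base.islower() else base for base in seq)


def read_masked_fasta(lines):
    """Yield (header, sequence) pairs, with lowercase bases converted to N."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(mask_lowercase(line))
    if header is not None:
        yield header, "".join(chunks)


records = list(read_masked_fasta([">contig1", "ATGCatgcATGC", "ggTTaa"]))
print(records)  # [('contig1', 'ATGCNNNNATGCNNTTNN')]
```

Masking while reading keeps sequence length and coordinates unchanged, so predicted gene positions would still refer to the original contig.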

hyattpd commented 5 years ago

Yeah, this is essential, especially for doing eukaryotic gene prediction.

tseemann commented 5 years ago

Is Prodigal 3.0 (prok) the same as radigal 1.0 (euk)?

hyattpd commented 5 years ago

I'm leaning towards just calling the whole thing radigal.

oschwengers commented 5 years ago

As Torsten said:

We very often parse Prodigal's output in our pipelines instead of merely passing it on to third-party executables, so a very simple tab-separated format with the most important information would be nice, e.g.: gene id, contig, start, stop, strand, partial?, shifted?, nuc seq, aa seq

I know GFF3 is very close. But either the sequences are not included, so the ffn/faa files need to be parsed as well, or they are included and, for multi-contig files, the format gets more complex than it has to be (my personal opinion). With a simple tab-separated file, everything would be in one place in a straightforward manner, at least for proks.

Another idea would be to have everything Prodigal/Radigal can provide in a well structured JSON format. This way you have everything in place in a machine readable format that is well supported by every modern language.
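As a rough sketch of the two ideas above, the same gene record could be rendered both as a tab-separated line and as JSON. Nothing here is an existing Prodigal output format: the `Gene` class, its field names (taken from the column list suggested above), and the example record are all assumptions for illustration.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class Gene:
    # Field names mirror the columns suggested above; they are illustrative only.
    gene_id: str
    contig: str
    start: int
    stop: int
    strand: str
    partial: bool
    shifted: bool
    nuc_seq: str
    aa_seq: str


COLUMNS = [f.name for f in fields(Gene)]


def to_tsv(genes):
    """Render genes as a header line plus one tab-separated line per gene."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for gene in genes:
        writer.writerow(asdict(gene))
    return buf.getvalue()


def to_json(genes):
    """Render the same records as a JSON array of objects."""
    return json.dumps([asdict(gene) for gene in genes], indent=2)


# Made-up example record, not real Prodigal output.
genes = [Gene("contig1_1", "contig1", 1, 9, "+", False, False, "ATGAAATAA", "MK*")]
print(to_tsv(genes), end="")
print(to_json(genes))
```

Because both renderings come from one record type, a pipeline could read either without also parsing the ffn/faa files.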

tseemann commented 5 years ago

I partially agree with the TSV/TAB output but I would hope GFF or BED could be used so bedtools and samtools will work with it.

+1 for JSON format!
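For the GFF/BED point, the main thing to get right when converting is the coordinate system: GFF3 uses 1-based, inclusive coordinates, while BED uses 0-based, half-open coordinates. A minimal sketch of converting one GFF3 feature line to a 6-column BED line follows. The function name `gff3_line_to_bed` and the sample line are made up for illustration, and the BED score is simply set to 0 here because BED expects an integer from 0 to 1000.

```python
def gff3_line_to_bed(line: str) -> str:
    """Convert one 9-column GFF3 feature line to a 6-column BED line."""
    seqid, _source, _type, start, end, _score, strand, _phase, attrs = (
        line.rstrip("\n").split("\t")
    )
    # Use the ID attribute as the BED name, falling back to "." if absent.
    name = next((kv[3:] for kv in attrs.split(";") if kv.startswith("ID=")), ".")
    # GFF3 start is 1-based inclusive; BED start is 0-based, so subtract 1.
    # The end value is unchanged because BED ends are exclusive.
    bed_start = str(int(start) - 1)
    return "\t".join([seqid, bed_start, end, name, "0", strand])


# Made-up example feature line.
gff_line = "contig1\texample\tCDS\t3\t98\t10.5\t+\t0\tID=1_1;partial=00"
print(gff3_line_to_bed(gff_line))  # contig1	2	98	1_1	0	+
```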

oschwengers commented 5 years ago

What about using the "simple tab" for stdout and everything else as optional parameters? Idea: prodigal3 --input [--output ] [--prefix ] --json --gff3 --bed... Then in there is , , , etc...

It would be simple, flexible, and straightforward. Of course, the simple tab format could also be a non-default option, e.g. --tsv
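One possible reading of this proposal, sketched with Python's argparse. None of these flags exist in Prodigal; they are the options proposed in this comment. The program name `prodigal3`, the defaults, and the "output directory plus prefix plus extension" file naming are assumptions, since the placeholders in the proposal above did not survive.

```python
import argparse
import posixpath

OPTIONAL_FORMATS = ("json", "gff3", "bed", "tsv")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the hypothetical prodigal3 command line."""
    parser = argparse.ArgumentParser(prog="prodigal3")
    parser.add_argument("--input", required=True, help="input sequence file")
    parser.add_argument("--output", default=".", help="output directory (assumed)")
    parser.add_argument("--prefix", default="genes", help="output file prefix (assumed)")
    for fmt in OPTIONAL_FORMATS:
        # Each extra format is opt-in via a boolean switch.
        parser.add_argument(f"--{fmt}", action="store_true", help=f"also write {fmt}")
    return parser


def planned_outputs(args: argparse.Namespace) -> list:
    """List the files that would be written for the requested formats."""
    return [
        posixpath.join(args.output, f"{args.prefix}.{fmt}")
        for fmt in OPTIONAL_FORMATS
        if getattr(args, fmt)
    ]


args = build_parser().parse_args(
    ["--input", "genome.fna", "--output", "out", "--prefix", "s1", "--json", "--bed"]
)
print(planned_outputs(args))  # ['out/s1.json', 'out/s1.bed']
```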

ayixon commented 4 years ago

Can I feed Prodigal multiple genomes at once? Something like $ prodigal -i *.faa? How can I write the output to individual files?

tseemann commented 4 years ago

.faa files are usually reserved for peptide sequences.

I don't think Prodigal takes multiple input files at once. You can either concatenate first: cat *.faa > everything.fasta or try bash process substitution: prodigal .... <(cat *.faa)
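If the goal is one set of output files per genome, rather than one combined run, a small wrapper that calls Prodigal once per input file is another option. The sketch below uses Prodigal 2.x's -i, -o, -f, -a and -d options. The helper names, the `*.fna` nucleotide file pattern, and the output naming are assumptions, and `dry_run=True` only builds the commands without executing anything.

```python
import subprocess
from pathlib import Path
from typing import List


def prodigal_command(genome: Path, out_dir: Path) -> List[str]:
    """Build a Prodigal command writing GFF, protein and nucleotide files."""
    stem = genome.stem
    return [
        "prodigal",
        "-i", str(genome),                   # input nucleotide FASTA
        "-o", str(out_dir / f"{stem}.gff"),  # gene coordinates
        "-f", "gff",                         # output format for -o
        "-a", str(out_dir / f"{stem}.faa"),  # protein translations
        "-d", str(out_dir / f"{stem}.ffn"),  # nucleotide gene sequences
    ]


def run_all(genome_dir: Path, out_dir: Path, pattern: str = "*.fna",
            dry_run: bool = True) -> List[List[str]]:
    """Run (or, with dry_run, just list) one Prodigal command per genome."""
    commands = [prodigal_command(g, out_dir) for g in sorted(genome_dir.glob(pattern))]
    if not dry_run:
        out_dir.mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises if Prodigal exits non-zero
    return commands


cmd = prodigal_command(Path("genomes/ecoli.fna"), Path("out"))
print(" ".join(cmd))
```

Calling `run_all(Path("genomes"), Path("out"), dry_run=False)` would execute the commands, assuming `prodigal` is on the PATH.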