josuebarrera / GenEra

genEra is a fast and easy-to-use command-line tool that estimates the age of the last common ancestor of protein-coding gene families.
GNU General Public License v3.0

-a option : can we concatenate custom output with precomputed blast against nr ? #20

Open Proginski opened 11 months ago

Proginski commented 11 months ago

Hi,

I mentioned this in a comment, but it seems better to ask it as a proper question:

Suppose I ran genEra on a single proteome (-q) and now want to add some extra proteomes with -a. Since step 1 took a long time (diamond blastp against nr), I would rather not rerun it.

Is it possible to run the following?

    diamond blastp --query single_proteome --db extra_proteomes_db -o extra_Diamond_results.bout --outfmt 6 qseqid sseqid evalue bitscore --evalue ${EVALUE} --max-target-seqs 0
    # and then
    cat extra_Diamond_results.bout ${TAXID}_Diamond_results.bout > tmp
    mv tmp ${TAXID}_Diamond_results.bout
    genEra ... -a extra_prot.tsv -p ${TAXID}_Diamond_results.bout

From what I understand, this should be fine...

josuebarrera commented 11 months ago

Dear Paul,

It is possible to attach two diamond outputs together using cat and feed them to genEra. The only problem with your approach is that extra_Diamond_results.bout does not contain the fifth column with the NCBI taxonomy IDs of the extra proteomes. This column is essential for the gene age inferences performed in step 3. All of this is done automatically with -a, but genEra needs to run against the NR before integrating the extra proteomes into the analysis.

Maybe I can add an argument to skip the search against the NR and run step 1 using only the extra proteomes. Then you could manually attach both diamond results together with cat. Would that implementation be useful to you?

Best,
Josué
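
To illustrate the missing fifth column described above, here is a minimal sketch of how a taxid column could be appended to a four-column DIAMOND table by hand before concatenating. It is not genEra's own procedure. It assumes the taxid simply goes after the four columns (qseqid, sseqid, evalue, bitscore) in the tab-separated output. The mapping file `seq2taxid.tsv`, the function name `add_taxid_column`, and the output file names are hypothetical; you would have to build the mapping yourself from the extra proteomes.

```shell
# Sketch only: append an NCBI taxid as a fifth column to a 4-column DIAMOND table.
# Assumption: the mapping file is a hypothetical two-column, tab-separated table
# (subject sequence ID, NCBI taxid) and is not empty.
add_taxid_column() {
    map_file=$1     # hypothetical mapping: sseqid<TAB>taxid
    hits_file=$2    # DIAMOND output: qseqid sseqid evalue bitscore
    awk 'BEGIN { FS = OFS = "\t" }
         # First file: load the sseqid -> taxid mapping.
         NR == FNR { taxid[$1] = $2; next }
         # Second file: print hits with a known taxid; other hits are dropped.
         ($2 in taxid) { print $1, $2, $3, $4, taxid[$2] }
        ' "$map_file" "$hits_file"
}

# Possible usage (file names are illustrative):
#   add_taxid_column seq2taxid.tsv extra_Diamond_results.bout > extra_with_taxid.bout
#   cat extra_with_taxid.bout "${TAXID}_Diamond_results.bout" > combined_Diamond_results.bout
```

Hits whose subject ID is absent from the mapping are silently dropped in this sketch, so it is worth comparing line counts before and after to make sure nothing was lost unintentionally.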

Proginski commented 11 months ago

Dear Josué,

Yes, it would be!

Thanks for your work,

Paul