GenEra run fails at Step 2 - Rearranged ncbi_lineages file not found

gauravdiwan89 commented 1 year ago

Hello,

Really interesting software! I am just giving it a go and trying to determine the gene ages for the human proteome. I am using the FASTA file of canonical human proteins from UniProt and a Swiss-Prot DIAMOND database for this purpose. I had also already generated the ncbi_lineages file using ncbitax2lin. The command that I used was:

genEra -q proteomes/processed/HUMAN.fa -t 9606 -b data/swissprot -d data/taxdump -i true -o output -n 100 -r output/ncbi_lineages_2023-04-12.csv

Step 1 runs correctly, and so does the part of step 2 that creates the tmp_column_* files. However, the run fails at the part where a rearranged ncbi_lineages file is created, specifically at line 829 of genEra.

It seems that either the file is not created or there is an issue with opening it later. I do not know which for sure, as the program exits before I can check. The traceback is as follows:

STARTING STEP 2: GENERATING TAXONOMIC DATABASE FOR THE PHYLOSTRATIGRAPHIC ASSIGNMENT OF YOUR GENES
--------------------------------------------------
Using the raw "ncbi_lineages" file provided by the user. Skiping ncbitax2lin
--------------------------------------------------
Rearranging the raw "ncbi_lineages" file by taxonomic hierarchy
wget: /software/centos7/devel/anaconda/3/lib/libuuid.so.1: no version information available (required by wget)
/home/.conda/envs/genEra/bin/genEra: line 829: /home/projects/GenEra/tmp_9606_7724/tmp_arranged_output/ncbi_lineages_2023-04-12.csv: No such file or directory
awk: fatal: cannot open file `/home/projects/GenEra/tmp_9606_7724/tmp_arranged_output/ncbi_lineages_2023-04-12.csv' for reading (No such file or directory)
--------------------------------------------------
Extracting all the lineages that match more than 10 percent of your query proteins
awk: cmd. line:1: fatal: cannot open file `/home/projects/GenEra/tmp_9606_7724/tmp_arranged_output/ncbi_lineages_2023-04-12.csv' for reading (No such file or directory)
--------------------------------------------------
Collapsing the phylostrata that are not represented in your DIAMOND results
--------------------------------------------------
Generating the species-tailored database
/home/.conda/envs/genEra/bin/genEra: line 886: /home/projects/GenEra/tmp_9606_7724/tmp_9606_output/ncbi_lineages_2023-04-12.csv: No such file or directory
/home/.conda/envs/genEra/bin/genEra: line 894: output/9606_output/ncbi_lineages_2023-04-12.csv: No such file or directory
rm: cannot remove ‘/home/projects/GenEra/tmp_9606_7724/tmp_arranged_output/ncbi_lineages_2023-04-12.csv’: No such file or directory
rm: cannot remove ‘/home/projects/GenEra/tmp_9606_7724/tmp_9606_output/ncbi_lineages_2023-04-12.csv’: No such file or directory
rm: cannot remove ‘/home/projects/GenEra/tmp_9606_7724/tmp_column_*’: No such file or directory

  ERROR: The species-tailored database is empty! please send me an email to figure out what might be the issue (josue.barrera@tuebingen.mpg.de)
  Exiting

Can you please help me with this?

Thanks!
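
One way to separate "the input file was never readable" from "a later step failed to create its output" is to check the file passed to -r before starting the run. The following is a minimal sketch, not part of GenEra: `resolve_input` is a hypothetical helper name, and whether passing an absolute path changes GenEra's behaviour is not established in this thread.

```shell
# Hypothetical helper; not part of GenEra.
# Print the absolute path of a readable, non-empty file, or fail with a message.
resolve_input() {
    local path="$1"
    if [ ! -r "$path" ]; then
        echo "not readable: $path" >&2
        return 1
    fi
    if [ ! -s "$path" ]; then
        echo "empty file: $path" >&2
        return 1
    fi
    # Build the absolute path from the containing directory and the file name.
    echo "$(cd "$(dirname "$path")" && pwd)/$(basename "$path")"
}

# Usage with the paths from the report above:
#   lineages=$(resolve_input output/ncbi_lineages_2023-04-12.csv) || exit 1
#   genEra -q proteomes/processed/HUMAN.fa -t 9606 -b data/swissprot \
#          -d data/taxdump -i true -o output -n 100 -r "$lineages"
```

If the helper succeeds, the input itself is present and non-empty, which points the investigation at the rearranging step rather than at the file supplied by the user.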

josuebarrera commented 1 year ago

Dear @gauravdiwan89, Thanks for trying out GenEra! I think there is more than one problem here. One can be attributed to a silly mistake in the GenEra code and the other might have something to do with the OS you are using.

In order to rearrange the ncbi_lineages file generated by ncbitax2lin, GenEra uses wget to download the correct taxonomic hierarchy from the NCBI webpage. I think that your OS is having some problems while invoking wget:

wget: /software/centos7/devel/anaconda/3/lib/libuuid.so.1: no version information available (required by wget)

Could you try running the following command on your machine?

wget -q "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606&lvl=3&p=has_linkout&p=blast_url&p=genome_blast&lin=f&keep=1&srchmode=1&unlock" && wc -l wwwtax.cgi\?mode\=Info\&id\=9606\&lvl\=3\&p\=has_linkout\&p\=blast_url\&p\=genome_blast\&lin\=f\&keep\=1\&srchmode\=1\&unlock | cut -d' ' -f1

The stdout of this command should print 456.
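
The same check can be written so that the page is saved under a fixed name and a failed download is reported explicitly. This is only a sketch under stated assumptions: `count_lines` and `fetch_taxonomy_page` are hypothetical helper names, not GenEra functions, and the output file name is arbitrary. The line count of the page may differ from 456, so the sketch only reports it.

```shell
# Hypothetical helpers; not part of GenEra.

# Print the number of lines in a file, or 0 if it is missing or empty.
count_lines() {
    if [ -s "$1" ]; then
        wc -l < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Download the NCBI taxonomy page for a taxid to a named file
# and print its line count; return non-zero if wget fails.
fetch_taxonomy_page() {
    local taxid="$1" out="$2"
    local url="https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=${taxid}&lvl=3&p=has_linkout&p=blast_url&p=genome_blast&lin=f&keep=1&srchmode=1&unlock"
    if ! wget -q -O "$out" "$url"; then
        echo "wget failed for taxid ${taxid}" >&2
        return 1
    fi
    count_lines "$out"
}

# Usage (needs network access):
#   fetch_taxonomy_page 9606 taxonomy_9606.html
```

Saving with -O avoids having to escape the query string in the file name, and a count of 0 or an explicit failure message indicates that wget itself is the problem.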

Regarding the silly mistake in GenEra, I just created a patched version and uploaded it to GitHub. Please replace the executable genEra in your machine with this new version and let me know if you still run into trouble.

Best, Josué.

gauravdiwan89 commented 1 year ago

Thanks a lot for the quick response!!

The wget command did work, except that the stdout was 461 rather than 456.

After replacing the source code for the genEra executable, the program ran successfully!

josuebarrera commented 1 year ago

Perfect, I'm glad that everything worked out!
