josuebarrera / GenEra

genEra is a fast and easy-to-use command-line tool that estimates the age of the last common ancestor of protein-coding gene families.
GNU General Public License v3.0

Trouble with step2 #3

Closed yxj17173 closed 1 year ago

yxj17173 commented 2 years ago

Hello, I want to do a genomic phylostratigraphy analysis of vertebrate data, and I started with human data. The first step (DIAMOND) worked, and I got two temp files, "9606_Diamond_results.bout" and "tmp_9606.abc". But in the second step I got this output:

Collapsing the phylostrata that are not represented in your DIAMOND results

WARNING: The following phylostrata were collapsed due to lack of sufficient genomic data: [Homo sapiens] [Homo] [Homininae] [Hominidae] [Hominoidea] [Catarrhini] [Simiiformes] [Haplorrhini] [Primates] [Euarchontoglires] [Boreoeutheria] [Eutheria] [Theria] [Mammalia] [Amniota] [Tetrapoda] [Dipnotetrapodomorpha] [Sarcopterygii] [Euteleostomi] [Teleostomi] [Gnathostomata] [Vertebrata] [Craniata] [Chordata] [Deuterostomia] [Bilateria] [Eumetazoa] [Metazoa] [Opisthokonta] [Eukaryota] [cellular organisms]

If you want to include them in your analysis, please add the necessary taxa in a custom database (-a or -f), making sure their last common ancestor to the query species can be assigned to that specific taxonomic level in the NCBI taxonomy database

Is this error caused by how the BLAST database was made? The nr database itself works; I have used it for blastp before.

josuebarrera commented 2 years ago

Dear yxj17173,

Thank you for your interest in using GenEra! Based on the message that GenEra gave you, I would think that the error is happening in step 1. Could you check whether the 9606_Diamond_results.bout file is empty? If the file is not empty, maybe the taxids were not correctly printed in the file (there should be a 5th column with numbers).

As you suggest, the error is most likely related to the database configuration. Consider that the database structure of BLAST is different from the one that is used by DIAMOND (i.e., makeblastdb is not the same as diamond makedb). Therefore, an nr database that was configured for blastp will not work for DIAMOND.

The other thing to take into consideration is that the nr database must be correctly associated with the NCBI taxonomy by using prot.accession2taxid and the taxdump node files. Please make sure that the database was configured this way:

diamond makedb \
 --in nr \
 --db nr \
 --taxonmap prot.accession2taxid \
 --taxonnodes taxdump/nodes.dmp \
 --taxonnames taxdump/names.dmp \
 --memory-limit 100

Please let me know if the 9606_Diamond_results.bout file contains the taxids in the 5th column, so we can figure out what the problem is.
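A quick way to run both checks from the shell might look like the sketch below. It is only an illustration under stated assumptions, not part of GenEra: the helper name `check_bout` is hypothetical, the file is assumed to be tab-separated, and the 5th column is assumed to hold either a single numeric taxid or a semicolon-separated list of them.

```shell
#!/bin/sh
# Hypothetical helper (not part of GenEra): sanity-check a DIAMOND results file.
# Assumptions: tab-separated columns; column 5 holds a numeric taxid or a
# semicolon-separated list of numeric taxids.
check_bout() {
    bout="$1"

    # -s is true only if the file exists and is not empty.
    if [ ! -s "$bout" ]; then
        echo "empty_or_missing"
        return 1
    fi

    # Count rows whose 5th field is absent or does not look like taxids.
    bad=$(awk -F '\t' '$5 !~ /^[0-9]+(;[0-9]+)*$/ { n++ } END { print n + 0 }' "$bout")

    if [ "$bad" -gt 0 ]; then
        echo "bad_rows=$bad"
        return 1
    fi

    echo "ok"
}
```

Calling `check_bout 9606_Diamond_results.bout` would print `ok` if every row carries taxids in the 5th column, or the number of rows that do not.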

Best, Josué.

yxj17173 commented 2 years ago

The 9606_Diamond_results.bout file is okay, here it is:

Screenshot 2022-08-24 10 00 54

The nr.dmnd file was made with diamond makedb, not makeblastdb. What I wrote before was a mistake on my part.

yxj17173 commented 2 years ago

Also, in the generated files gene_ages, ncbi_lineages, and founder_events, 'rank' is missing.

josuebarrera commented 2 years ago

Dear Xujiang,

Sorry for the late response. As you mention, the Diamond_results.bout file is okay. Could you please send me the entire stdout of that GenEra run, so I can track the source of the error? Also, please let me know which operating system you are using, so I can replicate the issue.

Best, Josué.

yxj17173 commented 2 years ago

Dear Josue Barrera,

In my first run, step 2 failed, although the Diamond_results.bout file is okay: tax9606.txt

In my second run, I skipped step 1, but in the generated files gene_ages, ncbi_lineages, and founder_events, 'rank' is missing: tax9606_2.txt

My machine runs Linux version 3.10.0-1127.19.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)), and I used this human genome: GCF_000001405.25_GRCh37.p13_protein.faa. https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.25/

Thanks a lot.

yxj17173 commented 2 years ago

I found my mistake: I used tmp.abc instead of Diamond_results.bout for the rerun (skipping step 1). But another problem occurred:

Extracting all the lineages that mached more than 10 percent of your query proteins
sort: write failed: /tmp/sortkB5nz7: No space left on device

tax9606_5.txt

And in the generated files gene_ages, ncbi_lineages, and founder_events, 'rank' is still missing.

josuebarrera commented 2 years ago

Dear Xujiang,

Thanks for all the information, and for pointing out this error! It looks like the /tmp/ directory ran out of space because it was filled up by a sort command within step 2 of genEra. I just modified genEra so that sort writes its temporary files to the same directory that holds Diamond_results.bout and all the other temporary files (in your case, tmp_9606_[RANDOMNUM]). Could you please download this version of the genEra and Erassignment scripts and run the same analysis using them?

v1.0.3.tar.gz

And please let me know if this solves your issue, so I can upload this version of genEra as v1.0.3.
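For reference, the general technique can be sketched as follows. This is not the actual code inside genEra; it only illustrates, under assumptions, how `sort` can be pointed at a different temporary directory with its `-T` option (available in GNU and BSD sort). The helper name `sort_in_dir` and the directory name in the usage note are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not taken from genEra): run sort with its temporary
# files placed in a chosen directory instead of the default /tmp.
sort_in_dir() {
    workdir="$1"
    shift

    # Create the directory if needed; stop if that fails.
    mkdir -p "$workdir" || return 1

    # -T tells sort where to write its temporary spill files.
    # Remaining arguments (options, input files) are passed through unchanged.
    sort -T "$workdir" "$@"
}
```

For example, `sort_in_dir ./my_sort_tmp big_input.txt > sorted.txt` keeps the spill files on the same disk as the working directory. Running `df -h /tmp` beforehand shows how much space is left on /tmp.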

Cheers, Josué.

yxj17173 commented 2 years ago

Dear Josué,

I have used the new version and it's great! It is still running, with no errors so far. I already have the step 2 result (gene ages) for downstream analysis. Step 3 may take another day or two, and I will send feedback then.

Best regards, Xujiang