linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.
http://bcb.unl.edu/dbCAN2
GNU General Public License v3.0

database setting #15

Closed ucassee closed 5 years ago

ucassee commented 5 years ago

I used git clone to download run_dbcan, but I don't know how to set the database path in run_dbcan.py. When I run it, I get this error:

db/ ERROR: The database directory does not exist

Could you help me?

Thanks in advance!

linnabrown commented 5 years ago

Please refer to the README, where you can find the download URL for the relevant databases: http://bcb.unl.edu/dbCAN2/download/Databases/

Databases and formatting [required!]:

CAZyDB.07312018.fa: use diamond makedb --in CAZyDB.07312018.fa -d CAZy

[PPR]: included in Hotpep

dbCAN-HMMdb-V7.txt: first use mv dbCAN-HMMdb-V7.txt dbCAN.txt, then use hmmpress dbCAN.txt

tcdb.fa: use diamond makedb --in tcdb.fa -d tcdb

tf-1.hmm: use hmmpress tf-1.hmm

tf-2.hmm: use hmmpress tf-2.hmm

stp.hmm: use hmmpress stp.hmm
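
The formatting steps listed above can be collected into one small script. This is only a sketch: the `format_commands` and `run_all` helpers are hypothetical names, not part of run_dbcan, and the default file names are the ones quoted in the list (newer database releases use different names, so they are parameters here).

```python
import subprocess
from pathlib import Path


def format_commands(db_dir, cazy_fasta="CAZyDB.07312018.fa",
                    hmm_db="dbCAN-HMMdb-V7.txt"):
    """Return the formatting commands from the list above, in order.

    Hypothetical helper; file names default to those quoted in the thread.
    """
    db = Path(db_dir)
    return [
        ["diamond", "makedb", "--in", str(db / cazy_fasta), "-d", str(db / "CAZy")],
        ["mv", str(db / hmm_db), str(db / "dbCAN.txt")],
        ["hmmpress", str(db / "dbCAN.txt")],
        ["diamond", "makedb", "--in", str(db / "tcdb.fa"), "-d", str(db / "tcdb")],
        ["hmmpress", str(db / "tf-1.hmm")],
        ["hmmpress", str(db / "tf-2.hmm")],
        ["hmmpress", str(db / "stp.hmm")],
    ]


def run_all(db_dir, dry_run=True, **names):
    """Print each command; execute it only when dry_run is False."""
    for cmd in format_commands(db_dir, **names):
        print(" ".join(cmd))
        if not dry_run:
            # Requires diamond and hmmpress to be on PATH.
            subprocess.run(cmd, check=True)
```

Calling `run_all("db")` only prints the commands; pass `dry_run=False` once the downloaded files are in place.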
ucassee commented 5 years ago

@linnabrown Thanks for your reply~ I downloaded all of these files to /home/zhouyl/database/dbCAN2.

-rw-r--r-- 1 zhouyl microbial 688041393 Aug 9 10:33 CAZyDB.07312019.fa
-rw-r--r-- 1 zhouyl microbial 828137 Aug 9 05:01 CAZyDB.07312019.fam.subfam.ec.txt
-rw-r--r-- 1 zhouyl microbial 710229619 Sep 4 11:27 CAZy.dmnd
-rw-r--r-- 1 zhouyl microbial 94320447 Aug 9 04:02 dbCAN-HMMdb-V8.txt
-rw-r--r-- 1 zhouyl microbial 94320447 Sep 4 11:35 dbCAN.txt
-rw-r--r-- 1 zhouyl microbial 17193669 Sep 4 11:37 dbCAN.txt.h3f
-rw-r--r-- 1 zhouyl microbial 32469 Sep 4 11:37 dbCAN.txt.h3i
-rw-r--r-- 1 zhouyl microbial 39282326 Sep 4 11:37 dbCAN.txt.h3m
-rw-r--r-- 1 zhouyl microbial 46146645 Sep 4 11:37 dbCAN.txt.h3p
-rw-r--r-- 1 zhouyl microbial 11104988 Dec 23 2018 stp.hmm
-rw-r--r-- 1 zhouyl microbial 2863601 Sep 4 11:45 stp.hmm.h3f
-rw-r--r-- 1 zhouyl microbial 12315 Sep 4 11:45 stp.hmm.h3i
-rw-r--r-- 1 zhouyl microbial 4601853 Sep 4 11:45 stp.hmm.h3m
-rw-r--r-- 1 zhouyl microbial 5411888 Sep 4 11:45 stp.hmm.h3p
-rw-r--r-- 1 zhouyl microbial 8255344 Sep 4 11:43 tcdb.dmnd
-rw-r--r-- 1 zhouyl microbial 8122156 Dec 31 2017 tcdb.fa
-rw-r--r-- 1 zhouyl microbial 9273383 Jul 22 2018 tf-1.hmm
-rw-r--r-- 1 zhouyl microbial 2373341 Sep 4 11:44 tf-1.hmm.h3f
-rw-r--r-- 1 zhouyl microbial 10177 Sep 4 11:44 tf-1.hmm.h3i
-rw-r--r-- 1 zhouyl microbial 3833487 Sep 4 11:44 tf-1.hmm.h3m
-rw-r--r-- 1 zhouyl microbial 4516201 Sep 4 11:44 tf-1.hmm.h3p
-rw-r--r-- 1 zhouyl microbial 6362734 Apr 4 19:31 tf-2.hmm
-rw-r--r-- 1 zhouyl microbial 2175741 Sep 4 11:45 tf-2.hmm.h3f
-rw-r--r-- 1 zhouyl microbial 8518 Sep 4 11:45 tf-2.hmm.h3i
-rw-r--r-- 1 zhouyl microbial 2609575 Sep 4 11:45 tf-2.hmm.h3m
-rw-r--r-- 1 zhouyl microbial 3146860 Sep 4 11:45 tf-2.hmm.h3p

But I don't know how to pass this path to run_dbcan.py. Where should I put these databases so that run_dbcan.py can find them?

I used --db_dir to specify the database path, and it works. Thanks~
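
Before passing a directory to --db_dir, it can help to confirm that the formatted files are actually there. The sketch below checks for the file names that appear in the listing above; which files run_dbcan.py itself requires is an assumption, and `missing_db_files` is a hypothetical helper, not part of run_dbcan.

```python
from pathlib import Path

# hmmpress writes four index files next to each HMM file.
HMM_SUFFIXES = ("", ".h3f", ".h3i", ".h3m", ".h3p")

# Assumed set of required files, taken from the directory listing above.
EXPECTED = ["CAZy.dmnd", "tcdb.dmnd"] + [
    name + suffix
    for name in ("dbCAN.txt", "tf-1.hmm", "tf-2.hmm", "stp.hmm")
    for suffix in HMM_SUFFIXES
]


def missing_db_files(db_dir):
    """Return the expected database files that are absent from db_dir."""
    db = Path(db_dir)
    if not db.is_dir():
        return list(EXPECTED)
    return [name for name in EXPECTED if not (db / name).is_file()]
```

An empty list from `missing_db_files("/home/zhouyl/database/dbCAN2")` would suggest the directory is ready to be used with --db_dir.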

ucassee commented 5 years ago

I got a new error:

Preparing overview table from hmmer, hotpep and diamond output...
Traceback (most recent call last):
  File "/home/zhouyl/software/run_dbcan/run_dbcan.py", line 553, in <module>
    for i in range(1,len(arr_hotpep)):
NameError: name 'arr_hotpep' is not defined

But I do get a result file, shown below. If I just want to annotate CAZymes, can I ignore this error?

Gene ID	CAZy ID	% Identical	Length	Mismatches	Gap Open	Gene Start	Gene End	CAZy Start	CAZy End	E Value	Bit Score
B9T1B5_NODE_220_length_60920_cov_7.659610_220_37	AHO16406.1|GH13_11|	56.2	365	154	4	8	368	5	367	3.2e-110	401.7
B9T1B5_NODE_251_length_57415_cov_6.903348_251_23	QBH12981.1|GT51|	45.9	726	339	8	4	724	104	780	4.7e-173	611.3

linnabrown commented 5 years ago

No, you can't ignore it. The file exists, so you may need to give the user running the script read/write permission on it.

Try this: sudo chmod 755 -R <directory-name>

Good night.

ucassee commented 5 years ago

I ran it with:

python /home/zhouyl/software/run_dbcan/run_dbcan.py B10.faa protein --out_dir B10 --db_dir /home/zhouyl/software/run_dbcan/db

These are the output files:

-rw-r--r-- 1 zhouyl microbial 1599 Sep 5 09:54 diamond.out
-rw-r--r-- 1 zhouyl microbial 163501 Sep 5 09:54 h.out
-rw-r--r-- 1 zhouyl microbial 239219 Sep 5 09:52 signalp.neg
-rw-r--r-- 1 zhouyl microbial 242579 Sep 5 09:52 signalp.pos
-rw-r--r-- 1 zhouyl microbial 910837 Sep 5 09:51 uniInput

Should I manually combine diamond.out and h.out?

I tried running hmmer, diamond, and hotpep separately, but hmmer and hotpep failed with errors:

python: can't open file 'hmmscan-parser.py': [Errno 2] No such file or directory
Waiting on signalP
SignalP complete
Preparing overview table from hmmer, hotpep and diamond output...
Traceback (most recent call last):
  File "/home/zhouyl/software/run_dbcan/run_dbcan.py", line 548, in <module>
    for i in range(1,len(arr_diamond)):
NameError: name 'arr_diamond' is not defined

  File "/home/zhouyl/software/run_dbcan/run_dbcan.py", line 553, in <module>
    for i in range(1,len(arr_hotpep)):
NameError: name 'arr_hotpep' is not defined

ucassee commented 5 years ago

I still don't know which folder's permissions I should change, and I don't have root privileges, so I installed all the dependencies in my own path. I then tried running run_dbcan.py from inside its own folder, with all environment variables set correctly. The output files are as follows:

-rw-r--r-- 1 zhouyl microbial 1712 Sep 5 10:44 diamond.out
-rw-r--r-- 1 zhouyl microbial 3084 Sep 5 10:44 hmmer.out
-rw-r--r-- 1 zhouyl microbial 1932 Sep 5 10:44 Hotpep.out
-rw-r--r-- 1 zhouyl microbial 1805 Sep 5 10:44 overview.txt
-rw-r--r-- 1 zhouyl microbial 74430 Sep 5 10:44 signalp.out
-rw-r--r-- 1 zhouyl microbial 910837 Sep 5 10:41 uniInput

But this is really inconvenient. How can I add hmmscan-parser.py, CGCFinder.py, and the other helper files to my bashrc so that run_dbcan.py can call them? I tried export PATH=, but it failed. I can only call these files when I run run_dbcan.py from inside its folder.

In the overview.txt file, you summarize the hmmer, diamond, and hotpep results. Could you give me some advice on setting a threshold to filter the results? I just want to know the CAZymes in my metagenome.

linnabrown commented 5 years ago

Yes, you need to run our script from the directory that contains run_dbcan.py. The results below are correct. Among them, overview.txt is the final file, summarizing all three outputs.

-rw-r--r-- 1 zhouyl microbial 1712 Sep 5 10:44 diamond.out
-rw-r--r-- 1 zhouyl microbial 3084 Sep 5 10:44 hmmer.out
-rw-r--r-- 1 zhouyl microbial 1932 Sep 5 10:44 Hotpep.out
-rw-r--r-- 1 zhouyl microbial 1805 Sep 5 10:44 overview.txt
-rw-r--r-- 1 zhouyl microbial 74430 Sep 5 10:44 signalp.out
-rw-r--r-- 1 zhouyl microbial 910837 Sep 5 10:41 uniInput

Please enter the run_dbcan directory before running our script. That way you do not need any other configuration for hmmscan-parser.py and CGCFinder.py (both are called by run_dbcan.py).

cd /home/zhouyl/software/run_dbcan
python run_dbcan.py B10.faa protein --out_dir B10 --db_dir /home/zhouyl/software/run_dbcan/db

This is not inconvenient: you only need to enter the directory that contains run_dbcan.py first (cd /home/zhouyl/software/run_dbcan).
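
For anyone who would rather not cd by hand each time, one possible workaround is a small wrapper that launches the script with its own folder as the working directory, so the helper scripts are found relative to it. This is a sketch, not something the maintainers provide: `build_run` and `run_dbcan_from_anywhere` are hypothetical names, and it assumes run_dbcan.py accepts absolute paths for the input file, --out_dir, and --db_dir.

```python
import subprocess
import sys
from pathlib import Path


def build_run(script_dir, input_file, out_dir, db_dir, seq_type="protein"):
    """Build the command and working directory for one run.

    Paths are made absolute because the working directory changes
    to script_dir when the command is executed.
    """
    cwd = Path(script_dir).resolve()
    cmd = [
        sys.executable, "run_dbcan.py",
        str(Path(input_file).resolve()), seq_type,
        "--out_dir", str(Path(out_dir).resolve()),
        "--db_dir", str(Path(db_dir).resolve()),
    ]
    return cmd, cwd


def run_dbcan_from_anywhere(script_dir, input_file, out_dir, db_dir,
                            seq_type="protein"):
    """Run run_dbcan.py with its own folder as the working directory."""
    cmd, cwd = build_run(script_dir, input_file, out_dir, db_dir, seq_type)
    return subprocess.run(cmd, cwd=cwd, check=True)
```

With this, `run_dbcan_from_anywhere("/home/zhouyl/software/run_dbcan", "B10.faa", "B10", "/home/zhouyl/software/run_dbcan/db")` could be called from the folder that holds B10.faa.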

Alternatively, you could use our fully prepared Docker version of run_dbcan.py to make things easier. You don't need to download any extra files or do any convoluted configuration. Here you go.

1. Make sure docker is installed on your computer successfully.
2. docker pull haidyi/run_dbcan:latest
3. docker run --name <preferred_name> -v <host-path>:<container-path> -it haidyi/run_dbcan:latest python run_dbcan.py <input_file> [params] --out_dir <output_dir>

Thank you for your feedback. If you have any questions, please do not hesitate to reply.

ucassee commented 5 years ago

@linnabrown Thanks for your patience. I want to know the default thresholds of these three methods (hmmer, diamond, hotpep). How can I modify the threshold settings? I really hope you can give me some advice on filtering the results in overview.txt. For example, if a protein is annotated only by the DIAMOND method, can I trust this annotation? If a protein gets different annotations from different methods, which one should I trust?

linnabrown commented 5 years ago
  1. Please refer to the README, which lists all the threshold parameters (e.g. HMMER E Value, HMMER Coverage value). Some of them are pasted here:

    [--hmm_eval] - optional, allows user to set the HMMER E Value. Default = 1e-15.
    [--hmm_cov] - optional, allows user to set the HMMER Coverage value. Default = 0.35.
    [--hmm_cpu] - optional, allows user to set how many CPU cores HMMER can use. Default = 1.
    [--hot_hits] - optional, allows user to set the Hotpep Hits value. Default = 6.
    [--hot_freq] - optional, allows user to set the Hotpep Frequency value. Default = 2.6.
    [--hot_cpu] - optional, allows user to set how many CPU cores Hotpep can use. Default = 3.
    [--tf_eval] - optional, allows user to set the tf.hmm HMMER E Value. Default = 1e-4.
    [--tf_cov] - optional, allows user to set the tf.hmm HMMER Coverage value. Default = 0.35.
    [--tf_cpu] - optional, allows user to set how many CPU cores HMMER can use for tf.hmm. Default = 1.
    [--stp_eval] - optional, allows user to set the stp.hmm HMMER E Value. Default = 1e-4.
    [--stp_cov] - optional, allows user to set the stp.hmm HMMER Coverage value. Default = 0.3.
    [--stp_cpu] - optional, allows user to set how many CPU cores HMMER can use for stp.hmm. Default =


2.  No tool can give you 100% accurate results. Our tool integrates three tools, each with our own thresholds, to help users predict CAZyme families. DIAMOND, HMMER, and Hotpep sometimes predict different families, so you can keep the family that all three tools predicted, or the one that at least two of them predicted.
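
The "all three, or at least two" rule above can be applied to overview.txt with a short script. This is a minimal sketch under several assumptions that should be checked against your own file: that overview.txt is tab-separated with columns named Gene ID, HMMER, Hotpep, and DIAMOND; that multiple families in a cell are joined with "+"; that positions or scores appear in parentheses; and that "N" or "-" marks no prediction. Collapsing subfamilies (GH13_11 to GH13) before comparing is a choice made here, not something the thread prescribes.

```python
import csv
import io
import re
from collections import Counter

TOOL_COLUMNS = ("HMMER", "Hotpep", "DIAMOND")  # assumed header names


def families(cell):
    """Return the family names in one cell, e.g. 'GH13_11(5-367)' -> {'GH13'}."""
    if cell.strip() in ("", "N", "-"):
        return set()
    found = set()
    for part in cell.split("+"):
        name = re.sub(r"\(.*?\)", "", part).strip()  # drop '(start-end)' or '(score)'
        name = name.split("_")[0]                    # drop the subfamily suffix
        if name:
            found.add(name)
    return found


def consensus(overview_text, min_tools=2):
    """Map each gene ID to the families predicted by at least min_tools tools."""
    reader = csv.DictReader(io.StringIO(overview_text), delimiter="\t")
    kept = {}
    for row in reader:
        votes = Counter()
        for column in TOOL_COLUMNS:
            votes.update(families(row.get(column) or ""))
        agreed = sorted(f for f, n in votes.items() if n >= min_tools)
        if agreed:
            kept[row["Gene ID"]] = agreed
    return kept
```

Reading the file with `consensus(open("overview.txt").read())` keeps two-tool agreements; `min_tools=3` keeps only the families all three tools agree on.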
ucassee commented 5 years ago

@linnabrown Thanks for your reply. Since I am doing metagenome analysis, I can't verify all of these annotations, so I just want to get a relatively accurate result through the threshold settings. Perhaps the default settings are already the most suitable ones according to your tests. I would also like to know which method (hmmer, diamond, hotpep) uses the tf.hmm HMMER E Value. Its default is only 1e-4, which is permissive.

linnabrown commented 5 years ago

tf.hmm is used for HMMER.

You can test it yourself to choose the best threshold, because the best threshold choice is the conclusion of experiments on bacterial genomes (https://doi.org/10.1093/nar/gky418).

ucassee commented 5 years ago

What are the differences among --tf_eval, --hmm_eval, and --stp_eval?

linnabrown commented 5 years ago

--hmm_eval is used for HMMER, one of the three CAZyme annotation tools. --tf_eval and --stp_eval are used by the CGC algorithm, a separate algorithm we designed for finding CAZyme gene clusters (CGCs). If you want to know more details about my code, please refer to our paper and README.md.
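
To keep the two groups of thresholds apart when scripting runs, the flags can be grouped by the stage they affect. This is an illustrative sketch only: the flag names and default values are the ones quoted earlier in this thread, while `threshold_args` and the two dictionaries are hypothetical names, not part of run_dbcan.

```python
# Defaults as quoted from the README earlier in this thread.
CAZYME_FLAGS = {"--hmm_eval": "1e-15", "--hmm_cov": "0.35"}  # HMMER CAZyme annotation
CGC_FLAGS = {"--tf_eval": "1e-4", "--stp_eval": "1e-4"}      # CGC finding (tf.hmm, stp.hmm)


def threshold_args(**overrides):
    """Turn overrides such as hmm_eval='1e-20' into command-line arguments."""
    known = {**CAZYME_FLAGS, **CGC_FLAGS}
    args = []
    for key, value in overrides.items():
        flag = "--" + key
        if flag not in known:
            raise ValueError(f"unknown threshold flag: {flag}")
        args += [flag, str(value)]
    return args
```

For example, `threshold_args(hmm_eval="1e-20")` tightens only the CAZyme annotation step and leaves the CGC thresholds at their defaults.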

ucassee commented 4 years ago

Hi @linnabrown, from the dbCAN annotation I only get the CAZyme class number. If I want to know the detailed annotation, do you know where I can get it? Thanks in advance~

yinlabniu commented 4 years ago

Please see this file: http://bcb.unl.edu/dbCAN2/download/Databases/CAZyDB.07312019.fam-activities.txt

