linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.
http://bcb.unl.edu/dbCAN2
GNU General Public License v3.0

No hits from DIAMOND #170

Open JosieMainwaring opened 3 months ago

JosieMainwaring commented 3 months ago

Hi all,

I have annotated the example E. coli K12 genome & my genome of interest using run_dbcan on a virtualbox linux system and I had no issues with errors in the code, and the output data files were produced as expected. However, in both cases, there are zero hits in the column for DIAMOND (all just '-' entries, and no hits with all 3 tools), which is unexpected for both genomes.

Does anyone know what might be causing this?

For reference, the diamond version I'm running is 2.0.11

Any help appreciated!

JosieMainwaring commented 3 months ago

Hi, I'm still having this issue. I've tried building the databases using dbcan_build --cpus 8 --db-dir db --clean or by the Database Installation Command, and the problem persists, even though it seems like diamond has been installed. The diamond.out files are not populated. Any help please?
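
One quick way to narrow this down is to list which files in the output directory (and in the database directory) are actually empty. The following is a minimal sketch, not part of run_dbcan: the helper name `empty_files` is hypothetical, and it assumes only that results and databases are stored as regular files directly inside the directory you pass in.

```python
from pathlib import Path


def empty_files(directory):
    """Return the sorted names of zero-byte regular files directly inside `directory`."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_size == 0
    )


# Example call (directory names are placeholders for your own paths):
#   print(empty_files("output_EscheriaColiK12MG1655"))
#   print(empty_files("db"))
```

If a database file shows up as empty, the problem is likely in the database build rather than in the search step itself.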

linnabrown commented 2 months ago

The DIAMOND version here is 2.1.9. I just created a new environment and installed dbCAN following our documentation. It is very strange that there are no DIAMOND hits on your end. I ran the command below on the example E. coli genome; it selects only DIAMOND, so there are no results for EC number, HMMER, or dbCAN_sub:

run_dbcan EscheriaColiK12MG1655.faa protein --out_dir output_233 -t diamond

The overview result is as follows:

Gene ID         EC#     HMMER   dbCAN_sub       DIAMOND #ofTools
NP_414562.1     -       -       -       GT77    1
NP_414631.1     -       -       -       GT28    1
NP_414632.1     -       -       -       GT28    1
NP_414638.1     -       -       -       CE11    1
NP_414654.1     -       -       -       GH13_3  1
NP_414672.1     -       -       -       CE4     1
NP_414691.1     -       -       -       GT51    1
NP_414724.1     -       -       -       GT19    1
NP_414726.1     -       -       -       GH13_30 1
NP_414736.1     -       -       -       CBM50+GH25      1
NP_414747.1     -       -       -       CBM50+GH23      1
NP_414805.1     -       -       -       GH43_11 1
NP_414845.1     -       -       -       AA3_2   1
NP_414869.1     -       -       -       GH1     1
NP_414877.1     -       -       -       GH36    1
NP_414878.1     -       -       -       GH2     1
NP_414879.3     -       -       -       GH2     1
NP_414897.1     -       -       -       GT2     1
NP_414936.1     -       -       -       GH13_3  1
NP_414937.2     -       -       -       CBM34+GH13_21   1
NP_415006.1     -       -       -       GH152   1
NP_415017.1     -       -       -       CBM50   1
NP_415059.1     -       -       -       GH27    1
NP_415087.1     -       -       -       GH24    1
NP_415101.1     -       -       -       GT0     1
NP_415108.1     -       -       -       GH13_3  1
NP_415118.1     -       -       -       GT2     1
NP_415167.1     -       -       -       GH103   1
NP_415168.1     -       -       -       GH103   1
NP_415175.1     -       -       -       GH13_26 1
NP_415188.1     -       -       -       GT4     1
NP_415203.1     -       -       -       CE9     1
NP_415206.1     -       -       -       CE8     1
NP_415214.1     -       -       -       CBM48+GH13_9    1
NP_415252.1     -       -       -       GT51    1
NP_415254.1     -       -       -       GT2     1
NP_415255.1     -       -       -       GT2     1
NP_415256.1     -       -       -       GT2     1
NP_415257.1     -       -       -       GT22    1
NP_415260.1     -       -       -       GH38    1
NP_415279.1     -       -       -       GH3     1
NP_415293.1     -       -       -       CE8     1
NP_415296.1     -       -       -       AA5_1   1
NP_415403.1     -       -       -       GT4     1
NP_415410.1     -       -       -       GT2     1
NP_415541.1     -       -       -       GT2     1
NP_415542.1     -       -       -       CE4+GH153       1
NP_415543.1     -       -       -       CE4+GH153       1
NP_415567.1     -       -       -       GT2     1

Do you get the same overview result as mine?
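
To compare two runs without reading the table by eye, the number of rows with a hit in a given column can be counted. This is a minimal sketch, assuming the overview file is tab-separated with the header names shown above (`Gene ID`, `EC#`, `HMMER`, `dbCAN_sub`, `DIAMOND`, `#ofTools`) and that `-` marks "no hit"; the function name `count_tool_hits` is hypothetical.

```python
import csv
import io


def count_tool_hits(overview_text, column="DIAMOND"):
    """Count rows whose `column` holds something other than '-' in tab-separated overview text."""
    reader = csv.DictReader(io.StringIO(overview_text), delimiter="\t")
    # Missing or empty cells are treated the same as '-' (no hit).
    return sum(
        1 for row in reader
        if (row.get(column) or "-").strip() not in ("", "-")
    )


# Example call (the path is a placeholder for your own output directory):
#   with open("output_233/overview.txt") as handle:
#       print(count_tool_hits(handle.read()))
```

A count of zero for `DIAMOND` alongside non-zero counts for the other columns would match the symptom described in this issue.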

JosieMainwaring commented 1 month ago

Thanks for your reply! I updated my DIAMOND version to 2.1.9 and tried the above, and I'm still having the same problem! It runs as expected and reports no errors, but the diamond.out files and the DIAMOND column of the overview.txt file are still empty. Any other thoughts?

JosieMainwaring commented 1 month ago

I've just tried from scratch again, setting up a new environment and installing everything again from scratch and still having the same issue :( looking like this data will just be missing from my dissertation! (Which is due next week)

JosieMainwaring commented 1 month ago

I also need the standalone run_dbcan version (I can't use the online one) because it's a fungal genome.

linnabrown commented 1 month ago

Can you provide the data you are using? It does not make sense that DIAMOND returns no hits.

JosieMainwaring commented 1 month ago

Thanks for replying. I've been using the example data to try to get it to work. I have tried both nucleotide and amino acid sequences, using "run_dbcan EscheriaColiK12MG1655.fna prok --out_dir output_EscheriaColiK12MG1655" as well as the command you provided above: "run_dbcan EscheriaColiK12MG1655.faa protein --out_dir output_233 -t diamond"

And running my query data gave the same issue

yinlabniu commented 1 month ago

It sounds like your DIAMOND might not actually have worked. If you run "diamond help", do you see the help information?

If you do, you can try to run diamond on your protein file directly on the command line (i.e., not using run_dbcan), like "diamond blastp -d {cazy_indexfile} -e {dia_eval} -q {yourfaafile} -k 1 -o diamond.out -f 6". Let me know if you see any output in diamond.out.

Yanbin
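
The standalone command above can be filled in and assembled as follows. This is only a sketch under stated assumptions: the database prefix `db/CAZy` (a DIAMOND database built inside the `db` directory) and the e-value `1e-102` are assumed values for the `{cazy_indexfile}` and `{dia_eval}` placeholders and should be checked against your own installation and `run_dbcan -h`. The function name `diamond_blastp_command` is hypothetical.

```python
import shlex


def diamond_blastp_command(db, query, evalue="1e-102", out="diamond.out"):
    """Build the argument list for the standalone DIAMOND search suggested above."""
    return [
        "diamond", "blastp",
        "-d", str(db),        # DIAMOND database (assumed prefix, e.g. db/CAZy)
        "-e", str(evalue),    # e-value cutoff (assumed value)
        "-q", str(query),     # protein FASTA file to search
        "-k", "1",            # keep only the best target per query
        "-o", str(out),       # output file
        "-f", "6",            # tabular output format
    ]


# Print the command so it can be copied into a shell.
print(shlex.join(diamond_blastp_command("db/CAZy", "EscheriaColiK12MG1655.faa")))
```

The resulting list can also be passed to `subprocess.run(cmd, check=True)`, so that a non-zero exit from DIAMOND raises an error instead of silently leaving an empty diamond.out.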


JosieMainwaring commented 1 month ago

Yes, I see the following:

(dbcan3) tup@Tuptop-VirtualBox:~$ diamond help
diamond v2.1.9.163 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

Syntax: diamond COMMAND [OPTIONS]

Commands:
makedb                  Build DIAMOND database from a FASTA file
prepdb                  Prepare BLAST database for use with Diamond
blastp                  Align amino acid query sequences against a protein reference database
blastx                  Align DNA query sequences against a protein reference database
cluster                 Cluster protein sequences
linclust                Cluster protein sequences in linear time
realign                 Realign clustered sequences against their centroids
recluster               Recompute clustering to fix errors
reassign                Reassign clustered sequences to the closest centroid
view                    View DIAMOND alignment archive (DAA) formatted file
merge-daa               Merge DAA files
help                    Produce help message
version                 Display version information
getseq                  Retrieve sequences from a DIAMOND database file
dbinfo                  Print information about a DIAMOND database file
test                    Run regression tests
makeidx                 Make database index
greedy-vertex-cover     Compute greedy vertex cover

Possible [OPTIONS] for COMMAND can be seen with syntax: diamond COMMAND

Online documentation at http://www.diamondsearch.org

I'll try this for the example data, but for my query sequence I don't have an amino acid file unfortunately!

JosieMainwaring commented 1 month ago

What do I input for cazy_indexfile and dia_eval ?

linnabrown commented 1 month ago

Can you install the docker version? This is the fastest way.

JosieMainwaring commented 1 month ago

I haven't tried the docker version yet - not familiar with Docker at all. But I'll give it a try

Edit: Will it be fastest for a noob who doesn't yet have Docker installed?

JosieMainwaring commented 1 month ago

I don't have space on my computer to pull the haidyi/run_dbcan image for Docker setup - I'll have to try through my university HPC tomorrow! Thanks for help so far guys. It's the last piece of data I need - all just to write a couple of numbers into a table! Will be back tomorrow

yinlabniu commented 1 month ago

So you do have a working DIAMOND installed, but you don't have a protein FASTA file. This is likely the reason you don't have any results in diamond.out (I also doubt you will have meaningful results in hmmer.out). For eukaryotic genomes, we suggest using protein rather than nucleotide input; this is stated on the help page of the dbCAN web server.

Within run_dbcan, we call prodigal to predict protein-coding genes if users input a genome nucleotide FASTA. However, prodigal is designed for prokaryote and phage genomes, not for eukaryotes, so we do not recommend nucleotide input to run_dbcan for eukaryotic genomes. Instead, you should predict proteins outside of run_dbcan.

Yanbin
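
Before running in protein mode, it can help to confirm that the input file really contains amino acid sequences. The heuristic below is a rough sketch, not something run_dbcan provides: the function name `looks_like_nucleotide` and the 0.9 threshold are assumptions chosen for illustration, and short or unusual sequences may be misclassified.

```python
def looks_like_nucleotide(fasta_text, threshold=0.9):
    """Heuristic: True if at least `threshold` of the sequence letters are A, C, G, T, U or N."""
    # Join all sequence lines, skipping FASTA header lines that start with '>'.
    letters = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")
    ).upper()
    if not letters:
        return False
    nucleotide_like = sum(letters.count(c) for c in "ACGTUN")
    return nucleotide_like / len(letters) >= threshold


# Example call (the file name is a placeholder for your own input):
#   with open("my_genome.fna") as handle:
#       print(looks_like_nucleotide(handle.read()))
```

If this returns True for a eukaryotic genome, the advice above applies: predict proteins with a eukaryotic gene predictor first and pass the resulting protein FASTA to run_dbcan.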


JosieMainwaring commented 1 month ago

That makes sense for the query sequence, but why would the example E. coli data not work either? Including with the amino acid file? If I can get the example data working, then I still have hope for my query sequence. I'd just have to translate it to .faa by other means, right?

yinlabniu commented 1 month ago

The E. coli data should work. Yes, you can check whether the E. coli protein file works. The commands are here: https://dbcan.readthedocs.io/en/latest/user_guide/index.html. Example files are here: https://bcb.unl.edu/dbCAN2/download/Samples/.


JosieMainwaring commented 1 month ago

Thanks everyone for your help. I got everything (example data & query) working just by running all the same steps on my HPC. For whatever reason Diamond was just determined to be broken on my linux. So, not solved but worked around.

linnabrown commented 1 month ago

Again, I highly recommend using the Docker image if you run into this issue again. Each person may change their system configuration in ways that break the installation of other software. Docker creates its own Linux environment, so it won't interfere with your Linux system. @JosieMainwaring