linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.
http://bcb.unl.edu/dbCAN2
GNU General Public License v3.0

File not found error with Hotpep #38

Closed jaube closed 4 years ago

jaube commented 4 years ago

Hello,

I would like to use dbCAN2 to annotate protein sequences. My log file shows a Hotpep error:

Traceback (most recent call last):
  File "/home1/datahome/jaube/.local/bin/parallel_group_many_proteins_many_patterns_noDNA.py", line 92, in <module>
    p.ec = open("%s/%s/%s_group_ec.txt"%(peptide_dir_name, fam, fam))
FileNotFoundError: [Errno 2] No such file or directory: '/home1/datahome/jaube/.local/lib/python3.8/site-packages/Hotpep/CAZY_PPR_patterns/GH/GH149/GH149_group_ec.txt'
Assigning proteins to groups
Collecting Results
Assigning proteins to groups
Collecting Results
Traceback (most recent call last):
  File "/home1/datahome/jaube/.local/bin/parallel_group_many_proteins_many_patterns_noDNA.py", line 92, in <module>
    p.ec = open("%s/%s/%s_group_ec.txt"%(peptide_dir_name, fam, fam))
FileNotFoundError: [Errno 2] No such file or directory: '/home1/datahome/jaube/.local/lib/python3.8/site-packages/Hotpep/CAZY_PPR_patterns/PL/PL35/PL35_group_ec.txt'
Assigning proteins to groups
Collecting Results
Traceback (most recent call last):
  File "/home1/datahome/jaube/.local/bin/parallel_group_many_proteins_many_patterns_noDNA.py", line 92, in <module>
    p.ec = open("%s/%s/%s_group_ec.txt"%(peptide_dir_name, fam, fam))
FileNotFoundError: [Errno 2] No such file or directory: '/home1/datahome/jaube/.local/lib/python3.8/site-packages/Hotpep/CAZY_PPR_patterns/GT/GT106/GT106_group_ec.txt'
Assigning proteins to groups
Collecting Results
Traceback (most recent call last):
  File "/home1/datahome/jaube/.local/bin/parallel_group_many_proteins_many_patterns_noDNA.py", line 92, in <module>
    p.ec = open("%s/%s/%s_group_ec.txt"%(peptide_dir_name, fam, fam))
FileNotFoundError: [Errno 2] No such file or directory: '/home1/datahome/jaube/.local/lib/python3.8/site-packages/Hotpep/CAZY_PPR_patterns/CBM/CBM85/CBM85_group_ec.txt'

Also, the Hotpep output contains no proteins from the GH, PL, GT or CBM groups, while the annotation with hmmer and diamond found many.

linnabrown commented 4 years ago

It seems you are missing /home1/datahome/jaube/.local/lib/python3.8/site-packages/Hotpep/CAZY_PPR_patterns/*

Could you cd to /home1/datahome/jaube/.local/lib/python3.8/site-packages/Hotpep/CAZY_PPR_patterns/ (which holds the PPR patterns for hotpep) and show me the files under this directory?

jaube commented 4 years ago

The attached file contains the output of the following command:

ls -R /home1/datahome/jaube/.local/lib/python3.8/site-packages/Hotpep/CAZY_PPR_patterns > CAZY_PPR_patterns_files.txt

The missing files named in the error messages are not present in the CAZY_PPR_patterns folder of your GitHub repository.

CAZY_PPR_patterns_files.txt

HaidYi commented 4 years ago

I see. We will investigate this problem and get back to you ASAP. Thank you for using our tool.

HaidYi commented 4 years ago

Hi Dr. Yin @yinlabniu, these are the files that are missing from PPR:

CAZY_PPR_patterns/GH/GH149/GH149_group_ec.txt
CAZY_PPR_patterns/PL/PL35/PL35_group_ec.txt
CAZY_PPR_patterns/GT/GT106/GT106_group_ec.txt
CAZY_PPR_patterns/CBM/CBM85/CBM85_group_ec.txt

Neato-Nick commented 4 years ago

hotpep output for me has GH, PL, CE, and AA, so I'm missing GTs and CBMs.

Looking at my ppr directories, I noticed I'm mostly missing _group_ec.txt in those two families with no output. Below are the PPR families that are missing a _group_ec.txt file

Weirdly I'm also missing group_ec files for a lot of GH families, but my output contains plenty of those

yinlabniu commented 4 years ago

Another user raised the same issue a few weeks ago and we are working to update the PPR data. PPR does not work on CBMs per their paper, but we should have GTs included. It may be that your input happened to not contain GTs?

Yanbin



Neato-Nick commented 4 years ago

> It may be that your input happened to not contain GTs?

Totally possible but unlikely. Out of 2291 rows in my CAZy overview output file, 713 had at least one "GT" in either the HMM or DIAMOND column, and 185 of those were high confidence (a GT found by both HMM and DIAMOND, following the 2+ tools criterion suggested in the dbCAN2 paper).
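For anyone who wants to repeat this kind of count, here is a minimal Python sketch. It assumes the overview file is tab-separated with columns named "HMMER" and "DIAMOND"; those column names and the sample rows are illustrative assumptions, so adjust them to match your own output.

```python
import csv
import io
import re

# Hypothetical excerpt of an overview table. The column names and rows are
# assumptions made for illustration, not copied from a real run.
SAMPLE = (
    "Gene ID\tHMMER\tHotpep\tDIAMOND\t#ofTools\n"
    "p1\tGT2(10-200)\tN\tGT2\t2\n"
    "p2\tGH13(5-300)\tGH13(1)\tGH13\t3\n"
    "p3\tN\tN\tGT4\t1\n"
    "p4\tGT1(3-90)\tN\tN\t1\n"
)

# Matches a GT family label such as GT2 or GT105.
GT_PATTERN = re.compile(r"\bGT\d+")


def count_gt_rows(handle, hmm_col="HMMER", diamond_col="DIAMOND"):
    """Return (rows with a GT in either column, rows with a GT in both)."""
    either = both = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        in_hmm = bool(GT_PATTERN.search(row[hmm_col]))
        in_diamond = bool(GT_PATTERN.search(row[diamond_col]))
        either += in_hmm or in_diamond
        both += in_hmm and in_diamond
    return either, both


either, both = count_gt_rows(io.StringIO(SAMPLE))
print(f"GT in either tool: {either}; GT in both tools: {both}")
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` sample.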

Incidentally this is also the same dataset I mentioned in my GT2 issue https://github.com/linnabrown/run_dbcan/issues/39#issue-596708974

Neato-Nick commented 4 years ago

Actually, is there a small test dataset bundled with this program? I'd be willing to run some proteins guaranteed to return a mix of GT results and check the output.

yinlabniu commented 4 years ago

If there are 185 rows having GTs with hmm+diamond, then it must be hotpep not finding them as GTs due to the problem with PPR patterns. It is very surprising that hotpep didn't find any GTs in your query data. Do you not have any hotpep predicted GTs or just no _group_ec.txt files (this is the problem the other user also had)?

Neato-Nick commented 4 years ago

Surprisingly, most of the GTs do have _group_ec.txt files. I just double-checked: the only GT families missing one are the three I listed above. So hotpep just didn't find the GTs. Maybe I need to run hotpep separately, outside of dbcan, to double-check this?


yinlabniu commented 4 years ago

Yes, it is a good idea to run hotpep separately and see if it can find the GTs. Could you also send us your input sequence file? We will run it on our end as well to see what's going on.

Thanks,

Yanbin



Neato-Nick commented 4 years ago

I used run_dbcan.py with --tools hotpep. It crashed pretty quickly with the error below (the real message showed the absolute path to my home directory where I've written $HOME):

FileNotFoundError: [Errno 2] No such file or directory: '$HOME/.local/lib/python3.7/site-packages/Hotpep/CAZY_PPR_patterns/GT/GT105/GT105_group_ec.txt'

I emailed you the proteins I've been working with.

Thanks!

Neato-Nick commented 4 years ago

For others who find this thread: if hotpep classifies a query protein into a family that is missing a group_ec.txt file, it will not print any output for that superfamily. In my case, hotpep classified a protein as GT105, the error above shows that this family was missing its group_ec.txt file, and that crashed the run for the GT superfamily only.

chassenr commented 4 years ago

Hi, just FYI, I have the same problem working with the latest version of dbcan. When I look at the CAZY_PPR_patterns directory, the *_group_ec.txt files seem to be missing for the following 35 families: AA14,AA15,AA16,CBM82,CBM84,CBM85,GH146,GH147,GH148,GH149,GH150,GH151,GH152,GH153,GH154,GH156,GH158,GH159,GH160,GH161,GH162,GH163,GH164,GH165,GT105,GT106,GT107,PL28,PL29,PL31,PL33,PL34,PL35,PL36,PL37.
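A check like this can be scripted. The sketch below assumes the layout seen in the error messages earlier in the thread (CAZY_PPR_patterns/<class>/<family>/<family>_group_ec.txt); the function name is made up for illustration and is not part of Hotpep or run_dbcan.

```python
from pathlib import Path


def families_missing_ec(ppr_root):
    """Return the family names that have no <family>_group_ec.txt file.

    Assumes the layout <ppr_root>/<class>/<family>/<family>_group_ec.txt,
    as seen in the FileNotFoundError paths reported in this thread.
    """
    missing = []
    for class_dir in sorted(Path(ppr_root).iterdir()):
        if not class_dir.is_dir():
            continue
        for fam_dir in sorted(class_dir.iterdir()):
            ec_file = fam_dir / f"{fam_dir.name}_group_ec.txt"
            if fam_dir.is_dir() and not ec_file.exists():
                missing.append(fam_dir.name)
    return missing


# Example call; replace the path with your own site-packages location:
# print(",".join(families_missing_ec("/path/to/Hotpep/CAZY_PPR_patterns")))
```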

Cheers, Christiane

acbellorib commented 4 years ago

Hi there, just adding some information. I'm using dbCAN2 (version 2.0.6) and, after running a check similar to Christiane's, I can confirm that the same 35 families are missing their *_group_ec.txt files. The missing file error (GH153_group_ec.txt, for instance) happens when using the EscheriaColiK12MG1655 example data. As far as I can tell, some of these files are also missing from the original tar package hotpep-python-08-20-2019.tar.gz. Would rolling back to the contents of an older tar package solve the issue? Thanks for releasing the tool and for your kind attention!

Cheers, Antonio

linnabrown commented 4 years ago

Hi All,

This problem has been solved by adding more EC files. We also added one more column, called "EC number", to the hotpep.out file. If the subfamily is not found, the column shows "NA"; otherwise it shows the EC number in the form "x.x.x.x".
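As a rough illustration of that behaviour (this is not the actual patch, which is not shown in this thread), a missing file can be tolerated instead of raising FileNotFoundError. The sketch reuses the names peptide_dir_name and fam from the traceback above; both helper functions are hypothetical.

```python
from pathlib import Path


def load_group_ec(peptide_dir_name, fam):
    """Return the lines of <fam>_group_ec.txt, or None if the file is absent.

    Hypothetical helper: the path mirrors the open() call in the traceback,
    but the real Hotpep code may handle this differently.
    """
    path = Path(peptide_dir_name) / fam / f"{fam}_group_ec.txt"
    try:
        return path.read_text().splitlines()
    except FileNotFoundError:
        return None


def ec_or_na(ec_value):
    """Return the EC value for an output column, or "NA" when none is known."""
    return ec_value if ec_value else "NA"
```

With a guard like this, a family without EC data still produces an output row, with "NA" in the EC column, rather than stopping the run for the whole class.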

Use the command below to update the run_dbcan package. You don't need to reinstall the dependencies.

Thanks a lot for your feedback!

pip install run-dbcan==2.0.10 --user

Best, Le