Open szzxpxf opened 4 years ago
Same problem here. I tried with the E.coli example :
run_dbcan.py --cluster E.coli.gff E.coli.faa protein --db_dir ~/Databases/db \ run_dbcan.py --cluster E.coli.bed E.coli.faa protein --db_dir ~/Databases/db \ run_dbcan.py --cluster E.coli.gff E.coli.fna meta --db_dir ~/Databases/db \ run_dbcan.py --cluster E.coli.bed E.coli.fna meta --db_dir ~/Databases/db
and the cgc.out file always comes empty
Hi,
Along those above lines, I was wondering whether it is possible to obtain cgc.out
based on a previous run_dbcan.py
output consisting of only:
diamond.out
hmmer.out
overview.txt
prodigal.gff
uniInput
That is, would there be an easy way to run only the following without re-running DIAMOND and HMMER?:
run_dbcan.py uniInput protein --out_dir output --cluster prodigal.gff --db_dir ~/databases/dbCAN/db
I will check the stuffs during weekends. Thank you all for your feedback
As for @lalalagartija, I end up with empty cgc.out
. Here's what I did, with the latest version of run_dbcan.py
:
run_dbcan.py contigs.fasta prok \
--out_dir outdir \
-t diamond hmmer \
--db_dir ~/databases/dbCAN/db \
--hmm_cpu 12 \
--dia_cpu 12
Which produced the following files:
diamond.out
hmmer.out
overview.txt
prodigal.gff
uniInput
# copy the file away as re-running overwrites (I just wanted to check for this behavior)
mv outdir/diamond.out outdir/diamond_1.out
mv outdir/hmmer.out outdir/hmmer_1.out
mv outdir/overview.txt outdir/overview_1.txt
mv outdir/uniInput outdir/uniInput_1
run_dbcan.py uniInput_1 protein \ --out_dir outdir \ --db_dir ~/databases/dbCAN/db \ --hmm_cpu 16 \ --dia_cpu 16 \ --cluster outdir/prodigal.gff
Which produced:
17M Sep 11 22:42 cgc.gff 0 Sep 11 22:42 cgc.out 500K Sep 11 22:42 diamond_1.out 500K Sep 11 22:42 diamond.out 587K Sep 11 22:42 hmmer_1.out 587K Sep 11 22:42 hmmer.out 1.3M Sep 11 22:42 Hotpep.out 421K Sep 11 22:42 overview_1.txt 505K Sep 11 22:42 overview.txt 68M Sep 11 22:42 prodigal.gff 2.5M Sep 11 22:42 stp.out 626K Sep 11 22:42 tf-1.out 501K Sep 11 22:42 tf-2.out 2.1M Sep 11 22:42 tp.out 75M Sep 11 22:42 uniInput 75M Sep 11 22:42 uniInput_1
See the empty `cgc.out` file.
Maybe it is wrong to re-run on the .gff file produced in the first run?
Thanks for the help and the great tool!
Hi @linnabrown. Just wondering if you think you will have time to look at this. Otherwise, I would dig in the code and try to adapt a solution for my needs :) Thank you and hope things are going well for you.
So strange. I used this command but I can have result in cgc.out
run_dbcan.py EscheriaColiK12MG1655.fna prok -c cluster --out_dir output_EscheriaColiK12MG1655
And I don't have this issue like yours when protein data is input.
Please try this command and let me know
python CGCFinder.py output_EscheriaColiK12MG1655/cgc.gff -o cgc.out -s tp -d 2
I also had the same problem here, and the cgc.out file came empty. Then I checked my protein sequences and made sure that each protein has the same name in the corresponding gff file. Then I ran the following command:
run_dbcan.py ./proteins.faa protein -c ./my.gff --out_dir ./output0
Finally I got my results. Hope this feedback is helpful.
@iaunicorn Thank you for the hint!
Indeed, in my .gff file the sequences IDs are labeled as such
NODE_1_length_998062_cov_18.990727 Prodigal_v2.6.3 CDS 30 830 52.0 + 0 ;ID=1_1
whereas these IDs in the fasta file, the proteins are labeled with the ID# appended:
>NODE_1_length_998062_cov_18.990727_1 # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366
Did you edit your proteins.faa
input so that the fasta header match exactly the .gff entries?
i.e. in my case, leaving only >NODE_1_length_998062_cov_18.990727_1
(and without # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366
)?
Prefer asking in advance as my input is very large... Thanks!
@linnabrown I did not get the chance to run the test commands you shared but I will to verify that my issue could be the proteins naming. Thanks again!
Hi,
Thanks for using our tool. Our gff file should be gff3 format.
Best, Le
From: Lejzerowicz notifications@github.com Date: Monday, November 23, 2020 at 1:03 PM To: linnabrown/run_dbcan run_dbcan@noreply.github.com Cc: Huang, Le lehuang@email.unc.edu, Mention mention@noreply.github.com Subject: Re: [linnabrown/run_dbcan] CGC-Finder only works with fna input (#50)
@iaunicornhttps://github.com/iaunicorn Thank you for the hint!
Indeed, in my .gff file the sequences IDs are labeled as such
NODE_1_length_998062_cov_18.990727 Prodigal_v2.6.3 CDS 30 830 52.0 + 0 ;ID=1_1
whereas these IDs in the fasta file, the proteins are labeled with the ID# appended:
NODE_1_length_998062_cov_18.990727_1 # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366
Did you edit your proteins.faa input so that the fasta header match exactly the .gff entries? i.e. in my case, leaving only >NODE_1_length_998062_cov_18.990727_1 (and without # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366 )? Prefer asking in advance as my input is very large... Thanks!
@linnabrownhttps://github.com/linnabrown I did not get the chance to run the test commands you shared but I will to verify that my issue could be the proteins naming. Thanks again!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnabrown/run_dbcan/issues/50#issuecomment-732330139, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACMHALVIHR5TAX6JFC7AECDSRKPX5ANCNFSM4OHAQIQA.
OK, it seems I have been using Prodigal V2.6.3
, while GFF3 might only be an output of Prodigal.v3.0.0
...
Weird because I created a new conda env with it, following the install procedure conda create -n run_dbcan python=3.8 diamond hmmer prodigal -c conda-forge -c bioconda
.
I can look it up, but do you know whether it is possible to convert gff to gff3 format? If yes, do you know of some scripts doing this?
Thanks.
http://www.sequenceontology.org/cgi-bin/converter.cgi
From: Lejzerowicz notifications@github.com Date: Monday, November 23, 2020 at 1:17 PM To: linnabrown/run_dbcan run_dbcan@noreply.github.com Cc: Huang, Le lehuang@email.unc.edu, Mention mention@noreply.github.com Subject: Re: [linnabrown/run_dbcan] CGC-Finder only works with fna input (#50)
OK, it seems I have been using Prodigal V2.6.3, while GFF3 might only be an output of Prodigal.v3.0.0... Weird because I created a new conda env with it, following the install procedure conda create -n run_dbcan python=3.8 diamond hmmer prodigal -c conda-forge -c bioconda.
I can look it up, but do you know whether it is possible to convert gff to gff3 format? If yes, do you know of some scripts doing this?
Thanks.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnabrown/run_dbcan/issues/50#issuecomment-732337997, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACMHALSFQZWBFKWRU7E6L2DSRKRKNANCNFSM4OHAQIQA.
Answering my own question. To convert GFF to GFF3;
https://bmi.inf.ethz.ch/supplements/gff-tools
Best, F
@FranckLejzerowicz Yes, I edited my proteins.faa input.
See issue #78
Hello @linnabrown. Sorry to resurrect this old issue. I'm having the same problems with run_dbcan v.2.0.11, installed via conda following your instructions. Here are my commands and outputs:
source /home/miniconda3/bin/activate
conda create -n dbcan2 python=3.8 diamond hmmer prodigal -c conda-forge -c bioconda
conda activate dbcan2
pip install run-dbcan==2.0.11
Databases were installed as outlined in the README.
export OMP_NUM_THREADS=8
faa=/home/dbcan2/EscheriaColiK12MG1655.faa
gff=/home/dbcan2/EscheriaColiK12MG1655.gff
OUT=/workdir/dbcan2
mkdir -p $OUT
cd $OUT
source /home/miniconda3/bin/activate
conda activate dbcan2
run_dbcan.py $faa protein \
--out_dir test_faa \
-c $gff \
-t all \
--dia_cpu $OMP_NUM_THREADS \
--hmm_cpu $OMP_NUM_THREADS \
--hotpep_cpu $OMP_NUM_THREADS \
--tf_cpu $OMP_NUM_THREADS \
--db_dir /home/dbcan2/db
cgc.gff cgc.out overview.txt stp.out tf-1.out tf-2.out tp.out uniInput
cgc.out
is empty, and overview.txt
only includes column names.
export OMP_NUM_THREADS=8
fna=/home/dbcan2/EscheriaColiK12MG1655.fna
gff=/home/dbcan2/EscheriaColiK12MG1655.gff
OUT=/workdir/dbcan2
mkdir -p $OUT
cd $OUT
source /home/miniconda3/bin/activate
conda activate dbcan2
run_dbcan.py $fna prok \
--out_dir test_fna \
-c $gff \
-t all \
--dia_cpu $OMP_NUM_THREADS \
--hmm_cpu $OMP_NUM_THREADS \
--hotpep_cpu $OMP_NUM_THREADS \
--tf_cpu $OMP_NUM_THREADS \
--db_dir /home/dbcan2/db
cgc.gff cgc.out overview.txt prodigal.gff stp.out tf-1.out tf-2.out tp.out uniInput
As with the protein
method, cgc.out
and overview.txt
are empty. Running the prok
method without the -c $gff
option gives fewer files and the same empty overview.txt
.
The log files for both runs report zero errors. I've also attempted to run your program with the meta
parameter using three different inputs: (1) metagenomic contigs, (2) metagenomic genes (.fna) annotated using prokka, and (3) metagenomic proteins (.faa and .gff3) annotated using prokka. As with the E. coli test files, the output included empty files and no clear errors.
Concerning your comment above...
So strange. I used this command but I can have result in cgc.out
run_dbcan.py EscheriaColiK12MG1655.fna prok -c cluster --out_dir output_EscheriaColiK12MG1655
What is the purpose of your use of -c
with an .fna input? Your README seems to suggest -c
should only be invoked with the proteome
option:
[-c AuxillaryFile]- optional, include to enable CGCFinder. If using a proteome input, the AuxillaryFile must be a GFF or BED format file containing gene positioning information. Otherwise, the AuxillaryFile may be left blank.
I also had the same problem here, and the cgc.out file came empty. Then I checked my protein sequences and made sure that each protein has the same name in the corresponding gff file. Then I ran the following command:
run_dbcan.py ./proteins.faa protein -c ./my.gff --out_dir ./output0
Finally I got my results. Hope this feedback is helpful.
@iaunicorn can you expand on what you mean by "each protein has the same name in the corresponding gff file"? Do you mean that each defline of the .faa file has a matching Name=
attribute in field 9 of the .gff file? Is it the ID=
attribute that needs to be changed, or protein_id=
? Could you give an example of the correct format for a .faa+.gff pair?
I also had the same problem here, and the cgc.out file came empty. Then I checked my protein sequences and made sure that each protein has the same name in the corresponding gff file. Then I ran the following command: run_dbcan.py ./proteins.faa protein -c ./my.gff --out_dir ./output0 Finally I got my results. Hope this feedback is helpful.
@iaunicorn can you expand on what you mean by "each protein has the same name in the corresponding gff file"? Do you mean that each defline of the .faa file has a matching
Name=
attribute in field 9 of the .gff file? Is it theID=
attribute that needs to be changed, orprotein_id=
? Could you give an example of the correct format for a .faa+.gff pair?
@acvill Yes, I mean each defline of the .faa file has a matching Name=
attribute. Please see the following example:
TrA0997W MSLPVRTRDEGDDEGPPTSGLVDREEPSESLVGHEGRLEDGMLLRDFEERTTFYDAVAELQMTQTEAKLF ......
.gff file
scaffold_1 JGI exon 3056341 3056677 . + . Name=TrA0997W;transcriptId=110090 scaffold_1 JGI CDS 3056341 3056677 . + 0 Name=TrA0997W;proteinId=109692;exonNumber=1;product_name="" ......
I've found the issue with run_dbcan 2.0.11. Adding the optional -t all
parameter causes the program to break. Exclusion of -t
results in the correct output using the provided test files for both prok
and protein
inputs.
run_dbcan.py EscheriaColiK12MG1655.faa protein -c EscheriaColiK12MG1655.gff --out_dir out_protein --db_dir /home/dbcan2/db
run_dbcan.py EscheriaColiK12MG1655.fna prok --out_dir out_prok --db_dir /home/dbcan2/db
run_dbcan.py EscheriaColiK12MG1655.faa protein -c EscheriaColiK12MG1655.gff -t all --out_dir out_protein --db_dir /home/dbcan2/db
run_dbcan.py EscheriaColiK12MG1655.fna prok -t all --out_dir out_prok --db_dir /home/dbcan2/db
I will open a new issue referencing this thread.
Hello,
First of all, thank you for developing this awesome program.
I apologize for bringing up an old issue, but I am experiencing the same situation (empty cgc finder output) and have discovered additional information not mentioned above that I would like to share.
I am currently using run_dbcan
v4.1.4 and have observed that when using the --out_pre
flag, the code terminates with the cgc finder output (e.g. cgc.out
) being empty, regardless of whether the type of input is protein
or prok
.
If the --out_pre flag is not set, I have found that the cgc finder output files are generated even if the headers in the faa file and the IDs in the GFF file do not "perfectly" match.
For example, even if the faa header is c0001 Hypothetical protein
and the gff ID is c0001
, it works fine despite not being exactly the same.
If the --out_pre
flag is set, the below error messages are raised during the run_dbcan
dbCAN3 cgc finder out file: dbcan/cgc.out does not exist!
dbCAN3 dbsub output file: dbcan/dbcan-sub.hmm.out does not exist!
Substrate prediction based on major voting will not be applied!
dbCAN3 overview file: dbcan/overview.txt does not exist!
dbCAN3 protein sequences: dbcan/uniInput does not exist!
It seems that the prefix is not being properly applied. I tried adding the prefix in the relevant part of the cgc_substrate_prediction.py
script, but although this prevented the error messages from being printed, the cgc finder output files remained empty. Therefore, I suspect there might be some unintended behavior related to the --out_pre
flag beyond the error messages.
Currently, I am using the program without setting the --out_pre
flag and it works fine. Although I have not been able to test it extensively, I hope this information will be helpful for the further development of the program.
Thank you.
Hi there, thanks again for writing this useful tool!
I can run the CGC Finder without a problem with just fna files as input. However, when I tried providing my own protein sequences and the associated gff file, the CGC Finder output was blank. I tried using bed file as the [-c AuxillaryFile] and it still didn't work. I wonder if there is anything I am missing here.
Thank you!