linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.
http://bcb.unl.edu/dbCAN2
GNU General Public License v3.0
133 stars 40 forks source link

CGC-Finder only works with fna input #50

Open szzxpxf opened 4 years ago

szzxpxf commented 4 years ago

Hi there, thanks again for writing this useful tool!

I can run the CGC Finder without a problem with just fna files as input. However, when I tried providing my own protein sequences and the associated gff file, the CGC Finder output was blank. I tried using bed file as the [-c AuxillaryFile] and it still didn't work. I wonder if there is anything I am missing here.

Thank you!

lalalagartija commented 3 years ago

Same problem here. I tried with the E.coli example : run_dbcan.py --cluster E.coli.gff E.coli.faa protein --db_dir ~/Databases/db \ run_dbcan.py --cluster E.coli.bed E.coli.faa protein --db_dir ~/Databases/db \ run_dbcan.py --cluster E.coli.gff E.coli.fna meta --db_dir ~/Databases/db \ run_dbcan.py --cluster E.coli.bed E.coli.fna meta --db_dir ~/Databases/db

and the cgc.out file always comes empty

FranckLejzerowicz commented 3 years ago

Hi, Along those above lines, I was wondering whether it is possible to obtain cgc.out based on a previous run_dbcan.py output consisting of only:

diamond.out
hmmer.out
overview.txt
prodigal.gff
uniInput

That is, would there be an easy way to run only the following without re-running DIAMOND and HMMER?:

run_dbcan.py uniInput protein --out_dir output --cluster prodigal.gff --db_dir ~/databases/dbCAN/db
linnabrown commented 3 years ago

I will check the stuffs during weekends. Thank you all for your feedback

FranckLejzerowicz commented 3 years ago

As for @lalalagartija, I end up with empty cgc.out. Here's what I did, with the latest version of run_dbcan.py:

run_dbcan.py contigs.fasta prok \
    --out_dir outdir \
    -t diamond hmmer \
    --db_dir ~/databases/dbCAN/db \
    --hmm_cpu 12 \
    --dia_cpu 12

Which produced the following files:

diamond.out
hmmer.out
overview.txt
prodigal.gff
uniInput

run_dbcan.py uniInput_1 protein \ --out_dir outdir \ --db_dir ~/databases/dbCAN/db \ --hmm_cpu 16 \ --dia_cpu 16 \ --cluster outdir/prodigal.gff


Which produced:

17M Sep 11 22:42 cgc.gff 0 Sep 11 22:42 cgc.out 500K Sep 11 22:42 diamond_1.out 500K Sep 11 22:42 diamond.out 587K Sep 11 22:42 hmmer_1.out 587K Sep 11 22:42 hmmer.out 1.3M Sep 11 22:42 Hotpep.out 421K Sep 11 22:42 overview_1.txt 505K Sep 11 22:42 overview.txt 68M Sep 11 22:42 prodigal.gff 2.5M Sep 11 22:42 stp.out 626K Sep 11 22:42 tf-1.out 501K Sep 11 22:42 tf-2.out 2.1M Sep 11 22:42 tp.out 75M Sep 11 22:42 uniInput 75M Sep 11 22:42 uniInput_1


See the empty `cgc.out` file.

Maybe it is wrong to re-run on the .gff file produced in the first run?

Thanks for the help and the great tool!
FranckLejzerowicz commented 3 years ago

Hi @linnabrown. Just wondering if you think you will have time to look at this. Otherwise, I would dig in the code and try to adapt a solution for my needs :) Thank you and hope things are going well for you.

linnabrown commented 3 years ago

So strange. I used this command but I can have result in cgc.out

run_dbcan.py EscheriaColiK12MG1655.fna prok -c cluster --out_dir output_EscheriaColiK12MG1655
linnabrown commented 3 years ago

And I don't have this issue like yours when protein data is input.

linnabrown commented 3 years ago

Please try this command and let me know

python CGCFinder.py output_EscheriaColiK12MG1655/cgc.gff -o cgc.out -s tp -d 2
iaunicorn commented 3 years ago

I also had the same problem here, and the cgc.out file came empty. Then I checked my protein sequences and made sure that each protein has the same name in the corresponding gff file. Then I ran the following command:

run_dbcan.py ./proteins.faa protein -c ./my.gff --out_dir ./output0

Finally I got my results. Hope this feedback is helpful.

FranckLejzerowicz commented 3 years ago

@iaunicorn Thank you for the hint!

Indeed, in my .gff file the sequences IDs are labeled as such

NODE_1_length_998062_cov_18.990727  Prodigal_v2.6.3 CDS 30  830 52.0    +   0   ;ID=1_1

whereas these IDs in the fasta file, the proteins are labeled with the ID# appended:

>NODE_1_length_998062_cov_18.990727_1 # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366

Did you edit your proteins.faa input so that the fasta header match exactly the .gff entries? i.e. in my case, leaving only >NODE_1_length_998062_cov_18.990727_1 (and without # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366 )? Prefer asking in advance as my input is very large... Thanks!

@linnabrown I did not get the chance to run the test commands you shared but I will to verify that my issue could be the proteins naming. Thanks again!

linnabrown commented 3 years ago

Hi,

Thanks for using our tool. Our gff file should be gff3 format.

Best, Le

From: Lejzerowicz notifications@github.com Date: Monday, November 23, 2020 at 1:03 PM To: linnabrown/run_dbcan run_dbcan@noreply.github.com Cc: Huang, Le lehuang@email.unc.edu, Mention mention@noreply.github.com Subject: Re: [linnabrown/run_dbcan] CGC-Finder only works with fna input (#50)

@iaunicornhttps://github.com/iaunicorn Thank you for the hint!

Indeed, in my .gff file the sequences IDs are labeled as such

NODE_1_length_998062_cov_18.990727 Prodigal_v2.6.3 CDS 30 830 52.0 + 0 ;ID=1_1

whereas these IDs in the fasta file, the proteins are labeled with the ID# appended:

NODE_1_length_998062_cov_18.990727_1 # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366

Did you edit your proteins.faa input so that the fasta header match exactly the .gff entries? i.e. in my case, leaving only >NODE_1_length_998062_cov_18.990727_1 (and without # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366 )? Prefer asking in advance as my input is very large... Thanks!

@linnabrownhttps://github.com/linnabrown I did not get the chance to run the test commands you shared but I will to verify that my issue could be the proteins naming. Thanks again!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnabrown/run_dbcan/issues/50#issuecomment-732330139, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACMHALVIHR5TAX6JFC7AECDSRKPX5ANCNFSM4OHAQIQA.

FranckLejzerowicz commented 3 years ago

OK, it seems I have been using Prodigal V2.6.3, while GFF3 might only be an output of Prodigal.v3.0.0... Weird because I created a new conda env with it, following the install procedure conda create -n run_dbcan python=3.8 diamond hmmer prodigal -c conda-forge -c bioconda.

I can look it up, but do you know whether it is possible to convert gff to gff3 format? If yes, do you know of some scripts doing this?

Thanks.

linnabrown commented 3 years ago

http://www.sequenceontology.org/cgi-bin/converter.cgi

From: Lejzerowicz notifications@github.com Date: Monday, November 23, 2020 at 1:17 PM To: linnabrown/run_dbcan run_dbcan@noreply.github.com Cc: Huang, Le lehuang@email.unc.edu, Mention mention@noreply.github.com Subject: Re: [linnabrown/run_dbcan] CGC-Finder only works with fna input (#50)

OK, it seems I have been using Prodigal V2.6.3, while GFF3 might only be an output of Prodigal.v3.0.0... Weird because I created a new conda env with it, following the install procedure conda create -n run_dbcan python=3.8 diamond hmmer prodigal -c conda-forge -c bioconda.

I can look it up, but do you know whether it is possible to convert gff to gff3 format? If yes, do you know of some scripts doing this?

Thanks.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnabrown/run_dbcan/issues/50#issuecomment-732337997, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACMHALSFQZWBFKWRU7E6L2DSRKRKNANCNFSM4OHAQIQA.

FranckLejzerowicz commented 3 years ago

Answering my own question. To convert GFF to GFF3;

https://bmi.inf.ethz.ch/supplements/gff-tools

Best, F

iaunicorn commented 3 years ago

@FranckLejzerowicz Yes, I edited my proteins.faa input.

acvill commented 2 years ago

Edit

See issue #78


Hello @linnabrown. Sorry to resurrect this old issue. I'm having the same problems with run_dbcan v.2.0.11, installed via conda following your instructions. Here are my commands and outputs:

installation

source /home/miniconda3/bin/activate
conda create -n dbcan2 python=3.8 diamond hmmer prodigal -c conda-forge -c bioconda
conda activate dbcan2
pip install run-dbcan==2.0.11

Databases were installed as outlined in the README.

.faa & .gff test

export OMP_NUM_THREADS=8
faa=/home/dbcan2/EscheriaColiK12MG1655.faa
gff=/home/dbcan2/EscheriaColiK12MG1655.gff
OUT=/workdir/dbcan2
mkdir -p $OUT
cd $OUT

source /home/miniconda3/bin/activate
conda activate dbcan2

run_dbcan.py $faa protein \
       --out_dir test_faa \
       -c $gff \
       -t all \
       --dia_cpu $OMP_NUM_THREADS \
       --hmm_cpu $OMP_NUM_THREADS \
       --hotpep_cpu $OMP_NUM_THREADS \
       --tf_cpu $OMP_NUM_THREADS \
       --db_dir /home/dbcan2/db

output

cgc.gff  cgc.out  overview.txt  stp.out  tf-1.out  tf-2.out  tp.out  uniInput

cgc.out is empty, and overview.txt only includes column names.

.fna test

export OMP_NUM_THREADS=8
fna=/home/dbcan2/EscheriaColiK12MG1655.fna
gff=/home/dbcan2/EscheriaColiK12MG1655.gff
OUT=/workdir/dbcan2
mkdir -p $OUT
cd $OUT

source /home/miniconda3/bin/activate
conda activate dbcan2

run_dbcan.py $fna prok \
       --out_dir test_fna \
       -c $gff \
       -t all \
       --dia_cpu $OMP_NUM_THREADS \
       --hmm_cpu $OMP_NUM_THREADS \
       --hotpep_cpu $OMP_NUM_THREADS \
       --tf_cpu $OMP_NUM_THREADS \
       --db_dir /home/dbcan2/db

output

cgc.gff  cgc.out  overview.txt  prodigal.gff  stp.out  tf-1.out  tf-2.out  tp.out  uniInput

As with the protein method, cgc.out and overview.txt are empty. Running the prok method without the -c $gff option gives fewer files and the same empty overview.txt.

The log files for both runs report zero errors. I've also attempted to run your program with the meta parameter using three different inputs: (1) metagenomic contigs, (2) metagenomic genes (.fna) annotated using prokka, and (3) metagenomic proteins (.faa and .gff3) annotated using prokka. As with the E. coli test files, the output included empty files and no clear errors.


Concerning your comment above...

So strange. I used this command but I can have result in cgc.out

run_dbcan.py EscheriaColiK12MG1655.fna prok -c cluster --out_dir output_EscheriaColiK12MG1655

What is the purpose of your use of -c with an .fna input? Your README seems to suggest -c should only be invoked with the proteome option:

[-c AuxillaryFile]- optional, include to enable CGCFinder. If using a proteome input, the AuxillaryFile must be a GFF or BED format file containing gene positioning information. Otherwise, the AuxillaryFile may be left blank.

acvill commented 2 years ago

I also had the same problem here, and the cgc.out file came empty. Then I checked my protein sequences and made sure that each protein has the same name in the corresponding gff file. Then I ran the following command:

run_dbcan.py ./proteins.faa protein -c ./my.gff --out_dir ./output0

Finally I got my results. Hope this feedback is helpful.

@iaunicorn can you expand on what you mean by "each protein has the same name in the corresponding gff file"? Do you mean that each defline of the .faa file has a matching Name= attribute in field 9 of the .gff file? Is it the ID= attribute that needs to be changed, or protein_id=? Could you give an example of the correct format for a .faa+.gff pair?

iaunicorn commented 2 years ago

I also had the same problem here, and the cgc.out file came empty. Then I checked my protein sequences and made sure that each protein has the same name in the corresponding gff file. Then I ran the following command: run_dbcan.py ./proteins.faa protein -c ./my.gff --out_dir ./output0 Finally I got my results. Hope this feedback is helpful.

@iaunicorn can you expand on what you mean by "each protein has the same name in the corresponding gff file"? Do you mean that each defline of the .faa file has a matching Name= attribute in field 9 of the .gff file? Is it the ID= attribute that needs to be changed, or protein_id=? Could you give an example of the correct format for a .faa+.gff pair?

@acvill Yes, I mean each defline of the .faa file has a matching Name= attribute. Please see the following example:

.faa file

TrA0997W MSLPVRTRDEGDDEGPPTSGLVDREEPSESLVGHEGRLEDGMLLRDFEERTTFYDAVAELQMTQTEAKLF ......

.gff file

scaffold_1 JGI exon 3056341 3056677 . + . Name=TrA0997W;transcriptId=110090 scaffold_1 JGI CDS 3056341 3056677 . + 0 Name=TrA0997W;proteinId=109692;exonNumber=1;product_name="" ......

acvill commented 2 years ago

I've found the issue with run_dbcan 2.0.11. Adding the optional -t all parameter causes the program to break. Exclusion of -t results in the correct output using the provided test files for both prok and protein inputs.

Works

run_dbcan.py EscheriaColiK12MG1655.faa protein -c EscheriaColiK12MG1655.gff --out_dir out_protein --db_dir /home/dbcan2/db
run_dbcan.py EscheriaColiK12MG1655.fna prok --out_dir out_prok --db_dir /home/dbcan2/db

Does not work

run_dbcan.py EscheriaColiK12MG1655.faa protein -c EscheriaColiK12MG1655.gff -t all --out_dir out_protein --db_dir /home/dbcan2/db
run_dbcan.py EscheriaColiK12MG1655.fna prok -t all --out_dir out_prok --db_dir /home/dbcan2/db

I will open a new issue referencing this thread.

jylee3247 commented 2 weeks ago

Hello,

First of all, thank you for developing this awesome program.

I apologize for bringing up an old issue, but I am experiencing the same situation (empty cgc finder output) and have discovered additional information not mentioned above that I would like to share.

I am currently using run_dbcan v4.1.4 and have observed that when using the --out_pre flag, the code terminates with the cgc finder output (e.g. cgc.out) being empty, regardless of whether the type of input is protein or prok.

If the --out_pre flag is not set, I have found that the cgc finder output files are generated even if the headers in the faa file and the IDs in the GFF file do not "perfectly" match. For example, even if the faa header is c0001 Hypothetical protein and the gff ID is c0001, it works fine despite not being exactly the same.

If the --out_pre flag is set, the below error messages are raised during the run_dbcan

dbCAN3 cgc finder out file: dbcan/cgc.out does not exist!
dbCAN3 dbsub output file: dbcan/dbcan-sub.hmm.out does not exist!
Substrate prediction based on major voting will not be applied!
dbCAN3 overview file: dbcan/overview.txt does not exist!
dbCAN3 protein sequences: dbcan/uniInput does not exist!

It seems that the prefix is not being properly applied. I tried adding the prefix in the relevant part of the cgc_substrate_prediction.py script, but although this prevented the error messages from being printed, the cgc finder output files remained empty. Therefore, I suspect there might be some unintended behavior related to the --out_pre flag beyond the error messages.

Currently, I am using the program without setting the --out_pre flag and it works fine. Although I have not been able to test it extensively, I hope this information will be helpful for the further development of the program.

Thank you.