linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.
http://bcb.unl.edu/dbCAN2
GNU General Public License v3.0

dbcan_utils CGC_substrate_abund and dbcan_utils CGC_abund error #179

Open Ben-41 opened 2 weeks ago

Ben-41 commented 2 weeks ago

Report

Hi, I have encountered issues with the estimation of CGC substrate abundance and CGC abundance. I followed all the steps from the manual and everything ran smoothly, including dbcan_utils fam_abund and dbcan_utils fam_substrate_abund. However, when I ran dbcan_utils CGC_substrate_abund and dbcan_utils CGC_abund, this error was raised:

You are estimating the abundance of CGC/CGC substrate!
Reads are single end!
Total reads count: 218847!
Traceback (most recent call last):
  File "/home/cdd/anaconda3/envs/dbcan/bin/dbcan_utils", line 10, in <module>
    sys.exit(main())
  File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 621, in main
    PUL_abundance(args)
  File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 492, in PUL_abundance
    PUL_abund = CAZyme_Abundance_estimate(paras)
  File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 254, in __init__
    seqid2dbcan_annotation,cgcid2cgc_standard = Read_cgc_standard_out(parameters.PUL_annotation)
  File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 203, in Read_cgc_standard_out
    tmp_record = cgc_standard_line(line.rstrip().split("\t"))
  File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 191, in __init__
    self.gene_start = int(lines[4])
ValueError: invalid literal for int() with base 10: 'Gene Start'

Version information

No response

ZhengJinfang1220 commented 2 weeks ago

It seems the issue happens when the script reads the file "cgc_standard.out". Can you share this file here so I can debug the code?

Jinfang

Ben-41 commented 2 weeks ago

Hi Jinfang, the cgc_standard.out looks like this:

$ head cgc_standard.out
CGC#	Gene Type	Contig ID	Protein ID	Gene Start	Gene Stop	Direction	Protein Family
CGC1	TC	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_47	44478	45509	-	1.A.33.1.5
CGC1	CAZyme	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_48	45726	48518	+	GH2|GH2_e50
CGC1	STP	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_65	72550	73521	-	SIS+CBS+CBS
CGC1	CAZyme	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_71	77847	79862	-	GH36|GH36_e10
CGC1	TC	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_76	87219	88157	+	3.A.2.1.7
CGC1	TC	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_80	89523	91055	+	3.A.2.1.7
CGC1	TC	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_81	91077	91949	+	3.A.2.1.1
CGC1	TC	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_82	92024	93445	+	3.A.2.1.2
CGC1	STP	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_87	96421	97605	+	Aminotran_1_2

For the prediction of CGCs, the manual says I need to provide my own gff file, so I modified the gff file from the Prodigal output, changing the ID attribute of each line from:

Group_1_2_bin.20_contig-100_0 Prodigal_v2.6.3 CDS 3 449 62.4 + 0 ID=0_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.651;conf=100.00;score=62.40;cscore=59.18;sscore=3.22;rscore=0.00;uscore=0.00;tscore=3.22;
Group_1_2_bin.20_contig-100_0 Prodigal_v2.6.3 CDS 658 1245 11.3 - 0 ID=0_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.505;conf=93.04;score=11.28;cscore=6.79;sscore=4.49;rscore=1.54;uscore=0.03;tscore=3.57;

to:

Group_1_2_bin.20_contig-100_0 Prodigal_v2.6.3 CDS 3 449 62.4 + 0 ID=Group_1_2_bin.20_contig-100_0_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.651;conf=100.00;score=62.40;cscore=59.18;sscore=3.22;rscore=0.00;uscore=0.00;tscore=3.22;
Group_1_2_bin.20_contig-100_0 Prodigal_v2.6.3 CDS 658 1245 11.3 - 0 ID=Group_1_2_bin.20_contig-100_0_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.505;conf=93.04;score=11.28;cscore=6.79;sscore=4.49;rscore=1.54;uscore=0.03;tscore=3.57;

I don't know if this is wrong.
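For readers who need to make the same change, the ID rewrite described above can be sketched in Python. This is only an illustration under stated assumptions, not part of run_dbcan: the function name prefix_prodigal_ids is hypothetical, the GFF is assumed to be tab-separated with nine columns, and the ID attribute is assumed to come first in column 9, as in the Prodigal lines shown.

```python
def prefix_prodigal_ids(gff_lines):
    """Yield GFF lines with 'ID=<n>_<gene>' rewritten to 'ID=<contig>_<gene>'.

    Hypothetical helper: assumes tab-separated, 9-column Prodigal GFF lines
    whose first attribute in column 9 is the ID.
    """
    for line in gff_lines:
        # Pass comments and blank lines through unchanged.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        ending = "\n" if line.endswith("\n") else ""
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            yield line  # not a feature line we understand; leave it alone
            continue
        contig = cols[0]
        attrs = cols[8].split(";")
        if attrs[0].startswith("ID="):
            # Keep only the gene index after the last underscore.
            gene_index = attrs[0][3:].rsplit("_", 1)[-1]
            attrs[0] = f"ID={contig}_{gene_index}"
        cols[8] = ";".join(attrs)
        yield "\t".join(cols) + ending
```

With a file on disk, the generator could be consumed as `out.writelines(prefix_prodigal_ids(open("prodigal.gff")))`, where the file name is a placeholder.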

ZhengJinfang1220 commented 2 weeks ago

Thank you, Ben. Your input file looks normal. Can you check the function "Read_cgc_standard_out"? It should be defined at line 199 of the file "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py". The code at line 201 of the same file opens the file "cgc_standard.out" and reads it line by line. It should skip the first line, which is the header, as in the following screenshot. Can you check what these lines look like in your code?

(screenshot 20240619094256)
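The header-skipping behaviour described in this comment can be sketched as follows. This is not the code in dbcan's utils.py; the function name read_cgc_standard and the dictionary keys are hypothetical, and the column order is taken from the header of the cgc_standard.out posted in this thread. If the header line were not skipped, int() would be called on the text 'Gene Start', which is exactly the ValueError in the traceback.

```python
def read_cgc_standard(lines):
    """Parse tab-separated cgc_standard.out lines, skipping the header.

    Hypothetical sketch; column order follows the header shown in this thread:
    CGC#, Gene Type, Contig ID, Protein ID, Gene Start, Gene Stop,
    Direction, Protein Family.
    """
    records = []
    for index, line in enumerate(lines):
        if index == 0:
            continue  # first line is the header, not data
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # ignore blank or truncated lines
        records.append({
            "cgc_id": fields[0],
            "gene_type": fields[1],
            "contig_id": fields[2],
            "protein_id": fields[3],
            "gene_start": int(fields[4]),
            "gene_stop": int(fields[5]),
            "direction": fields[6],
            "protein_family": fields[7],
        })
    return records
```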
ZhengJinfang1220 commented 2 weeks ago

Group_1_2_bin.20_contig-100_0_47

Yes, you modified the gff file correctly, and you got the output file "cgc_standard.out". Otherwise, you could not have obtained this output.

Ben-41 commented 2 weeks ago
image
ZhengJinfang1220 commented 2 weeks ago

'Gene Start

The code also looks normal. So, what is happening? Could you check the input file again for another occurrence of the string "Gene Start" besides the first line?

If that still does not work, can you send me all the input files? I will debug on my PC. Here is my email: zhengjinfang1220@gamil.com
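The check suggested here can be automated with a few lines of Python. A minimal sketch, assuming a tab-separated file whose fifth column is Gene Start; the function name find_bad_gene_start is hypothetical. It reports every line after the first whose Gene Start field is not a plain integer, for instance a repeated header line (one possible cause, not something confirmed in this thread).

```python
def find_bad_gene_start(lines):
    """Return (line_number, value) pairs where column 5 is not an integer.

    Line 1 is assumed to be the header and is ignored. Hypothetical helper
    for inspecting a tab-separated cgc_standard.out.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        if number == 1:
            continue  # expected header
        fields = line.rstrip("\n").split("\t")
        value = fields[4] if len(fields) > 4 else ""
        if not value.isdigit():
            bad.append((number, value))
    return bad
```

It could be run as `print(find_bad_gene_start(open("cgc_standard.out")))`; an empty list would mean no stray header or malformed line was found.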

powerby66 commented 2 weeks ago

Hello, I would like to ask a question. I ran the following commands:

dbcan_utils fam_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPMoutput/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
dbcan_utils CGC_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
dbcan_utils CGC_substrate_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...
(/data2/jpc/env/dbcan414) jpc 18:09:17 ~/project/db_test/output/EscheriaColiK12MG1655_abund
dbcan_utils fam_substrate_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...
(/data2/jpc/env/dbcan414) jpc 18:09:18 ~/project/db_test/output/EscheriaColiK12MG1655_abund
dbcan_utils CGC_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...
(/data2/jpc/env/dbcan414) jpc 18:09:18 ~/project/db_test/output/EscheriaColiK12MG1655_abund
dbcan_utils CGC_substrate_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...

But I can't find utils.py in /data2/jpc/env/dbcan414/lib/python3.7/site-packages/dbcan/utils. How can I solve this problem?

(image 1718964724642)

PaolaDiGianvito commented 1 week ago

Hi, I have found a similar issue. I'm trying to follow the tutorial on raw reads with my shotgun sequencing data. I have reached this point in the tutorial: P13. dbcan_utils to calculate the abundance of CAZyme families, subfamilies, CGCs, and substrates. (I skipped P12 because I don't need a particular region; is that correct?) When I run this command:

dbcan_utils fam_abund -bt IS1_EF.depth.txt -i ../subs/IS1_ef.dbCAN -a TPM

I get this error:

you are estimating the abundance of CAZyme!
Reads are single end!
Total read count: 156453394!
Can not find read count information for CAZyme: k141_10018_1.

In the directory IS3_ef.dbCAN I have all 17 files. Can you help me?

powerby66 commented 4 days ago

Hello, have you managed to solve this problem? I ran into it too.

PaolaDiGianvito commented 4 days ago

Not yet, really. I used MetaEuk for gene prediction. Could this be the problem?

ZhengJinfang1220 commented 3 days ago

Hi, guys. We fixed this bug in the updated version of dbCAN several months ago. If you are still using the older version, please follow these steps:

1. Check line 93 of utils.py, the definition of ReadBedtoos.
2. Modify line 95 from

seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[1:]}

to

seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[0:]}

The bug was that the first line of read count information was ignored.
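To illustrate why the slice matters, here is a minimal sketch; it is not the dbCAN implementation. It assumes a header-less, whitespace-separated depth table whose first column is the sequence ID. The helper name read_ids, the sequence IDs and the numeric columns are all made up for the example.

```python
def read_ids(lines, skip_first):
    """Collect the first column of each non-empty line.

    skip_first=True mimics slicing with lines[1:], skip_first=False mimics
    lines[0:]. Hypothetical helper for illustration only.
    """
    start = 1 if skip_first else 0
    return {line.split()[0] for line in lines[start:] if line.strip()}


# Toy depth table without a header line (values are invented).
depth_lines = [
    "seq_1 0 900 12\n",
    "seq_2 0 450 3\n",
]

with_bug = read_ids(depth_lines, skip_first=True)
fixed = read_ids(depth_lines, skip_first=False)

print("seq_1" in with_bug)  # False: the first record was dropped
print("seq_1" in fixed)     # True: every record is kept
```

When the file has no header, dropping the first line means the sequence on that line has no read count, so a later lookup for it fails.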

PaolaDiGianvito commented 3 days ago

Thank you. I have another question. I used MetaEuk for gene prediction, so I have to generate file.ffn with bedtools. Which file is better to use: the output of MetaEuk or the prodigal.gff files generated during the substrate prediction? Thank you, Paola

yinlabniu commented 3 days ago

Paola,

We do not recommend using run_dbcan for CGC prediction and CGC-based abundance profiling on eukaryotic data, because the CGC/PUL concept does not exist in eukaryotes. The gff generated by MetaEuk contains exons, which run_dbcan will wrongly treat as separate CDS/genes. However, you can still use run_dbcan for CAZyme prediction and CAZyme-based abundance profiling, as no gff file is used there.

Yanbin


PaolaDiGianvito commented 3 days ago

Thank you for your answer. If I understand correctly, I have to do the steps p5 and after p9 in the tutorial, is that right?

PaolaDiGianvito commented 3 days ago

Sorry for the question, but if I don't generate the ffn files, how can I estimate abundance? If I understand correctly, I need the depth file. Thank you

yinlabniu commented 3 days ago

That's right. For CAZyme-based abundance profiling, you only need to predict CAZymes (provide your own faa in p5), and you need ffn in p8 and p11. Any processes using contigs can be skipped.


PaolaDiGianvito commented 2 days ago

Hi, I tried what you suggested, but I am writing again because it doesn't work. These are my steps: I have shotgun metagenomic data from wine fermentation, I did the gene prediction with MetaEuk, and I ran steps p5, p8 (after removing duplicates), p10 and p11.

At step 13 I get this new error:

dbcan_utils fam_abund -bt GC1_D2.depth.txt -i /home/pdigianv/ita_gre/CAZyme/GC1_D2.CAZyme -a TPM
You are estimating the abundance of CAZyme!
Reads are single end!
Total reads count: 549495!
Can not find read count information for CAZyme: AA1.aln|k141_836|-|195|8.216e-54|1|149518|151017|151017[151017]:149518[149518]:1500[1500]

even though I have modified utils.py as suggested. Can you help me?
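Since the error says that no read count was found for a given ID, one check that may help narrow this down is to compare the sequence IDs in the annotation with those in the depth file. A minimal sketch with hypothetical helper names and toy IDs; it assumes the ID is the first whitespace-separated field of each line and that any header line has already been removed, which may not match the real files.

```python
def first_fields(lines):
    """Return the set of first whitespace-separated fields, skipping blanks and '#' lines."""
    return {
        line.split()[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def missing_ids(annotation_lines, depth_lines):
    """IDs present in the annotation but absent from the depth table (hypothetical helper)."""
    return sorted(first_fields(annotation_lines) - first_fields(depth_lines))


# Toy data: the annotation carries a decorated ID that the depth table lacks.
annotation = ["geneA|extra\tGH2\n", "geneB\tGH36\n"]
depth = ["geneA 0 100 5\n", "geneB 0 200 7\n"]

print(missing_ids(annotation, depth))  # ['geneA|extra']
```

Any ID this reports would be one for which a lookup in the depth table cannot succeed.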

Paola Di Gianvito, PhD, Research Technologist, DISAFA, University of Turin, Agricultural Microbiology and Food Technology Sector

Corso Enotria 2/C, Ampelion 12051 Alba - Cuneo - ITALY

PaolaDiGianvito commented 1 day ago

Good morning, I tried running my data starting from the MEGAHIT contigs, skipping the MetaEuk step and using file.fna with meta for CAZyme annotation, followed by ffn generation from the Prodigal gff files and steps 8 and 11. At step 13 I still get this error:

dbcan_utils fam_abund -bt GC1_D2.depth.txt -i ../GC1_D2.CAZyme -a TPM
You are estimating the abundance of CAZyme!
Reads are single end!
Total reads count: 43824341!
Can not find read count information for CAZyme: k141_111_4
...

I have changed line 95 in utils.py as suggested:

def ReadBedtoos(filename):
    lines = open(filename).readlines()
    seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[0:]}
    normalized_tpm = 0.
    for seqid in seqid2info:
        seqid_depth = seqid2info[seqid]
        normalized_tpm += seqid_depth.read_count/seqid_depth.length
    return seqid2info,normalized_tpm

Can you help me?