dbcan_utils CGC_substrate_abund and dbcan_utils CGC_abund error

linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.

http://bcb.unl.edu/dbCAN2

GNU General Public License v3.0

dbcan_utils CGC_substrate_abund and dbcan_utils CGC_abund error #179

Open Ben-41 opened 5 months ago

Ben-41 commented 5 months ago

Report

Hi, I have encountered issues with the estimation of CGC substrate abundance and CGC abundance. I followed all the steps from the manual and everything ran smoothly, including dbcan_utils fam_abund and dbcan_utils fam_substrate_abund. However, when I ran dbcan_utils CGC_substrate_abund and dbcan_utils CGC_abund, this error was raised:

You are estimating the abundance of CGC/CGC substrate!
Reads are single end!
Total reads count: 218847!
Traceback (most recent call last):
  File "/home/cdd/anaconda3/envs/dbcan/bin/dbcan_utils", line 10, in <module>
    sys.exit(main())
  File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 621, in main
    PUL_abundance(args)
  File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 492, in PUL_abundance
    PUL_abund = CAZyme_Abundance_estimate(paras)
  File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 254, in __init__
    seqid2dbcan_annotation,cgcid2cgc_standard = Read_cgc_standard_out(parameters.PUL_annotation)
  File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 203, in Read_cgc_standard_out
    tmp_record = cgc_standard_line(line.rstrip().split("\t"))
  File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 191, in __init__
    self.gene_start = int(lines[4])
ValueError: invalid literal for int() with base 10: 'Gene Start'

Version information

No response

ZhengJinfang1220 commented 5 months ago

It seems the issue happens when the script reads the file "cgc_standard.out". Can you share this file here so I can debug the code?

Jinfang

Ben-41 commented 5 months ago

Hi Jinfang, the cgc_standard.out looks like this:

$ head cgc_standard.out
CGC#	Gene Type	Contig ID	Protein ID	Gene Start	Gene Stop	Direction	Protein Family
CGC1	TC	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_47	44478	45509	-	1.A.33.1.5
CGC1	CAZyme	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_48	45726	48518	+	GH2|GH2_e50
CGC1	STP	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_65	72550	73521	-	SIS+CBS+CBS
CGC1	CAZyme	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_71	77847	79862	-	GH36|GH36_e10
CGC1	TC	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_76	87219	88157	+	3.A.2.1.7
CGC1	TC	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_80	89523	91055	+	3.A.2.1.7
CGC1	TC	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_81	91077	91949	+	3.A.2.1.1
CGC1	TC	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_82	92024	93445	+	3.A.2.1.2
CGC1	STP	Group_1_2_bin.20_contig-100_0	Group_1_2_bin.20_contig-100_0_87	96421	97605	+	Aminotran_1_2

During the prediction of CGCs, the manual said I need to provide my own gff file, so I modified the gff file from the Prodigal output, changing:

Group_1_2_bin.20_contig-100_0	Prodigal_v2.6.3	CDS	3	449	62.4	+	0	ID=0_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.651;conf=100.00;score=62.40;cscore=59.18;sscore=3.22;rscore=0.00;uscore=0.00;tscore=3.22;
Group_1_2_bin.20_contig-100_0	Prodigal_v2.6.3	CDS	658	1245	11.3	-	0	ID=0_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.505;conf=93.04;score=11.28;cscore=6.79;sscore=4.49;rscore=1.54;uscore=0.03;tscore=3.57;

to

Group_1_2_bin.20_contig-100_0	Prodigal_v2.6.3	CDS	3	449	62.4	+	0	ID=Group_1_2_bin.20_contig-100_0_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.651;conf=100.00;score=62.40;cscore=59.18;sscore=3.22;rscore=0.00;uscore=0.00;tscore=3.22;
Group_1_2_bin.20_contig-100_0	Prodigal_v2.6.3	CDS	658	1245	11.3	-	0	ID=Group_1_2_bin.20_contig-100_0_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.505;conf=93.04;score=11.28;cscore=6.79;sscore=4.49;rscore=1.54;uscore=0.03;tscore=3.57;

I don't know if this is wrong.
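For readers who want to script the same ID change, here is a minimal Python sketch. It assumes a standard 9-column, tab-separated Prodigal GFF in which the attributes column starts with ID=<number>_<gene index>; the function name rewrite_prodigal_id is hypothetical and is not part of run_dbcan.

```python
import re

# Leading "ID=<contig number>_<gene index>" attribute as written by Prodigal.
_PRODIGAL_ID = re.compile(r"^ID=\d+_(\d+)")


def rewrite_prodigal_id(gff_line: str) -> str:
    """Replace ID=<n>_<k> with ID=<contig>_<k> on one GFF feature line.

    Hypothetical helper: comment lines, blank lines and lines that do not
    have exactly nine tab-separated columns are returned unchanged.
    """
    body = gff_line.rstrip("\n")
    ending = gff_line[len(body):]  # keep the original line ending, if any
    if body.startswith("#") or not body.strip():
        return gff_line
    fields = body.split("\t")
    if len(fields) != 9:
        return gff_line
    contig = fields[0]
    # A function replacement avoids escaping problems in the contig name.
    fields[8] = _PRODIGAL_ID.sub(
        lambda m: f"ID={contig}_{m.group(1)}", fields[8], count=1
    )
    return "\t".join(fields) + ending


example = "\t".join([
    "Group_1_2_bin.20_contig-100_0", "Prodigal_v2.6.3", "CDS", "3", "449",
    "62.4", "+", "0", "ID=0_1;partial=10;start_type=Edge;",
])
new_attributes = rewrite_prodigal_id(example).split("\t")[8]
print(new_attributes)
# ID=Group_1_2_bin.20_contig-100_0_1;partial=10;start_type=Edge;
```

The resulting IDs follow the same contig-plus-index pattern as the Protein ID column in the cgc_standard.out excerpt above.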

ZhengJinfang1220 commented 5 months ago

Thank you, Ben. Your input file looks normal. Can you check the function "Read_cgc_standard_out"? It should be defined at line 199 in the file "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py". Line 201 of the same file opens "cgc_standard.out" and reads it line by line. It should skip the first line, which is the header, as in the following screenshot. Can you check what these lines look like in your code?
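The header-skipping behaviour described here can be sketched as follows. This is not the actual dbCAN implementation: CgcRecord and read_cgc_standard are hypothetical names, and the column order is taken from the header shown in the cgc_standard.out excerpt earlier in the thread.

```python
from typing import List, NamedTuple


class CgcRecord(NamedTuple):
    """One data row of cgc_standard.out (hypothetical container)."""
    cgc_id: str
    gene_type: str
    contig_id: str
    protein_id: str
    gene_start: int
    gene_stop: int
    direction: str
    protein_family: str


def read_cgc_standard(lines: List[str]) -> List[CgcRecord]:
    """Parse tab-separated rows, skipping the first (header) line."""
    records = []
    for line in lines[1:]:  # lines[0] is the header and must not reach int()
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # ignore blank or truncated lines
        records.append(CgcRecord(
            fields[0], fields[1], fields[2], fields[3],
            int(fields[4]), int(fields[5]), fields[6], fields[7],
        ))
    return records


demo = [
    "CGC#\tGene Type\tContig ID\tProtein ID\tGene Start\tGene Stop\tDirection\tProtein Family\n",
    "CGC1\tTC\tGroup_1_2_bin.20_contig-100_0\tGroup_1_2_bin.20_contig-100_0_47\t44478\t45509\t-\t1.A.33.1.5\n",
]
records = read_cgc_standard(demo)
print(records[0].gene_start)  # 44478
```

If the header were not skipped, int("Gene Start") would raise the same ValueError that appears in the traceback.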

ZhengJinfang1220 commented 5 months ago

Group_1_2_bin.20_contig-100_0_47

Yes, your modification of the gff file is correct. You got the output file "cgc_standard.out", which you could not have obtained otherwise.

Ben-41 commented 5 months ago

ZhengJinfang1220 commented 5 months ago

'Gene Start

The code also looks normal, so what is going wrong? Could you check the input file again for another occurrence of the string "Gene Start" anywhere other than the first line?

If that still does not help, can you send me all the input files? I will debug on my PC. Here is my email: zhengjinfang1220@gamil.com
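One way to run the check suggested above is a short Python sketch that reports every line whose "Gene Start" column is not an integer. The function name find_non_integer_starts is hypothetical, and the repeated header in the demo data is invented purely for illustration.

```python
from typing import List, Tuple


def find_non_integer_starts(lines: List[str], column: int = 4) -> List[Tuple[int, str]]:
    """Return (1-based line number, value) for rows whose column is not an int."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= column:
            problems.append((number, "<missing column>"))
            continue
        try:
            int(fields[column])
        except ValueError:
            problems.append((number, fields[column]))
    return problems


header = "CGC#\tGene Type\tContig ID\tProtein ID\tGene Start\tGene Stop\tDirection\tProtein Family\n"
row = "CGC1\tTC\tcontig_a\tcontig_a_47\t44478\t45509\t-\t1.A.33.1.5\n"
# Invented example: a second header line in the middle of the file.
problems = find_non_integer_starts([header, row, header])
print(problems)  # [(1, 'Gene Start'), (3, 'Gene Start')]
```

Line 1 is expected to be reported because it is the header; any further hit points to a line that would break a parser that skips only the first line. To check a real file, pass open("cgc_standard.out").readlines() as the argument.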

powerby66 commented 5 months ago

Hello, I would like to ask a question, please. These are my commands and their output:

dbcan_utils fam_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
dbcan_utils CGC_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
dbcan_utils CGC_substrate_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...
(/data2/jpc/env/dbcan414) jpc 18:09:17 ~/project/db_test/output/EscheriaColiK12MG1655_abund
dbcan_utils fam_substrate_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...
(/data2/jpc/env/dbcan414) jpc 18:09:18 ~/project/db_test/output/EscheriaColiK12MG1655_abund
dbcan_utils CGC_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...
(/data2/jpc/env/dbcan414) jpc 18:09:18 ~/project/db_test/output/EscheriaColiK12MG1655_abund
dbcan_utils CGC_substrate_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...

I also can't find utils.py in /data2/jpc/env/dbcan414/lib/python3.7/site-packages/dbcan/utils. How can I solve this problem?
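A quick way to see what the active environment actually provides is sketched below, using only the Python standard library. The helper name locate_tool is hypothetical; run it with the interpreter of the environment in question. It only reports whether a dbcan_utils executable is on PATH and whether a dbcan.utils module can be located, and it says nothing about which dbCAN release ships them.

```python
import importlib.util
import shutil
from typing import Dict, Optional


def locate_tool(command: str, module: str) -> Dict[str, Optional[str]]:
    """Report where a command-line tool and a Python module are found.

    Hypothetical diagnostic helper; a value of None means "not found".
    """
    report: Dict[str, Optional[str]] = {
        "executable": shutil.which(command),  # full path, or None if not on PATH
        "module": None,
    }
    try:
        # For a dotted name, find_spec imports the parent package first.
        spec = importlib.util.find_spec(module)
    except ImportError:
        spec = None  # parent package is missing or not importable
    if spec is not None:
        report["module"] = spec.origin
    return report


print(locate_tool("dbcan_utils", "dbcan.utils"))
```

If both values are None, the installed package in that environment most likely does not include the utilities at all, which would be consistent with the "command not found" messages above.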

PaolaDiGianvito commented 5 months ago

Hi, I have found a similar issue. I'm trying to follow the tutorial on raw reads, and I have shotgun sequencing data. I have reached this point in the tutorial: P13. dbcan_utils to calculate the abundance of CAZyme families, subfamilies, CGCs, and substrates. (I skipped step P12 because I don't need a particular region; is that correct?) When I run this command:

dbcan_utils fam_abund -bt IS1_EF.depth.txt -i ../subs/IS1_ef.dbCAN -a TPM

I get this error:

you are estimating the abundance of CAZyme!
Reads are single end!
Total read count: 156453394!
Can not find read count information for CAZyme: k141_10018_1.

In the directory IS3_ef.dbCAN I have all 17 files. Can you help me?

powerby66 commented 4 months ago

Hello, have you managed to solve this problem? I ran into the same issue.

PaolaDiGianvito commented 4 months ago

Not yet. I used MetaEuk for gene prediction. Could this be the problem?

ZhengJinfang1220 commented 4 months ago

Hi, guys. We fixed this bug in the updated version of dbCAN several months ago. If you are still using an older version, please follow these steps:

1. Check line 93 of utils.py, the definition of ReadBedtoos.
2. Change line 95 from
   seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[1:]}
   to
   seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[0:]}

The bug caused the first line of read count information to be ignored.
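The effect of that one-character change can be illustrated with a small Python sketch. It is not the dbCAN code: ReadCount and parse_depth are hypothetical stand-ins, and the three-column layout of the demo lines (ID, length, read count) is an assumption made only for this example, not the real depth-file format.

```python
from typing import Dict, List, NamedTuple


class ReadCount(NamedTuple):
    """Hypothetical stand-in for the per-sequence read count record."""
    seqid: str
    length: int
    read_count: int


def parse_depth(lines: List[str], skip_first: bool) -> Dict[str, ReadCount]:
    """Key each record by its first whitespace-separated field."""
    start = 1 if skip_first else 0  # 1 mimics lines[1:], 0 mimics lines[0:]
    table = {}
    for line in lines[start:]:
        fields = line.split()
        if not fields:
            continue
        table[fields[0]] = ReadCount(fields[0], int(fields[1]), int(fields[2]))
    return table


# Assumed layout for illustration only: id, length, read count; no header line.
demo_lines = ["gene_1 900 12", "gene_2 1500 30"]
buggy = parse_depth(demo_lines, skip_first=True)
fixed = parse_depth(demo_lines, skip_first=False)
print(sorted(buggy))  # ['gene_2']
print(sorted(fixed))  # ['gene_1', 'gene_2']
```

With the old slice, the first sequence in a header-less file never enters the dictionary, so a later lookup for it fails.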

PaolaDiGianvito commented 4 months ago

Thank you. I have another question. I used MetaEuk for gene prediction, so I have to generate file.ffn with bedtools. Which file is better to use: the MetaEuk output or the prodigal.gff file generated during substrate prediction? Thank you, Paola

yinlabniu commented 4 months ago

Paola,

We do not recommend using run_dbcan for CGC prediction and CGC-based abundance profiling on eukaryotic data. The reason is that the CGC/PUL concept does not exist in eukaryotes. The gff generated by MetaEuk contains exons, which run_dbcan will wrongly treat as separate CDS/genes. However, you can still use run_dbcan for CAZyme prediction and CAZyme-based abundance profiling, as no gff file is used there.

Yanbin

PaolaDiGianvito commented 4 months ago

Thank you for your answer. If I understand correctly, I have to do step p5 and the steps after p9 in the tutorial. Is that right?

PaolaDiGianvito commented 4 months ago

Sorry for the question, but if I don't generate the ffn files, how can I estimate abundance? If I understand correctly, I need the depth file. Thank you

yinlabniu commented 4 months ago

That's right. For CAZyme-based abundance profiling, you only need to predict CAZymes (provide your own faa in p5), and you need ffn in p8 and p11. Any processes using contigs can be skipped.

PaolaDiGianvito commented 4 months ago

Hi, I tried what you suggested, but I am writing again because it doesn't work. These are my steps: I have shotgun metagenomic data from wine fermentation, I did the gene prediction with MetaEuk, and I ran steps p5, p8 (after removing duplicates), p10 and p11.

At step 13 I get this new error:

dbcan_utils fam_abund -bt GC1_D2.depth.txt -i /home/pdigianv/ita_gre/CAZyme/GC1_D2.CAZyme -a TPM
You are estimating the abundance of CAZyme!
Reads are single end!
Total reads count: 549495!
Can not find read count information for CAZyme: AA1.aln|k141_836|-|195|8.216e-54|1|149518|151017|151017[151017]:149518[149518]:1500[1500]

This happens even though I modified utils.py as suggested. Can you help me?

Paola Di Gianvito, PhD Tecnologo della ricerca, DISAFA, University of Turin Agricultural Microbiology and Food Technology Sector

Corso Enotria 2/C, Ampelion 12051 Alba - Cuneo - ITALY

PaolaDiGianvito commented 4 months ago

Good morning. I tried running my data from the MEGAHIT contigs, skipping the MetaEuk step and using file.fna and meta for CAZyme annotation, followed by ffn generation from the Prodigal gff files and steps 8 and 11. At step 13 I still get this error:

dbcan_utils fam_abund -bt GC1_D2.depth.txt -i ../GC1_D2.CAZyme -a TPM
You are estimating the abundance of CAZyme!
Reads are single end!
Total reads count: 43824341!
Can not find read count information for CAZyme: k141_111_4
...

I have changed line 95 in utils.py as suggested:

    def ReadBedtoos(filename):
        lines = open(filename).readlines()
        seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[0:]}
        normalized_tpm = 0.
        for seqid in seqid2info:
            seqid_depth = seqid2info[seqid]
            normalized_tpm += seqid_depth.read_count/seqid_depth.length
        return seqid2info,normalized_tpm

Can you help me?
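Since ReadBedtoos keys its dictionary on the first whitespace-separated field of each depth line, one thing worth checking is whether the CAZyme IDs actually occur as such keys. The Python sketch below does only that comparison; missing_read_counts is a hypothetical helper, the demo depth lines are invented, and how to collect the CAZyme IDs from the run_dbcan output is left open because it depends on the files at hand.

```python
from typing import Iterable, List


def missing_read_counts(depth_lines: Iterable[str], cazyme_ids: Iterable[str]) -> List[str]:
    """Return the CAZyme IDs that never appear as the first field of a depth line."""
    depth_ids = {line.split()[0] for line in depth_lines if line.strip()}
    return [cid for cid in cazyme_ids if cid not in depth_ids]


# Invented demo data: only the first column matters for this check.
demo_depth = [
    "k141_111_1 0 600 10",
    "k141_111_2 0 900 25",
]
missing = missing_read_counts(demo_depth, ["k141_111_2", "k141_111_4"])
print(missing)  # ['k141_111_4']
```

Any ID returned here would produce the "Can not find read count information" message, so comparing how the two files spell the gene IDs is a reasonable first step.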

powerby66 commented 4 months ago

May I ask if the GFF and FFN annotation files predicted by Prokka for fungi and protozoa are reliable?