Closed aleu785 closed 4 years ago
Andy,
These two files should match up. Can you give us some examples that don’t match? We will investigate then.
Thanks,
Yanbin
On Wednesday, February 26, 2020, aleu785 notifications@github.com wrote:
Hello Linna
I was wondering if you could provide more information on the proteins included in the CAZyDB.07312019.fa file (e.g. EC number and protein name).
I noticed there is a file called CAZyDB.07312019.fam.subfam.ec.txt; however, the proteins in that file do not match up with those in CAZyDB.07312019.fa. Is there a reason for this? Am I missing a file?
Hope to hear from you soon.
Thanks! Andy
-- Yanbin Yin, PhD Associate Professor Computational Biologist Department of Food Science and Technology Nebraska Food for Health Center Quantitative Life Sciences Initiative University of Nebraska-Lincoln Office: Food Innovation Center 253 Lab: FIC 208/317 Tel: 402-472-4303 Email: yyin@unl.edu Web: https://foodsci.unl.edu/yin; http://bcb.unl.edu
Hello Yanbin
The file CAZyDB.07312019.fam.subfam.ec.txt has 32,163 entries while there are 1,386,849 sequences in the CAZyDB.07312019.fa.
The files were taken from here: http://bcb.unl.edu/dbCAN2/download/Databases/
Cheers, Andy
@aleu785 The file CAZyDB.07312019.fam.subfam.ec.txt contains only the entries that have an EC number, so it's not surprising that 1,386,849 is much larger than 32,163. I grepped the entries with an EC number from CAZyDB.07312019.fa and got a count of 10,147 (fewer than 32,163). The difference arises partly because multiple domains were recorded separately in CAZyDB.07312019.fam.subfam.ec.txt. However, the mismatch is real. For example, entry AAA03217.1 (EC 2.4.1.17) in CAZyDB.07312019.fam.subfam.ec.txt is not included in CAZyDB.07312019.fa.
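For anyone who wants to reproduce a count like the one above without grep, here is a minimal Python sketch. It assumes that FASTA headers are pipe-separated, with the accession first and any family or EC fields after it (for example `>AAA03216.1|GT1|2.4.1.17`). That header layout, the EC pattern, and the placeholder IDs `XYZ00001.1` and `XYZ00002.1` are illustrative assumptions, not something confirmed in this thread, so check a few real headers first.

```python
import re
from typing import Iterable, Tuple

# Assumed EC shape: four dot-separated parts, later parts may be "-" (e.g. 3.2.1.-).
EC_PATTERN = re.compile(r"^\d+\.(?:\d+|-)\.(?:\d+|-)\.(?:n?\d+|-)$")


def count_headers_with_ec(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total headers, headers carrying at least one EC-like field)."""
    total = 0
    with_ec = 0
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        total += 1
        # Skip the first field (the accession) and test the remaining ones.
        fields = line[1:].strip().split("|")
        if any(EC_PATTERN.match(field) for field in fields[1:]):
            with_ec += 1
    return total, with_ec


# Tiny made-up sample in the assumed header format.
sample = [
    ">AAA03216.1|GT1|2.4.1.17",
    "MSEQ",
    ">XYZ00001.1|GH5_2",
    "MKK",
    ">XYZ00002.1|CBM50|GH18|3.2.1.14",
    "MAA",
]
total, with_ec = count_headers_with_ec(sample)
print(total, with_ec)
```

On the real file, the same function can be given an open file handle, e.g. `with open("CAZyDB.07312019.fa") as fh: count_headers_with_ec(fh)`.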
Thanks for giving an example. The http://www.cazy.org/GT1_characterized.html?debut_FUNC=300#pagination_FUNC page lists AAA03217.1. This protein is identical to, and represented by, AAA03216.1 (shown in bold on the CAZy page). In other words, CAZy considers AAA03216.1 and AAA03217.1 duplicates. Therefore, we kept only one representative protein in the FASTA file (CAZyDB.07312019.fa), but kept all proteins/duplicates with EC numbers in the EC file (CAZyDB.07312019.fam.subfam.ec.txt).
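To list which accessions fall into this "duplicate kept only in the EC file" category, a set difference between the two files is enough. The sketch below is a rough illustration under stated assumptions: the EC file is taken to be tab-separated with the accession in the second column (the `column=1` default is a guess, not confirmed here), and the FASTA header is taken to start with the accession followed by `|`. The sample rows are made up in that assumed layout, using the two IDs from this thread.

```python
from typing import Iterable, Set


def fasta_accessions(lines: Iterable[str]) -> Set[str]:
    """Collect the first pipe-separated field of every FASTA header."""
    return {
        line[1:].strip().split("|")[0]
        for line in lines
        if line.startswith(">")
    }


def table_accessions(lines: Iterable[str], column: int = 1) -> Set[str]:
    """Collect accessions from a tab-separated table.

    `column` is the zero-based index of the accession field; the default
    is an assumption and should be checked against the real file.
    """
    ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column and fields[column]:
            ids.add(fields[column])
    return ids


# Made-up rows mirroring the example discussed above.
fasta = [">AAA03216.1|GT1|2.4.1.17", "MSEQ"]
table = ["GT1\tAAA03216.1\t2.4.1.17", "GT1\tAAA03217.1\t2.4.1.17"]

# Accessions present in the EC table but absent from the FASTA file.
missing = sorted(table_accessions(table) - fasta_accessions(fasta))
print(missing)
```

Accessions reported as missing would then be candidates for CAZy duplicates whose representative is a different entry in the FASTA file; the script itself cannot tell which representative that is.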
Yanbin
On Fri, Feb 28, 2020 at 3:03 AM Dongyao Li notifications@github.com wrote:
@aleu785 The file CAZyDB.07312019.fam.subfam.ec.txt contains only the entries that have an EC number, so it's not surprising that 1,386,849 is much larger than 32,163. I grepped the entries with an EC number from CAZyDB.07312019.fa and got a count of 10,147 (fewer than 32,163). The difference arises partly because multiple domains were recorded separately in CAZyDB.07312019.fam.subfam.ec.txt. However, the mismatch is real. For example, entry AAA03217.1 (EC 2.4.1.17) in CAZyDB.07312019.fam.subfam.ec.txt is not included in CAZyDB.07312019.fa.
Ok cool. Thanks for the clarification!
Thanks to Dr. Yin for the clarification.