linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.
http://bcb.unl.edu/dbCAN2
GNU General Public License v3.0

Missing metadata for CAZyDB.07312019.fa #34

aleu785 closed 4 years ago

aleu785 commented 4 years ago

Hello Linna

I was wondering if you could provide more information on the proteins included in the CAZyDB.07312019.fa file (e.g. EC number and protein name).

I noticed there is a file called CAZyDB.07312019.fam.subfam.ec.txt, but the proteins in that file do not match up with those in CAZyDB.07312019.fa. Is there a reason for this? Am I missing a file?

Hope to hear from you soon.

Thanks! Andy

yinlabniu commented 4 years ago

Andy,

These two files should match up. Can you give us some examples that don’t match? We will investigate then.

Thanks,

Yanbin

-- Yanbin Yin, PhD Associate Professor Computational Biologist Department of Food Science and Technology Nebraska Food for Health Center Quantitative Life Sciences Initiative University of Nebraska-Lincoln Office: Food Innovation Center 253 Lab: FIC 208/317 Tel: 402-472-4303 Email: yyin@unl.edu Web: https://foodsci.unl.edu/yin; http://bcb.unl.edu

aleu785 commented 4 years ago

Hello Yanbin

The file CAZyDB.07312019.fam.subfam.ec.txt has 32,163 entries, while CAZyDB.07312019.fa contains 1,386,849 sequences.

The files were taken from here: http://bcb.unl.edu/dbCAN2/download/Databases/
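For anyone who wants to reproduce these two counts, here is a minimal sketch. It assumes only that the FASTA file marks each record with a `>` header line and that the EC file holds one entry per non-empty line. The helper names and the tiny in-memory demo data are hypothetical, not taken from the dbCAN files.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by counting header lines (lines starting with '>')."""
    return sum(1 for line in handle if line.startswith(">"))


def count_nonempty_lines(handle):
    """Count table entries, assuming one entry per non-empty line."""
    return sum(1 for line in handle if line.strip())


# Hypothetical stand-ins for the real files, so the sketch runs as-is.
fasta_demo = io.StringIO(">DEMO0001.1|GT1\nMKV\nLLA\n>DEMO0002.1|GH5\nMAA\n")
table_demo = io.StringIO("GT1\tDEMO0001.1\t2.4.1.17\n\n")

n_seqs = count_fasta_records(fasta_demo)
n_rows = count_nonempty_lines(table_demo)
print(n_seqs, n_rows)
```

To run it on the downloaded files, pass `open("CAZyDB.07312019.fa")` and `open("CAZyDB.07312019.fam.subfam.ec.txt")` in place of the demo handles.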

Cheers, Andy

Li-Dongyao-ancore commented 4 years ago

@aleu785 CAZyDB.07312019.fam.subfam.ec.txt only contains entries that have an EC number, so it is not surprising that 1,386,849 is much larger than 32,163. I grepped the entries with an EC number from CAZyDB.07312019.fa and the count is 10,147 (fewer than 32,163). The difference is partly because multiple domains are recorded separately in CAZyDB.07312019.fam.subfam.ec.txt. However, the mismatch is real. For example, entry AAA03217.1 (EC 2.4.1.17) in CAZyDB.07312019.fam.subfam.ec.txt is not included in CAZyDB.07312019.fa.
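A check like the one described above could be sketched roughly as follows. The file layouts are assumptions, not something confirmed in this thread: FASTA headers are taken to be pipe-separated with the accession first (`>ACCESSION|FAMILY|...|EC`), and the EC file is taken to be tab-separated with the accession in the second column. The function names and demo accessions are hypothetical.

```python
import io
import re

# Assumed layouts (not confirmed in this thread):
#   FASTA header: >ACCESSION|FAMILY[|...][|EC]
#   EC table row: FAMILY<TAB>ACCESSION<TAB>EC
# Matches EC numbers such as 2.4.1.17 or 3.2.1.- (other forms are ignored).
EC_PATTERN = re.compile(r"^\d+\.(\d+|-)\.(\d+|-)\.(\d+|-)$")


def fasta_ids(handle):
    """Return (all accessions, accessions whose header carries an EC number)."""
    ids_all, ids_with_ec = set(), set()
    for line in handle:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].strip().split("|")
        accession = fields[0]
        ids_all.add(accession)
        if any(EC_PATTERN.match(field) for field in fields[1:]):
            ids_with_ec.add(accession)
    return ids_all, ids_with_ec


def ec_table_ids(handle, acc_column=1):
    """Return the set of accessions found in one column of a tab-separated table."""
    ids = set()
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) > acc_column and parts[acc_column]:
            ids.add(parts[acc_column])
    return ids


# Hypothetical demo data; DEMO0002.1 is listed in the table but not in the FASTA.
fasta_demo = io.StringIO(">DEMO0001.1|GT1|2.4.1.17\nMKV\n>DEMO0003.1|GH5\nMAA\n")
table_demo = io.StringIO("GT1\tDEMO0001.1\t2.4.1.17\nGT1\tDEMO0002.1\t2.4.1.17\n")

all_ids, with_ec = fasta_ids(fasta_demo)
table_ids = ec_table_ids(table_demo)
missing = sorted(table_ids - all_ids)
print(len(with_ec), missing)
```

Comparing sets of accessions rather than raw line counts sidesteps the multi-domain issue, since an accession that appears on several table rows is only counted once.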

yinlabniu commented 4 years ago

Thanks for giving an example. The http://www.cazy.org/GT1_characterized.html?debut_FUNC=300#pagination_FUNC page has AAA03217.1. This protein is the same as and represented by AAA03216.1 (shown in bold in the CAZy page). In other words, AAA03216.1 and AAA03217.1 are considered as duplicates by CAZy. Therefore, we only kept one representative protein in the fasta file (CAZyDB.07312019.fa), but still kept all proteins/duplicates with EC numbers in the ec file (CAZyDB.07312019.fam.subfam.ec.txt).

Yanbin

aleu785 commented 4 years ago

Ok cool. Thanks for the clarification!

Li-Dongyao-ancore commented 4 years ago

Thanks to Dr. Yin for the clarification.