linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.
http://bcb.unl.edu/dbCAN2
GNU General Public License v3.0

Duplicate entries in CAZyDB.07312019.fa #35

Closed zeyaxue closed 4 years ago

zeyaxue commented 4 years ago

Hi Linna,

I downloaded the pre-compiled CAZyDB.07312019.fa from the dbCAN site to use with our in-house DIAMOND method. I found that the FASTA file contains duplicate entries: identical headers with identical sequences. For example, there are 2 entries of

>BBK85634.1|CBM20|GH77|
MTLIFNIEYRTSWGEEVRVLGSIPELGNNQPNKATPLHTVDGIHWTAEVDIQIPGNGSVEYSYHIYRDGRTIRTEWNSLPRILHVADNPKKVYRIEDCWKNLPEQQYFYTSAFTESLLAHRERSAAPKSYKKGLLIKAYAPCIDSDHCLALCGNQKALGDWNPDKAALMSDIDFPEWQVEVDAGKISFPLEYKFVLYNKKERRAVAWENNPNRYMADPQIAANETLAVGDRYVYFNLPAWKGSGVAVPVFSLRSEKSFGVGDFGDLKRMIDWAVATNQKAVQILPINDTTMTHTWTDSYPYSSISIYAFHPMYADLKQLGSLKDKKVMAEFNKRQKELNALPAVDYEAVNKTKWEYFHLIFKQEGEKVLASDAFRNFYEANKEWLQPYAVFSYLRDAYKTPNFREWAKYATYDAKEIETLCRPDSADYPHIAIYYYIQFNLHLQLLAATEHARANGVVLKGDIPIGISRNSVEAWKEPHYFNLNGQAGAPPDDFSVNGQNWGLPTYNWDVMEKDGYAWWMKRFHKMAEYFDAYRIDHILGFFRIWEIPMHAVHGLLGQFVPALPMTREEIESYGLAFREDFFLKPYIHEYFLGQIFGPHTDYVKQTFIEPTDTWEVYRMRPEFDTQRKVEAYFAGKTDDDSIWIRDGLYALISDVLFVPDRNNPHEYHPRIGVQHDYIYRALNDWEKAAFNRLYDQYYYHRHNDFWGQQAMKKLPQLTQSTHMLVCGEDLGMIPDCVAWVMNDLRILSLEIQRMPKDPKQEFGHTDWYPYRSVCTISTHDMSTLRGWWEEDFQQTQRYYNTMLGHYGAAPATATPELCEEVVRNHLHSNSILCILSLQDWMSMDGKWRNPNVQEERINIPANPRHYWRWRMHLTLEQLMKAESLNEKIRSMIESTGR

I ran the following commands to list the repeated headers; see the attached file, repeated_headers.txt.

grep ">" CAZyDB.07312019.fa > all_headers.txt
sort all_headers.txt | uniq -d > repeated_headers.txt

Is there a reason for the duplication, or is this an error?

Thanks, Zeya Xue

yinlabniu commented 4 years ago

Thanks for catching this. A protein with n different CAZyme domains appears n times in the CAZyDB.07312019.fa file. I have removed these duplicates and uploaded a non-redundant file to the download folder: http://bcb.unl.edu/dbCAN2/download/CAZyDB.07312019.fa.nr
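
For anyone who wants to de-duplicate a local copy rather than download the .nr file, the following is a minimal Python sketch. It is not the procedure used to build CAZyDB.07312019.fa.nr, which this thread does not describe. It assumes that two records are duplicates only when both the header and the full sequence match, and it keeps the first occurrence. The function names read_fasta and dedupe_fasta and the command-line arguments are hypothetical, chosen for illustration.

```python
import sys
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; a sequence may span several lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield each distinct (header, sequence) record once, in input order.

    Assumption: a record is a duplicate only if header AND sequence match.
    Records that share a header but differ in sequence are all kept.
    """
    seen = set()
    for record in read_fasta(lines):
        if record in seen:
            continue
        seen.add(record)
        yield record


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python dedupe_fasta.py input.fa output.fa
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        for header, seq in dedupe_fasta(src):
            dst.write(f">{header}\n{seq}\n")
```

Keying on the header and sequence together, rather than on the header alone as the grep/sort/uniq check above does, avoids silently dropping a record if two entries ever share a header but carry different sequences.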