Closed zeyaxue closed 4 years ago
Thanks for catching this. A protein with n different CAZyme domains appears n times in the CAZyDB.07312019.fa file. I have removed these duplicates and uploaded a non-redundant file to the download folder: http://bcb.unl.edu/dbCAN2/download/CAZyDB.07312019.fa.nr
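For anyone who wants to deduplicate a local copy, here is a minimal sketch of one way to do it. It is not necessarily how the .nr file above was produced. It assumes that records sharing a header also share an identical sequence, as reported in this issue, and keeps only the first record for each header. The function name `dedup_fasta` and the output file name are illustrative.

```shell
# dedup_fasta: print a FASTA file, keeping only the first record per header.
# Assumption: records with the same header line carry the same sequence,
# so dropping later copies loses no information.
dedup_fasta() {
    awk '
        /^>/ { keep = !seen[$0]++ }   # header line: keep the record only if this header is new
        keep                          # print the header and its sequence lines while keep is set
    ' "$1"
}

# Hypothetical usage (the output file name is illustrative):
# dedup_fasta CAZyDB.07312019.fa > CAZyDB.07312019.dedup.fa
```

Running `grep ">" CAZyDB.07312019.dedup.fa | sort | uniq -d` on the result should then print nothing.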
Hi Linna,
I downloaded the pre-compiled CAZyDB.07312019.fa from the dbCAN site to use with our in-house DIAMOND method. I found that the FASTA file contains duplicate headers with identical sequences. For example, there are 2 entries of
I ran the following commands to list the repeated headers; see the attached file repeated_headers.txt.
grep ">" CAZyDB.07312019.fa > all_headers.txt
sort all_headers.txt | uniq -d > repeated_headers.txt
Is there a reason for the duplication, or was this an error?
Thanks, Zeya Xue