salesforce / progen

Official release of the ProGen models
BSD 3-Clause "New" or "Revised" License

How is the sequence ID calculated in an efficient manner #16

Open eric-jm-lang opened 1 year ago

eric-jm-lang commented 1 year ago

Hello, in your excellent paper, a key aspect is the sequence identity between the artificial sequences and any known natural sequences. May I ask how this sequence identity is calculated efficiently, given that it requires searching entire databases for each sequence? Many thanks in advance.

jeffreyruffolo commented 1 year ago

These values are calculated using the MMseqs2 tool to find the closest matches between the generated sequences and the protein databases. We report the identity to the top database hit for each generated sequence.
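For anyone trying to reproduce this, here is a rough sketch of what "identity to the top database hit" could look like in practice. It assumes the search is run with `mmseqs easy-search` and its default BLAST-tab output (columns: query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits). The file names, the helper names `easy_search_command` and `top_hit_identity`, and the choice of bit score to pick the top hit are assumptions for illustration; the exact settings and databases used for the paper are not stated in this thread.

```python
from typing import Dict, Iterable, List, Tuple


def easy_search_command(query_fasta: str, target_db: str,
                        result_m8: str, tmp_dir: str) -> List[str]:
    """Build an `mmseqs easy-search` command line (not executed here).

    All four paths are hypothetical placeholders chosen by the caller.
    """
    return ["mmseqs", "easy-search", query_fasta, target_db, result_m8, tmp_dir]


def top_hit_identity(lines: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Return {query: (best_target, identity)} from BLAST-tab (m8) lines.

    Assumption: the "top hit" is the row with the highest bit score
    (column 12) for each query; identity is read from column 3.
    """
    best: Dict[str, Tuple[str, float, float]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, target = fields[0], fields[1]
        identity = float(fields[2])
        bits = float(fields[11])
        if query not in best or bits > best[query][2]:
            best[query] = (target, identity, bits)
    return {q: (t, ident) for q, (t, ident, _) in best.items()}


if __name__ == "__main__":
    # Hypothetical paths; run the printed command yourself, then parse the result.
    print(" ".join(easy_search_command("generated.fasta", "natural_db",
                                       "hits.m8", "tmp")))
    # with open("hits.m8") as handle:
    #     for query, (target, identity) in top_hit_identity(handle).items():
    #         print(query, target, identity)
```

Generated sequences with no hit above the search thresholds will simply be absent from the result, so they need to be handled separately when reporting identities.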