Open charlesfoster opened 1 month ago
I would recommend just using UniProt instead of NR. It’s much better maintained, especially with regard to versioning. Any annotations against NR are essentially unreproducible because NCBI does not version it.
I don’t plan on integrating the blast databases for the foreseeable future.
Expected Behavior
mmseqs download would be expected to download an up-to-date version of the target 'nr' and 'nt' databases.
Current Behavior
The download FASTA targets for the 'nr' and 'nt' databases are no longer being updated by NCBI. Explanation: focusing on 'NR' as an example, the URL in databases.sh points to https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz. The README in that FTP location states:
Indeed, the nr.gz file was last updated on 2024-02-07.
Looking in the parent directory, the various NR database files have been updated as recently as 2024-10-02. Therefore, the download targets for mmseqs2 are out of date by about 8 months, and this problem will get worse over time.
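For anyone who wants to check the staleness themselves, here is a minimal sketch. It assumes curl is installed and that the NCBI server returns a Last-Modified header for this URL. The helper name last_modified is made up for illustration and is not part of MMseqs2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Last-Modified value found in HTTP
# response headers read from stdin.
last_modified() {
    # Header names are case-insensitive, so compare in lower case.
    # Strip the header name and the trailing carriage return before printing.
    awk 'tolower($1) == "last-modified:" {
        sub(/^[^:]*:[ \t]*/, "")
        sub(/\r$/, "")
        print
        exit
    }'
}

# Usage (needs network access):
#   curl -sI https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz | last_modified
```

Comparing that date with the timestamps of the files in the parent directory shows how far the FASTA dump lags behind the BLAST-format volumes.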
NCBI's solution for users is to download the blast-format files and then generate their own FASTA files:
Obviously this isn't ideal for many users, and it's been getting at least some hate.
Solution
Unless NCBI backflips on their decision, the only solution would be to change the mmseqs databases workflow for these databases to have an intermediate (slow) step of extracting a FASTA file before the mmseqs createdb step is run. However, this would obviously require extra dependencies, i.e. blastdbcmd. Otherwise, I suppose you could host periodic builds of the databases on a server or something.
Just thought I should bring this to your attention in case you are unaware :smile:
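The intermediate extraction step proposed above could look roughly like the sketch below. It assumes BLAST+ (which provides update_blastdb.pl and blastdbcmd) and mmseqs are on the PATH. The helpers run and blastdb_to_mmseqs and the output names are made up for illustration, and it skips any taxonomy setup that mmseqs databases may do for these targets. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Run a command, or only print it when DRY_RUN=1 (the default here).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: fetch a preformatted BLAST database, dump it to
# FASTA, and convert the FASTA into an MMseqs2 database.
blastdb_to_mmseqs() {
    local db="$1"   # e.g. nr or nt

    # 1. Download and unpack the preformatted BLAST volumes.
    run update_blastdb.pl --decompress "$db"

    # 2. Dump every sequence to FASTA (the slow intermediate step).
    run blastdbcmd -db "$db" -entry all -out "$db.fasta"

    # 3. Build the MMseqs2 database from the extracted FASTA.
    run mmseqs createdb "$db.fasta" "${db}DB"
}

# Usage:
#   blastdb_to_mmseqs nr              # print the three commands
#   DRY_RUN=0 blastdb_to_mmseqs nr    # actually run them
```

The dry-run default is deliberate: the real download and extraction for 'nr' are large and slow, so it helps to inspect the commands first.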