soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

mmseqs download will no longer get the most up-to-date nr/nt databases #893

Open charlesfoster opened 1 month ago

charlesfoster commented 1 month ago

Expected Behavior

mmseqs download would be expected to download an up-to-date version of the target 'nr' and 'nt' databases.

Current Behavior

The download FASTA targets for the 'nr' and 'nt' databases are no longer being updated by NCBI. Explanation: focusing on 'NR' as an example, the URL in databases.sh points to https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz. The README in that FTP location states:

In April 2024, the BLAST FASTA files in this directory will no longer be available. You can easily generate FASTA files yourself from the formatted BLAST databases by using the BLAST utility blastdbcmd that comes with the standalone BLAST programs. See NCBI Insights for more details https://ncbiinsights.ncbi.nlm.nih.gov/2024/01/25/blast-fasta-unavailable-on-ftp/

Indeed, the nr.gz file was last updated on 2024-02-07.

Looking in the parent directory, the various NR database files have been updated as recently as 2024-10-02. Therefore, the download targets for mmseqs2 are out of date by about 8 months, and this problem will get worse over time.

NCBI's solution for users is to download the blast-format files and then generate their own FASTA files:

  • Sequences in FASTA format can be generated from the pre-formatted databases by using the blastdbcmd utility;

Obviously this isn't ideal for many users, and it's been getting at least some hate.

Solution

Unless NCBI reverses its decision, the only solution would be to change the mmseqs databases workflow for these databases to include an intermediate (slow) step that extracts a FASTA file before the mmseqs createdb step is run. However, this would require an extra dependency, namely blastdbcmd. Alternatively, I suppose you could host periodic builds of the databases on a server.
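For anyone who needs a current NR in the meantime, the intermediate step could look roughly like the sketch below. It assumes BLAST+ (update_blastdb.pl and blastdbcmd) and mmseqs are on PATH; the DB, OUT and DRY_RUN variables and the function names are only illustrative and are not part of MMseqs2. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: build an MMseqs2 database from NCBI's pre-formatted BLAST database.
# Assumption: update_blastdb.pl, blastdbcmd (both BLAST+) and mmseqs are on PATH.
# DB, OUT and DRY_RUN are hypothetical names chosen for this example.
DB="${DB:-nr}"            # BLAST database name, e.g. nr or nt
OUT="${OUT:-${DB}DB}"     # name of the resulting MMseqs2 database
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

build_from_blastdb() {
    # 1. Download and unpack the pre-formatted BLAST database volumes.
    run update_blastdb.pl --decompress "$DB"
    # 2. Dump every sequence to FASTA (this is the slow step).
    run blastdbcmd -db "$DB" -entry all -out "$DB.fasta"
    # 3. Convert the FASTA file into an MMseqs2 sequence database.
    run mmseqs createdb "$DB.fasta" "$OUT"
}

build_from_blastdb
```

Running it with DRY_RUN=0 executes the three steps for real; the download and the extracted FASTA file both need a large amount of disk space.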

Just thought I should bring this to your attention in case you are unaware :smile:

milot-mirdita commented 1 month ago

I would recommend just using UniProt instead of NR. It’s much better maintained, especially with regard to versioning. Any annotations against the NR are essentially unreproducible due to the lack of versioning by the NCBI.

I don’t plan on integrating the blast databases for the foreseeable future.