peterjc / galaxy_blast

Galaxy wrappers for NCBI BLAST+ and related BLAST tools.

makeblastdb file output appears to be too big #122

Open balags1 opened 4 years ago

balags1 commented 4 years ago

Command used was: makeblastdb -in seq-contigs.fasta -out seqdb -parse_seqids -dbtype nucl

From a 4 MB FASTA file, this is creating BLAST databases of size 500 GB+. Is this normal? What could be wrong with what I am doing?

peterjc commented 4 years ago

That is not normal. Can you share the FASTA file? If it is private, could you email it to my Google account?

balags1 commented 4 years ago

It is happening with any standard genome FASTA file; it doesn't appear to be file-specific.

balags1 commented 4 years ago

NZ_CP015724.1.fasta.txt

The issue is with version 2.10.0+. I also have an older version, 2.2.3+, that doesn't produce these big files. Both are the Windows 64-bit versions of BLAST+. V2.10.0+ is creating .ndb and .ntf files that are 297 GB in size.
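
For anyone wanting to reproduce the per-file size report above, here is a minimal Python sketch that lists the size of each file makeblastdb wrote for a given output prefix. The function name `db_file_sizes` and the prefix `seqdb` (taken from the `-out seqdb` in the command above) are illustrative assumptions, not part of BLAST+ or the Galaxy wrappers.

```python
from pathlib import Path


def db_file_sizes(directory, prefix):
    """Return {suffix: size_in_bytes} for files named '<prefix>.*' in directory.

    Hypothetical helper for illustration. Assumes a single-volume database,
    so each suffix (e.g. '.ndb', '.nhr') appears only once.
    """
    sizes = {}
    for path in Path(directory).glob(prefix + ".*"):
        if path.is_file():
            # st_size is the logical file size reported by the file system
            sizes[path.suffix] = path.stat().st_size
    return sizes


if __name__ == "__main__":
    # Print the largest files first, sizes in GB
    report = db_file_sizes(".", "seqdb")
    for suffix, size in sorted(report.items(), key=lambda kv: -kv[1]):
        print(f"{suffix}\t{size / 1e9:.3f} GB")
```

Running this in the directory holding the database should show which extensions account for the bulk of the size.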

peterjc commented 4 years ago

Ah. I wonder if this is due to the new v5 BLAST database format? It would be surprising, but not impossible, that they are optimised for larger databases.

The Galaxy wrappers / provided BLAST database datatype don't actually know about the new extensions, but that is a separate problem:

https://github.com/peterjc/galaxy_blast/blob/master/datatypes/blast_datatypes/blast.py#L244

I have not made time to explore this yet - and have limited time this week due to childcare.
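
One way to test the v5-format hypothesis would be to rebuild the same database in the older v4 format and compare sizes. The sketch below only assembles the command line; it assumes the installed makeblastdb accepts a `-blastdb_version` option (worth confirming with `makeblastdb -help`), and the helper name `makeblastdb_command` is hypothetical.

```python
def makeblastdb_command(fasta, out, blastdb_version=4):
    """Build a makeblastdb argument list mirroring the command in this issue.

    Hypothetical helper for illustration. The -blastdb_version option is an
    assumption about the installed BLAST+ release; check 'makeblastdb -help'.
    """
    return [
        "makeblastdb",
        "-in", fasta,
        "-out", out,
        "-parse_seqids",
        "-dbtype", "nucl",
        "-blastdb_version", str(blastdb_version),
    ]


if __name__ == "__main__":
    # Show both commands so the v4 and v5 outputs can be compared side by side
    for version in (4, 5):
        cmd = makeblastdb_command("seq-contigs.fasta", f"seqdb_v{version}", version)
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`. If the v4 build is small and the v5 build is not, that would point at the new format rather than the input file.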

balags1 commented 4 years ago

Duly noted. From a resources perspective, we will stick to the prior version for the time being.