groupschoof / AHRD

High-throughput protein function annotation with Human Readable Descriptions (HRDs) and Gene Ontology (GO) Terms.
https://www.cropbio.uni-bonn.de/

What happened to the Berkeley-DB support? #19

Closed: asishallab closed this issue 1 year ago

asishallab commented 4 years ago

Previously AHRD wrote all reference descriptions, and optionally the reference Gene Ontology annotations, into a Berkeley-DB. As long as the reference sequence databases (e.g. UniProt trEMBL and SwissProt) don't change, this speeds up the annotation process from several hours to minutes, because the very large reference FASTA and/or GOA files don't need to be parsed again.
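In essence the feature is a persistent key-value cache mapping reference accessions to their descriptions. Below is a minimal sketch of that idea, assuming Berkeley DB Java Edition (`com.sleepycat.je`) on the classpath; the actual keys and schema used in the `berkely_db` branch may differ, and the cache directory name is made up:

```java
import com.sleepycat.je.*;

import java.io.File;
import java.nio.charset.StandardCharsets;

public class DescriptionCacheSketch {
    public static void main(String[] args) throws Exception {
        // JE requires the environment directory to exist; "ahrd-cache" is a made-up name.
        File home = new File("ahrd-cache");
        home.mkdirs();
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        Environment env = new Environment(home, envCfg);
        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        Database db = env.openDatabase(null, "reference-descriptions", dbCfg);

        // Populate once while parsing the reference FASTA headers:
        // accession -> human readable description.
        DatabaseEntry key = new DatabaseEntry("sp|P69905|HBA_HUMAN".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry val = new DatabaseEntry("Hemoglobin subunit alpha".getBytes(StandardCharsets.UTF_8));
        db.put(null, key, val);

        // On later AHRD runs, fetch descriptions for BLAST hits directly,
        // skipping the hours-long re-parse of trEMBL/SwissProt.
        DatabaseEntry found = new DatabaseEntry();
        if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(new String(found.getData(), StandardCharsets.UTF_8));
        }

        db.close();
        env.close();
    }
}
```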

Why has the Berkeley-DB support been removed? What can be done to reactivate it?

Timely feedback will be much appreciated.

FlorianBoecker commented 4 years ago

Dear Asis,

The Berkeley-DB support has not been removed. To be more precise, it was not merged into the "master" branch but is still available in the "berkely_db" branch. That branch is otherwise functionally still very close to "master", and users are welcome to use it at their own risk.

We see several drawbacks which led us to the decision not to pursue this feature further.

Best, Florian

asishallab commented 4 years ago

Dear @FlorianBoecker ,

thank you very much for the detailed answer and information.

I will briefly provide my feedback, based on real-life experience from using AHRD. I hope it helps to elaborate on and clear up any doubts.

Execution speed - including Database setup

You are right that the difference in execution time is not that great when testing AHRD on 1,000 proteins. However, a typical eukaryotic genome has 20k+ protein-coding genes, and AHRD was developed to be usable exactly in such high-throughput settings, where many other annotation tools fail to perform well. Thus, execution speed should be measured with 20k to 30k query proteins, and even a small gain in speed should be considered important.

Execution speed - once the DB exists

The database needs to be created only when the reference FASTA files have changed, which, on a regular compute cluster running proteome annotation pipelines, typically happens no more often than once a month. Once the database has been set up, AHRD's runtime shrinks to a few minutes. AHRD's most faithful user is the PGSB institute in Munich. They use AHRD entirely within the high-throughput context and surely don't mind spending a few GB (<< 50 GB) on the AHRD database if that speeds up the annotation process. The same goes for the IBG-4 at Forschungszentrum Jülich.

Database update

Because the database does not store information about the sequences themselves, it is not possible to infer the changes between an outdated version and a new one. Thus, the database as a whole needs to be re-created whenever the reference FASTA files have changed; strictly speaking, only when accessions or human-readable descriptions have changed. In any case, that is typically a once-a-month task that can be run right after the reference FASTAs are updated, e.g. scheduled as a CRON job.
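To make the rebuild-only-on-change step concrete, here is a small hypothetical helper (not part of AHRD; the class and file names are made up) that streams a reference FASTA through SHA-256 and compares the digest with the one recorded at the last rebuild. A scheduled CRON job could call it and re-create the Berkeley-DB only on a mismatch:

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.HexFormat;

public class ReferenceChangeCheck {
    /** Returns true (and records the new digest) if the FASTA changed since the last check. */
    static boolean referenceChanged(Path fasta, Path digestFile) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        // Stream the (possibly huge) FASTA instead of loading it into memory.
        try (InputStream in = new DigestInputStream(
                new BufferedInputStream(Files.newInputStream(fasta)), sha)) {
            in.transferTo(OutputStream.nullOutputStream());
        }
        String current = HexFormat.of().formatHex(sha.digest()); // HexFormat requires Java 17+
        String previous = Files.exists(digestFile) ? Files.readString(digestFile).trim() : "";
        if (current.equals(previous)) {
            return false; // unchanged: keep the existing Berkeley-DB
        }
        Files.writeString(digestFile, current);
        return true; // changed: re-create the database
    }

    public static void main(String[] args) throws Exception {
        if (referenceChanged(Path.of(args[0]), Path.of(args[0] + ".sha256"))) {
            System.out.println("Reference changed: rebuild the AHRD database.");
        }
    }
}
```

A whole-file digest is coarser than the "accessions or descriptions changed" criterion, so it may trigger the occasional unnecessary rebuild, but it errs on the safe side and needs no knowledge of the FASTA format.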

Database as an option

You are right: the Berkeley-DB feature is not optional in that branch. That might indeed be considered a drawback.

Need for the Database

Apart from the arguments listed above, AHRD could use already existing FASTA indexing and querying technology, e.g. as provided by legacy BLAST or BLAST+, namely fastacmd and blastdbcmd. Alas, this would not help when searching for reference Gene Ontology term annotations in EBI's GOA database files. However, if a user does not use AHRD's GO annotation feature, the above strategy might be a good alternative to using a database. In any case, to my knowledge the GO annotation feature has not been used by any high-throughput user so far.
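As a rough illustration of that alternative, the following hypothetical snippet shells out to BLAST+'s blastdbcmd to fetch the title (description) line of a single accession. It assumes the reference FASTA was formatted with `makeblastdb -parse_seqids`, and the database name used below is only an example:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class BlastDbLookup {
    /** Fetches the sequence title (description) for one accession via blastdbcmd. */
    static String fetchTitle(String blastDb, String accession)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "blastdbcmd", "-db", blastDb, "-entry", accession, "-outfmt", "%t")
                .redirectErrorStream(true)
                .start();
        String title;
        try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            title = out.readLine(); // %t prints the title on a single line
        }
        p.waitFor();
        return title;
    }

    public static void main(String[] args) throws Exception {
        // Example: look up one SwissProt accession in a database named "swissprot".
        System.out.println(fetchTitle("swissprot", "P69905"));
    }
}
```

For the 20k to 30k query scale discussed above, blastdbcmd's -entry_batch option, which reads a whole file of accessions in one invocation, would avoid spawning one process per query.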

Thanks and Cheers!