groupschoof / AHRD

High-throughput protein function annotation with Human Readable Descriptions (HRDs) and Gene Ontology (GO) Terms.
https://www.cropbio.uni-bonn.de/

What happened to the Berkeley-DB support? #19

Closed: asishallab closed this issue 1 year ago

asishallab commented 4 years ago

Previously AHRD wrote all reference descriptions, and optionally the reference Gene Ontology annotations, into a Berkeley-DB. As long as the reference sequence databases (e.g. UniProt trEMBL and SwissProt) don't change, this speeds up the annotation process from several hours to minutes, because the very large reference FASTA and/or GOA files don't need to be parsed again.
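In essence the feature is a persistent key-value cache mapping reference accessions to their descriptions. Below is a minimal sketch of that idea, assuming Berkeley DB Java Edition (`com.sleepycat.je`) on the classpath; the actual keys and schema used in the `berkely_db` branch may differ, and the cache directory name is made up:

```java
import com.sleepycat.je.*;

import java.io.File;
import java.nio.charset.StandardCharsets;

public class DescriptionCacheSketch {
    public static void main(String[] args) throws Exception {
        // JE requires the environment directory to exist; "ahrd-cache" is a made-up name.
        File home = new File("ahrd-cache");
        home.mkdirs();
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        Environment env = new Environment(home, envCfg);
        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        Database db = env.openDatabase(null, "reference-descriptions", dbCfg);

        // Populate once while parsing the reference FASTA headers:
        // accession -> human readable description.
        DatabaseEntry key = new DatabaseEntry("sp|P69905|HBA_HUMAN".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry val = new DatabaseEntry("Hemoglobin subunit alpha".getBytes(StandardCharsets.UTF_8));
        db.put(null, key, val);

        // On later AHRD runs, fetch descriptions for BLAST hits directly,
        // skipping the hours-long re-parse of trEMBL/SwissProt.
        DatabaseEntry found = new DatabaseEntry();
        if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(new String(found.getData(), StandardCharsets.UTF_8));
        }

        db.close();
        env.close();
    }
}
```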

Why has the Berkeley-DB support been removed? What can be done to reactivate it?

Timely feedback will be much appreciated.

FlorianBoecker commented 4 years ago

Dear Asis,

The Berkeley-DB support has not been removed. To be more precise, it was not merged into the "master" branch but is still available in the "berkely_db" branch. That branch is otherwise functionally still very close to "master", and users are welcome to use it at their own risk.

We see several drawbacks which led us to the decision not to pursue this feature further.

Best, Florian

asishallab commented 4 years ago

Dear @FlorianBoecker ,

thank you very much for the detailed answer and information.

I will briefly provide my feedback, based on real-life experience from using AHRD. I hope it helps to elaborate on and clear up any doubts.

Execution speed - including Database setup

You are right that the difference in execution time is not that great when testing AHRD on 1,000 proteins. However, a typical eukaryotic genome has 20k+ protein-coding genes, and AHRD was developed to be usable exactly in such high-throughput settings, where many other annotation tools fail to perform well. Thus, execution speed should be measured with 20k to 30k query proteins, and even a small gain in speed should be considered important.

Execution speed - once the DB exists

The database needs to be created only when the reference FASTA files have changed, which, on a regular compute cluster running proteome annotation pipelines, typically happens no more often than once a month. Once the database has been set up, AHRD's runtime shrinks to a few minutes. AHRD's most faithful user is the PGSB institute in Munich. They use AHRD entirely within the high-throughput context and surely don't mind spending a few GB (<< 50 GB) on the AHRD database if that speeds up the annotation process. The same goes for the IBG-4 at Forschungszentrum Jülich.

Database update

Because the database does not store information about the sequences themselves, it is not possible to infer the changes between an outdated version and a new one. Thus, the database as a whole needs to be re-created whenever the reference FASTA files have changed; strictly speaking, only when accessions or human-readable descriptions have changed. In any case, that is typically a once-a-month task that can be run right after the reference FASTAs are updated, e.g. scheduled as a CRON job.
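To make the rebuild-only-on-change step concrete, here is a small hypothetical helper (not part of AHRD; the class and file names are made up) that streams a reference FASTA through SHA-256 and compares the digest with the one recorded at the last rebuild. A scheduled CRON job could call it and re-create the Berkeley-DB only on a mismatch:

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.HexFormat;

public class ReferenceChangeCheck {
    /** Returns true (and records the new digest) if the FASTA changed since the last check. */
    static boolean referenceChanged(Path fasta, Path digestFile) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        // Stream the (possibly huge) FASTA instead of loading it into memory.
        try (InputStream in = new DigestInputStream(
                new BufferedInputStream(Files.newInputStream(fasta)), sha)) {
            in.transferTo(OutputStream.nullOutputStream());
        }
        String current = HexFormat.of().formatHex(sha.digest()); // HexFormat requires Java 17+
        String previous = Files.exists(digestFile) ? Files.readString(digestFile).trim() : "";
        if (current.equals(previous)) {
            return false; // unchanged: keep the existing Berkeley-DB
        }
        Files.writeString(digestFile, current);
        return true; // changed: re-create the database
    }

    public static void main(String[] args) throws Exception {
        if (referenceChanged(Path.of(args[0]), Path.of(args[0] + ".sha256"))) {
            System.out.println("Reference changed: rebuild the AHRD database.");
        }
    }
}
```

A whole-file digest is coarser than the "accessions or descriptions changed" criterion, so it may trigger the occasional unnecessary rebuild, but it errs on the safe side and needs no knowledge of the FASTA format.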

Database as an option

You are right: the Berkeley-DB feature is not optional in that branch. That might indeed be considered a drawback.

Need for the Database

Apart from the arguments listed above, AHRD could use already existing FASTA indexing and querying technology, e.g. as provided by legacy BLAST or BLAST+, namely fastacmd and blastdbcmd. Alas, this would not help when searching for reference Gene Ontology term annotations in EBI's GOA database files. However, if a user does not use AHRD's GO annotation feature, the above strategy might be a good alternative to using a database. In any case, to my knowledge the GO annotation feature has not been used by any high-throughput user so far.
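As a rough illustration of that alternative, the following hypothetical snippet shells out to BLAST+'s blastdbcmd to fetch the title (description) line of a single accession. It assumes the reference FASTA was formatted with `makeblastdb -parse_seqids`, and the database name used below is only an example:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class BlastDbLookup {
    /** Fetches the sequence title (description) for one accession via blastdbcmd. */
    static String fetchTitle(String blastDb, String accession)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "blastdbcmd", "-db", blastDb, "-entry", accession, "-outfmt", "%t")
                .redirectErrorStream(true)
                .start();
        String title;
        try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            title = out.readLine(); // %t prints the title on a single line
        }
        p.waitFor();
        return title;
    }

    public static void main(String[] args) throws Exception {
        // Example: look up one SwissProt accession in a database named "swissprot".
        System.out.println(fetchTitle("swissprot", "P69905"));
    }
}
```

For the 20k to 30k query scale discussed above, blastdbcmd's -entry_batch option, which reads a whole file of accessions in one invocation, would avoid spawning one process per query.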

Thanks and Cheers!