nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Embedded cache DB alternative #2774

Open pditommaso opened 2 years ago

pditommaso commented 2 years ago

Bug report

Nextflow task metadata is stored in a local embedded key-value database based on LevelDB.

This provides good performance; however, the LevelDB store has stability issues on specific hardware and file systems, which are a blocking factor for the affected users. See for example: https://github.com/nextflow-io/nextflow/issues/2377, https://github.com/nextflow-io/nextflow/issues/403, https://github.com/nextflow-io/nextflow/issues/351 and https://github.com/nextflow-io/nextflow/issues/309

The goal of this issue is to explore the use of lmdbjava as an alternative storage backend for Nextflow task metadata.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

bentsherman commented 1 year ago

I guess this one is resolved by the cloud cache #4097

I know we were also considering Parquet, but Parquet is a columnar storage format, so it wouldn't be a good fit for the task cache. Instead, any cache backend should be a true key-value store or at least row-based.

bentsherman commented 1 year ago

Although I see you were looking into LMDB. That should be a good choice. I can look into it if you want

pditommaso commented 1 year ago

I tried LMDB in the past and was not really convinced: weird API, native OS dependencies, and it does not even seem to be really maintained any more.

Maybe we should try a thin wrapper over classic SQLite or DuckDB.
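
A thin key-value wrapper over SQLite could look roughly like the sketch below. It is written in Python rather than Nextflow's own Groovy/Java so that it runs with only the standard library. The class name `SqliteTaskCache`, the `task_cache` table layout, and the `put`/`get`/`delete` methods are all hypothetical and do not reflect Nextflow's actual cache API.

```python
import sqlite3
from typing import Optional


class SqliteTaskCache:
    """Hypothetical key-value store for task metadata, backed by SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        # One row per task: the task hash is the key,
        # the serialized task record is the value.
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS task_cache ("
            "key BLOB PRIMARY KEY, value BLOB NOT NULL)"
        )
        self._conn.commit()

    def put(self, key: bytes, value: bytes) -> None:
        # INSERT OR REPLACE overwrites any previous entry for the same key.
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO task_cache (key, value) VALUES (?, ?)",
                (key, value),
            )

    def get(self, key: bytes) -> Optional[bytes]:
        # Point lookup by primary key; returns None when the key is absent.
        row = self._conn.execute(
            "SELECT value FROM task_cache WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def delete(self, key: bytes) -> None:
        with self._conn:
            self._conn.execute("DELETE FROM task_cache WHERE key = ?", (key,))

    def close(self) -> None:
        self._conn.close()
```

Used as `cache = SqliteTaskCache("cache.db"); cache.put(task_hash, record_bytes)`, this keeps the access pattern strictly key-based, so the SQL layer stays an implementation detail behind the wrapper.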

bentsherman commented 1 year ago

That's too bad, LMDB seems to have really good performance. SQLite could be a good option. Probably not DuckDB though, since it is designed for OLAP. It could be useful to export a cache DB to DuckDB for downstream analytics, but not during the pipeline execution.

pditommaso commented 1 year ago

Only a benchmark can really tell
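
As a starting point, a micro-benchmark for the SQLite option might be sketched as follows. This is only an illustration in Python using the standard library: the function name `bench_sqlite`, the `kv` table, the record size, and the random 16-byte keys are assumptions, not Nextflow's real workload. A meaningful comparison would run the same pattern against LevelDB and the other candidates, on the file systems where the problems were reported.

```python
import os
import sqlite3
import tempfile
import time


def bench_sqlite(n: int = 10_000, value_size: int = 512) -> dict:
    """Time n inserts and n point lookups against an on-disk SQLite file."""
    value = os.urandom(value_size)               # stand-in for a serialized task record
    keys = [os.urandom(16) for _ in range(n)]    # stand-in for task hashes

    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "cache.db"))
        conn.execute("CREATE TABLE kv (key BLOB PRIMARY KEY, value BLOB NOT NULL)")

        # Write phase: all inserts in one transaction.
        # Committing after every write is typically much slower on disk.
        start = time.perf_counter()
        with conn:
            conn.executemany(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                ((k, value) for k in keys),
            )
        write_s = time.perf_counter() - start

        # Read phase: one primary-key lookup per stored key.
        start = time.perf_counter()
        hits = 0
        for k in keys:
            if conn.execute("SELECT value FROM kv WHERE key = ?", (k,)).fetchone():
                hits += 1
        read_s = time.perf_counter() - start

        conn.close()

    return {"n": n, "hits": hits, "write_s": write_s, "read_s": read_s}


if __name__ == "__main__":
    print(bench_sqlite())
```

The timings depend heavily on the storage device and on how writes are batched, so the numbers are only comparable between backends when both are measured on the same machine with the same commit strategy.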