kahst / BirdNET-Analyzer

BirdNET analyzer for scientific audio data processing.

Support for Bulk Formats / Databases #230

Open LimitlessGreen opened 10 months ago

LimitlessGreen commented 10 months ago

Overview

Hello there! 👋 I'm currently running experiments on more than 100k recordings. Managing these experiments becomes unwieldy with over 100k individual files, particularly since many of them may lack labels and contain only headers.

I primarily work with Python for data processing, and to streamline my workflow and reduce waiting times, I'm currently converting these extensive directories into single parquet files.

In this context, it would be immensely beneficial if BirdNET-Analyzer could natively support bulk formats such as parquet. Alternatively, an even more powerful solution would be support for databases like SQLite. This enhancement would not only optimize storage by eliminating redundant headers but also make it easier to track numerous experiments.

Requested Features

  1. Bulk Format Support: Integration of bulk formats, such as parquet, to enable more efficient handling of large datasets.
  2. Database Support: Introduction of support for databases, specifically SQLite, to enhance the organization and tracking of experiments.

Benefits

  1. Storage Optimization: Reducing the number of files by utilizing bulk formats or databases would significantly optimize storage space, especially when many files have minimal content (e.g., only headers).
  2. Enhanced Speed: Accessing and processing a smaller number of larger files generally results in improved speed.
  3. Efficient Experiment Management: Database support, particularly with SQLite, would simplify the process of managing and tracking multiple experiments, enhancing overall workflow efficiency.
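To make the database idea concrete, here is a rough sketch of what experiment tracking in SQLite could look like, using only Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the table layout, the column names, and the helper functions (open_store, add_experiment, add_detections, species_counts) are hypothetical and not part of BirdNET-Analyzer.

```python
import sqlite3

# Hypothetical schema: one row per experiment, one row per detection.
# Recordings without detections simply produce no rows, so there are
# no header-only files to store.
SCHEMA = """
CREATE TABLE IF NOT EXISTS experiments (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    params TEXT
);
CREATE TABLE IF NOT EXISTS detections (
    id INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiments(id),
    recording TEXT NOT NULL,
    start_s REAL NOT NULL,
    end_s REAL NOT NULL,
    species TEXT NOT NULL,
    confidence REAL NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_detections_experiment
    ON detections (experiment_id, species);
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_experiment(conn, name, params=""):
    """Register an experiment and return its id."""
    cur = conn.execute(
        "INSERT INTO experiments (name, params) VALUES (?, ?)", (name, params)
    )
    conn.commit()
    return cur.lastrowid


def add_detections(conn, experiment_id, rows):
    """Insert (recording, start_s, end_s, species, confidence) tuples."""
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT INTO detections "
            "(experiment_id, recording, start_s, end_s, species, confidence) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            [(experiment_id, *row) for row in rows],
        )


def species_counts(conn, experiment_id, min_confidence=0.0):
    """Count detections per species for one experiment."""
    return conn.execute(
        "SELECT species, COUNT(*) FROM detections "
        "WHERE experiment_id = ? AND confidence >= ? "
        "GROUP BY species ORDER BY species",
        (experiment_id, min_confidence),
    ).fetchall()
```

With a layout like this, comparing experiments becomes a query over experiment_id instead of a walk over thousands of result files.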

I believe these enhancements would greatly benefit users dealing with extensive datasets and contribute to a more streamlined and efficient user experience. Looking forward to the possibility of incorporating these features into future releases! 🚀

Thank you a lot for this great project! 👍


EDIT:

To give a perspective of space: I have an experiment, where:

kahst commented 10 months ago

You're right, it gets kind of complicated for extremely large datasets. I've always been a fan of file-based results formats because they're easy to access (i.e. you can always open a text file, but might need extra code to read an embedded format). However, I acknowledge that there is an upside in supporting other exchange formats. I'm not sure if the analyzer is the right repo to implement these things, because it evolved more into an application rather than a package or framework. The same goes for the Python bindings (#121) which I think could be a great feature, but maybe something that birdnetlib should cover. What do you think?

LimitlessGreen commented 10 months ago

Thank you for your response :)

I'm not against results in single-file formats. I also like them. Databases might go beyond the scope of this repository, but the effort for bulk formats should be manageable. I quite appreciate the command-line-based approach of this repository. I'm thinking of an additional command-line parameter, such as analyze.py --rtype r --merge csv/parquet.

birdnetlib for bindings: ✔️ The issue is that many functions there have not been implemented yet, for example, the ability to output embeddings, training, or exporting segments.

In my opinion, the effort for producing combined CSV data or Parquet files using libraries like Pandas or Dask should be limited.
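As a rough illustration of how small the merge step could be, here is a minimal sketch that concatenates per-recording result files into one CSV. It uses Python's standard csv module instead of Pandas or Dask so that it runs without extra dependencies. The function name merge_result_csvs and the added source_file column are hypothetical, and the sketch assumes every result file is a CSV with the same header; none of this is existing BirdNET-Analyzer behavior.

```python
import csv
from pathlib import Path


def merge_result_csvs(result_dir, merged_path, source_column="source_file"):
    """Concatenate every *.csv under result_dir into one CSV file.

    A column named source_column records which file each row came from.
    Header-only files contribute no rows. Returns the number of data
    rows written.
    """
    result_dir = Path(result_dir)
    merged_path = Path(merged_path)
    rows_written = 0
    writer = None
    with merged_path.open("w", newline="", encoding="utf-8") as out:
        for path in sorted(result_dir.rglob("*.csv")):
            if path.resolve() == merged_path.resolve():
                continue  # do not read the output file back in
            with path.open(newline="", encoding="utf-8") as src:
                reader = csv.DictReader(src)
                if reader.fieldnames is None:
                    continue  # completely empty file, not even a header
                if writer is None:
                    # The first file with a header defines the merged columns.
                    fieldnames = [source_column, *reader.fieldnames]
                    writer = csv.DictWriter(
                        out, fieldnames=fieldnames, extrasaction="ignore"
                    )
                    writer.writeheader()
                for row in reader:
                    row[source_column] = path.relative_to(result_dir).as_posix()
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

With Pandas installed, the merged rows could instead be written to parquet via DataFrame.to_parquet, which additionally requires pyarrow or fastparquet.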