LimitlessGreen opened 10 months ago
You're right, it gets complicated for extremely large datasets. I've always been a fan of file-based result formats because they're easy to access (you can always open a text file, but you might need extra code to read an embedded format). However, I acknowledge that there is an upside to supporting other exchange formats. I'm not sure the analyzer is the right repo to implement these things, because it has evolved more into an application than a package or framework. The same goes for the Python bindings (#121), which I think could be a great feature, but maybe something that birdnetlib should cover. What do you think?
Thank you for your response :)
I'm not against results in single-file formats; I like them too. Databases might be beyond the scope of the framework, but the effort for bulk formats should be manageable. I quite appreciate the command-line-based approach of this repository. I'm thinking of an additional command-line parameter, such as analyze.py --rtype r --merge csv/parquet.
birdnetlib for bindings: the issue is that many functions have not been implemented there yet, for example outputting embeddings, training, or exporting segments.
In my opinion, the effort to produce combined CSV or Parquet files using libraries like Pandas or Dask should be limited.
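To illustrate, a merge step of the kind discussed above could be a small Pandas helper that collects the per-recording result CSVs into a single DataFrame. This is only a sketch: the column names in the example CSVs are hypothetical, and the real BirdNET-Analyzer result columns (from --rtype csv) would need to be used instead.

```python
from pathlib import Path

import pandas as pd


def load_results(result_dir: str) -> pd.DataFrame:
    """Read all per-recording result CSVs in a directory into one DataFrame.

    Assumes every CSV shares an identical header row. Header-only files
    (recordings with no detections) simply contribute zero rows.
    """
    frames = []
    for csv_path in sorted(Path(result_dir).glob("*.csv")):
        df = pd.read_csv(csv_path)
        df["source_file"] = csv_path.stem  # keep provenance of each recording
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


# One bulk file instead of 100k small ones (requires pyarrow or fastparquet):
# load_results("results/").to_parquet("results.parquet", index=False)
```

The `source_file` column preserves which recording each row came from, which would otherwise be lost when the individual files are merged.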
Overview
Hello there! I'm currently running experiments with a substantial number of recordings, exceeding 100k. Managing these experiments becomes unwieldy with over 100k individual result files, particularly since many of them may lack labels and contain only headers.
I primarily work with Python for data processing, and to streamline my workflow and reduce waiting times, I'm currently converting these extensive directories into single parquet files.
In this context, it would be immensely beneficial if BirdNET-Analyzer could natively support bulk formats such as Parquet. Alternatively, an even more powerful solution would be support for databases like SQLite. This enhancement would not only optimize storage by eliminating redundant headers but also facilitate efficient tracking of numerous experiments.
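As a rough sketch of the SQLite idea, detection rows could be appended to a single database via Pandas and the standard-library sqlite3 module. The table and column names here are purely illustrative, not anything BirdNET-Analyzer currently emits.

```python
import sqlite3

import pandas as pd


def append_to_db(df: pd.DataFrame, db_path: str, table: str = "detections") -> None:
    """Append detection rows to one SQLite database.

    A single database replaces thousands of small (often header-only) CSVs
    and makes per-experiment queries straightforward.
    """
    with sqlite3.connect(db_path) as conn:
        # if_exists="append" creates the table on first use, then adds rows
        df.to_sql(table, conn, if_exists="append", index=False)


# Example query afterwards:
# pd.read_sql("SELECT label, COUNT(*) FROM detections GROUP BY label", conn)
```

Because everything lands in one file, tracking many experiments reduces to adding an experiment-ID column and filtering on it in SQL.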
Requested Features
Benefits
I believe these enhancements would greatly benefit users dealing with extensive datasets and contribute to a more streamlined and efficient user experience. Looking forward to the possibility of incorporating these features into future releases!
Thank you a lot for this great project!
EDIT:
To give a perspective on space, I have an experiment where:
- 2.8 GB
- 115 MB