paidiver / paidiverpy

Create pipelines for preprocessing image data for biodiversity analysis.
Apache License 2.0

General data investigation conducted for each parameter of interest for processing #27

Open soutobias opened 1 month ago

soutobias commented 1 month ago

Conduct a general data investigation for each parameter of interest for processing, to determine (i) any images to be removed or subsetted, and (ii) criteria or threshold values for processing.

Think about the desired aim of this processing step with reference to the overall biodiversity aims of the project/deployment, and any ancillary aims (e.g., retaining the maximum number of images). Establish what would be considered successful processing and any criteria for testing that success.

soutobias commented 1 month ago

What:

General data investigation involves a thorough analysis of each parameter of interest related to image processing. This step is crucial for determining whether any images should be removed or subsetted and for establishing criteria or threshold values for processing. It helps in understanding the data distribution, identifying important patterns, and ensuring that the processing aligns with the overall objectives of the project.

Why:

Conducting a general data investigation ensures that the processing criteria are well-founded and tailored to the specific characteristics of the dataset. This is essential for achieving meaningful and accurate results, particularly in biodiversity studies where precise image analysis can impact the assessment of species and habitats. It also helps in setting appropriate thresholds and detecting any anomalies that could affect the quality and reliability of the processed data.

How:

  1. Define Aims and Success Criteria:

    • Determine the primary objective of the processing step in relation to the overall biodiversity goals of the project. For example, the aim could be to retain the maximum number of usable images while ensuring high quality.
    • Establish what constitutes successful processing, including specific criteria or metrics for evaluating success. For instance, successful processing might involve retaining images within a defined quality range or achieving certain statistical thresholds.
  2. Consider Conditions from Deployment Notes:

    • Review cruise reports or deployment notes for any conditions or occurrences that might have impacted the parameter of interest. For example, specific environmental factors, equipment malfunctions, or operational constraints should be considered, as they may affect image quality and influence decisions about image removal or special processing needs.
  3. Visualize Data:

    • Plot the parameter of interest across the deployment or image capture event, alongside camera position metrics (latitude, longitude, water depth, altitude above seabed) and camera performance metrics. Create histograms to visualize the distribution of parameter values.
    • Analyze patterns, shapes of curves, inflection points, and extreme values. Identify any trends or anomalies that might help in setting appropriate thresholds or criteria.
  4. Determine Thresholds and Criteria:

    • Use statistical methods to identify extremes, outliers, and acceptable ranges for the parameter of interest. For example, calculate mean, median, standard deviation, and percentiles to establish thresholds.
    • Plot images within and outside these thresholds to visually verify that the chosen criteria correctly distinguish between usable and non-usable images.
  5. Test and Refine Criteria:

    • Apply the established criteria to the image dataset and evaluate the processing step based on the defined success metrics. Assess whether the criteria effectively identify and handle images according to the project's goals.
    • Repeat the data investigation steps as necessary to refine the criteria. Reevaluate the processed images to ensure that the processing meets the success criteria and aligns with the project's aims.
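
Steps 4 and 5 above can be sketched in a few lines of Python. This is only an illustration using the standard library, not paidiverpy's API: the altitude values, the Tukey-style IQR fences, and the 70% retention target are all assumptions chosen for the example, and a real deployment would pick its own parameter, thresholding rule, and success criterion.

```python
import statistics

def summarize(values):
    """Descriptive statistics used to get a first feel for one parameter."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),
    }

def iqr_bounds(values, k=1.5):
    """Tukey-style fences [Q1 - k*IQR, Q3 + k*IQR] (one possible rule, assumed here)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def apply_bounds(values, lower, upper):
    """Return one boolean per image: True if the image would be kept."""
    return [lower <= v <= upper for v in values]

# Hypothetical altitude above seabed (m), one value per image.
altitude_m = [2.1, 2.3, 2.0, 2.2, 2.4, 9.8, 2.1, 2.2, 0.2, 2.3]

stats = summarize(altitude_m)
lower, upper = iqr_bounds(altitude_m)
keep = apply_bounds(altitude_m, lower, upper)

# Hypothetical success criterion: retain at least 70% of the images.
min_retained = 0.7
retained_fraction = sum(keep) / len(keep)
criterion_met = retained_fraction >= min_retained

print(f"median={stats['median']:.2f} m, bounds=({lower:.2f}, {upper:.2f}) m")
print(f"retained {sum(keep)}/{len(keep)} images, criterion met: {criterion_met}")
```

The images flagged `False` in `keep` are the ones to inspect visually before settling on the thresholds. If the retention criterion is not met, the rule (for example the value of `k`) can be adjusted and the check repeated.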

What to expect:

The outcome of this investigation should be a well-defined set of criteria or thresholds for processing images, informed by statistical analysis and visual verification. The processed dataset should reflect the quality and characteristics desired for achieving the project's biodiversity objectives.

What makes it difficult:

Success Metrics:

LoicVA commented 1 month ago

I would add here: identify and verify relationships with other variables. Is this Jen's document? Otherwise, I would rather correct that one directly.
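
The relationship check suggested in this comment could start with something as simple as a correlation between the parameter of interest and another recorded variable. The sketch below is an assumption-laden illustration, not part of paidiverpy: the variable names and values are hypothetical, and Pearson's r is just one possible measure (it only captures linear relationships).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)

# Hypothetical per-image values: altitude above seabed (m) and mean brightness (0-255).
altitude_m = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
brightness = [180, 160, 150, 120, 110, 90]

r = pearson_r(altitude_m, brightness)
print(f"Pearson r between altitude and brightness: {r:.2f}")
```

A strong correlation like the one in this made-up data would suggest that a threshold on one variable also constrains the other, which is worth verifying with a scatter plot before fixing the criteria.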