inbo / aspbo

The alien species portal backoffice contains automated data preparation scripts for the [alien species portal](https://github.com/inbo/alien-species-portal)

create flow to download cubes from gbif #18

Open SanderDevisscher opened 1 year ago

SanderDevisscher commented 1 year ago

@damianooldoni what is the status on downloading the species cubes directly from GBIF?

damianooldoni commented 1 year ago

Hi @SanderDevisscher: unfortunately this is not planned for 2023 yet, my mistake. But they can create cubes on demand if you need them.

I will ask for an occurrence cube. Do the constraints and granularity below meet your requirements?

CONSTRAINTS:

  1. taxonomic: everything
  2. time: everything
  3. space: Belgium

GRANULARITY:

  1. taxonomic: taxa as mentioned in the GRIIS checklist (some are species, some subspecies, some are genus...)
  2. time: year
  3. space: 1x1km EEA grid

If they are not able to provide it (very unlikely), then I will produce it as I did in previous years.

SanderDevisscher commented 1 year ago

Hi @SanderDevisscher: unfortunately this is not planned for 2023 yet, my mistake. But they can create cubes on demand if you need them.

Too bad, it would have been nice to write the flow now the way it will eventually work.

I will ask for an occurrence cube. Do the constraints and granularity below meet your requirements?

CONSTRAINTS:

  1. taxonomic: everything
  2. time: everything
  3. space: Belgium

GRANULARITY:

  1. taxonomic: taxa as mentioned in the GRIIS checklist (some are species, some subspecies, some are genus...)
  2. time: year
  3. space: 1x1km EEA grid

I think so. I essentially need an update of the alienSpecies cubes be_alientaxa_cube.csv & be_classes_cube.csv as done in trias-project/occ-cube-alien: new species due to updates of the GRIIS checklist and new observations due to the passing of time.

If they are not able to provide it (very unlikely), then I will produce it as I did in previous years.

Thanks

damianooldoni commented 1 year ago

FYI, you can follow my request to GBIF here: https://github.com/gbif/occurrence-cube/issues/3

SanderDevisscher commented 1 month ago

It took a bit longer than expected, but I think it was worth it 🙂 GitHub page: https://damianooldoni.github.io/b3cubes-sql-examples/ GitHub repo: https://github.com/damianooldoni/b3cubes-sql-examples

As mentioned on the homepage itself, the idea is to add this documentation to the official https://docs.b-cubed.eu/. I have opened an issue about this (#44).

Enjoy cubing!

posted by @damianooldoni on 2024-09-11 in google chat

damianooldoni commented 1 month ago

Update: there is an rgbif function to trigger the SQL API: occ_download_sql() 🥳 I would not change the written tutorials. I would add a new, short one, I suppose.
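For context, a minimal sketch of triggering such a download from R, assuming a recent rgbif with occ_download_sql() and GBIF credentials in the GBIF_USER, GBIF_PWD and GBIF_EMAIL environment variables; the SQL only mirrors the constraints discussed above (Belgium, counts per year and species key) and the column casing may need adjusting to what the SQL API expects:

```r
library(rgbif)

# Illustrative query: yearly occurrence counts per species in Belgium.
# The real cube also groups by the 1x1 km EEA grid cell and is restricted
# to the GRIIS taxa; both are omitted here for brevity.
query <- "
  SELECT \"year\", specieskey, COUNT(*) AS occurrences
  FROM occurrence
  WHERE countrycode = 'BE'
  GROUP BY \"year\", specieskey
"

key <- occ_download_sql(query)   # submit the SQL download request

occ_download_wait(key)           # poll GBIF until the download is ready
cube <- occ_download_get(key) |> # fetch the zip file
  occ_download_import()          # read the table into a data.frame
```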

SanderDevisscher commented 1 month ago

Thanks for the update 😄

SanderDevisscher commented 1 month ago

To pass parameters from a GitHub Actions runner to an R Markdown (.Rmd) file, you can utilize the params feature in R Markdown along with the rmarkdown::render() function in your GitHub Actions workflow. Here’s how to set it up:
Step-by-Step Setup
1. Modify Your R Markdown Document
First, ensure your R Markdown file is set up to accept parameters. You can define parameters in the YAML header of your .Rmd file like this:
```yaml
---
title: "Parameterized Report"
output: html_document
params:
  my_param: NULL
---
```

Then use the parameter in a code chunk:

```{r}
# Use the parameter in your analysis
cat("The parameter value is:", params$my_param, "\n")
```

2. Create or Edit Your GitHub Actions Workflow

Next, create or edit a workflow YAML file in your repository (e.g., `.github/workflows/render-rmd.yml`). Below is an example of how to set up the workflow to pass parameters to the R Markdown document:

```yaml
name: Render R Markdown

on:
  push:
    branches:
      - main

jobs:
  render:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Set up R
        uses: r-lib/actions/setup-r@v2

      - name: Set up pandoc   # rmarkdown needs pandoc to render HTML output
        uses: r-lib/actions/setup-pandoc@v2

      - name: Install dependencies
        run: |
          R -e 'install.packages("rmarkdown")'

      - name: Render R Markdown with parameters
        run: |
          Rscript -e 'rmarkdown::render("path/to/your_file.Rmd", params = list(my_param = "value1"))'
```

3. Explanation of Workflow Steps

  • Checkout repository: checks out your repository so that the workflow can access your .Rmd file.
  • Set up R: installs R on the runner.
  • Set up pandoc: installs pandoc, which rmarkdown needs to render HTML output.
  • Install dependencies: installs any necessary packages, including rmarkdown.
  • Render R Markdown with parameters: runs an R command to render the .Rmd file and passes the parameter my_param with a specific value (e.g., "value1").

4. Using Different Parameters

If you want to run the same script with different parameters in parallel, you can use a job matrix. Here’s how to modify the workflow:

```yaml
name: Render R Markdown with Parameters

on:
  push:
    branches:
      - main

jobs:
  render:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        param_value: ["value1", "value2", "value3"]  # Different parameter values

    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Set up R
        uses: r-lib/actions/setup-r@v2

      - name: Set up pandoc   # rmarkdown needs pandoc to render HTML output
        uses: r-lib/actions/setup-pandoc@v2

      - name: Install dependencies
        run: |
          R -e 'install.packages("rmarkdown")'

      - name: Render R Markdown with parameter
        run: |
          Rscript -e 'rmarkdown::render("path/to/your_file.Rmd", params = list(my_param = "${{ matrix.param_value }}"))'
```
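
As a variation, the matrix value can also be handed to R through an environment variable and used to give each job's report a distinct name. A minimal sketch, assuming the workflow step sets a hypothetical MY_PARAM variable via env: MY_PARAM: ${{ matrix.param_value }}; the output naming is illustrative:

```r
# Read the parameter set by the workflow; fall back to a default locally
param_value <- Sys.getenv("MY_PARAM", unset = "default")

rmarkdown::render(
  "path/to/your_file.Rmd",
  params = list(my_param = param_value),
  # one output file per matrix value, so the reports don't overwrite each other
  output_file = paste0("report_", param_value, ".html")
)
```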

Summary
By following these steps, you can effectively pass parameters from a GitHub Actions runner to an R Markdown document. This setup allows for dynamic report generation based on different input values, enhancing automation and flexibility in your workflows.

SanderDevisscher commented 1 month ago

In Workflow A (Upload Artifact):

Create a job that generates the .csv file and uploads it as an artifact.

```yaml
name: Workflow A
on: push

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Generate CSV
        run: |
          echo "Column1,Column2" > output.csv
          echo "Data1,Data2" >> output.csv

      - name: Upload CSV as artifact
        uses: actions/upload-artifact@v2
        with:
          name: my-csv-file
          path: output.csv
```

In Workflow B (Download Artifact):

Set up a workflow that triggers on the completion of Workflow A and downloads the artifact.

```yaml
name: Workflow B
on:
  workflow_run:
    workflows: ["Workflow A"]
    types:
      - completed

jobs:
  download:
    runs-on: ubuntu-latest
    steps:
      - name: Download CSV artifact
        uses: actions/download-artifact@v2
        with:
          name: my-csv-file

      - name: Use CSV file
        run: |
          cat output.csv  # Replace with your processing logic.
```

SanderDevisscher commented 1 month ago

This is where I want to end up:

(attached diagram: aspbo_flows-get_occ_cube.drawio)

SanderDevisscher commented 1 month ago

@damianooldoni I'm not an invited test user 😭 I get Error: Currently limited to invited test users when using occ_download_sql

SanderDevisscher commented 1 month ago

@damianooldoni should we look into publishing it on Zenodo as well?

damianooldoni commented 1 month ago

Yes, I think so. Do you apply anything more than binding rows and renaming some column names across the cubes? If so, it seems at first glance like overkill. However, if we don't, we cannot:

  • provide a unique DOI for reference
  • describe the steps to unify the cubes

I am just wondering if I could merge the SQL queries to have one big cube... I could do an attempt in November.

SanderDevisscher commented 1 month ago

@damianooldoni is there a reason to use continent = 'EUROPE' instead of countrycode = 'BE' in the query?

SanderDevisscher commented 1 month ago

Yes, I think so. Do you apply anything more than binding rows and renaming some column names across the cubes? If so, it seems at first glance like overkill. However, if we don't, we cannot:

  • provide a unique DOI for reference
  • describe the steps to unify the cubes

Currently this flow only downloads the cubes and unifies them into one, i.e. the GitHub Action flow consists of 3 jobs: (1) build queries based on the GRIIS checklist, (2) download the cubes in parallel and finally (3) compile the cubes into one (a sketch of this compile step follows below). Upon completion a second GitHub Action will be triggered to join the expanded grid info (commune, province, isFlanders, isWallonia, isBrussels), upload the modified cube to the S3 bucket of the alienspeciesportal and create the timeseries.

I could create a third GitHub Action, triggered upon completion of the first, to upload the unified cube to Zenodo.

I am just wondering if I could merge the SQL queries to have one big cube... I could do an attempt in November.

I've written my code to download the cubes in parallel in the hope of staying within the limits of GitHub Actions.
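
A minimal sketch of what that compile step amounts to, assuming the downloaded cube parts sit as CSV files in a local folder; the folder name and any renaming are illustrative, not the flow's actual code:

```r
library(dplyr)
library(readr)

# Hypothetical folder holding the cube parts downloaded by the parallel jobs
cube_files <- list.files("cube_parts", pattern = "\\.csv$", full.names = TRUE)

# Read all parts and bind them into one cube
compiled_cube <- cube_files |>
  lapply(read_csv, show_col_types = FALSE) |>
  bind_rows()

# Write the unified cube; the portal expects a file like be_alientaxa_cube.csv
write_csv(compiled_cube, "be_alientaxa_cube.csv")
```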

SanderDevisscher commented 1 month ago

continent = 'EUROPE' or countrycode = 'BE', that's the question?

damianooldoni commented 1 month ago

I think I left continent = 'EUROPE' instead of countrycode = 'BE' because during TrIAS there was a need to run species modelling for a subset of species at the EU level (done by Amy Davis, post-doc at UGent working with @DiederikStrubbe). At the time I did this by creating two different workflows. I actually have no idea whether she ever used those cubes at continental level. I think you have to discuss this with @soriadelva, who is/was working on Amy's scripts (I think so), @DiederikStrubbe and @timadriaens.

So, the choice is yours. Note that you can leave the query at continent level and then filter for Belgium (see the sketch below). This cube could be useful for other researchers within or outside INBO and the B-Cubed project. Of course, it is many times bigger.
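
A minimal sketch of such a post-filter, assuming the compiled European cube has an eeaCellCode column and that a list of Belgian 1x1 km EEA cell codes is available; the file names are hypothetical:

```r
library(dplyr)
library(readr)

# Hypothetical compiled cube at European level
europe_cube <- read_csv("europe_cube.csv", show_col_types = FALSE)

# Hypothetical reference file listing the EEA 1x1 km cells covering Belgium
be_cells <- read_csv("be_grid.csv", show_col_types = FALSE)$eeaCellCode

# Keep only the Belgian part of the cube
be_cube <- europe_cube |>
  filter(eeaCellCode %in% be_cells)
```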

SanderDevisscher commented 1 month ago

I think I left continent = 'EUROPE' instead of countrycode = 'BE' because during TrIAS there was a need to run species modelling for a subset of species at the EU level (done by Amy Davis, post-doc at UGent working with @DiederikStrubbe). At the time I did this by creating two different workflows. I actually have no idea whether she ever used those cubes at continental level. I think you have to discuss this with @soriadelva, who is/was working on Amy's scripts (I think so), @DiederikStrubbe and @timadriaens.

So, the choice is yours. Note that you can leave the query at continent level and then filter for Belgium. This cube could be useful for other researchers within or outside INBO and the B-Cubed project. Of course, it is many times bigger.

OK, I'll keep continent = 'EUROPE' for now, do some testing, and if it doesn't cause too much of a delay I'll leave it as is.

Note to self, still to do:

SanderDevisscher commented 3 weeks ago

Tests are ongoing, see https://github.com/inbo/aspbo/actions/workflows/get_occ_cube.yaml

SanderDevisscher commented 3 weeks ago

dammit:

```
Error:
! A download limitation is exceeded:
User *** has too many simultaneous downloads; the limit is 3.
Please wait for some to complete, or cancel any unwanted downloads. See your user page.
```

SanderDevisscher commented 3 weeks ago

Idea to fix issue:

```r
library(rgbif)

# Start above the limit so the loop runs at least once
num_active_downloads <- 3

while (num_active_downloads >= 3) {

  # occ_download_list() returns a list with $meta and $results;
  # the per-download status column lives in $results
  downloads <- occ_download_list()$results

  # Count the active downloads (status RUNNING or PREPARING)
  num_active_downloads <- sum(downloads$status %in% c("RUNNING", "PREPARING"))

  print(paste("Number of active downloads:", num_active_downloads))

  Sys.sleep(60)
}
```

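To connect this to the flow, a hedged usage sketch: wrap the polling loop in a helper and call it before each cube request, so no more than 3 downloads are triggered at once (the function name and the query object are illustrative, not the flow's actual code):

```r
wait_for_download_slot <- function(max_active = 3, poll_seconds = 60) {
  repeat {
    # downloads table returned by occ_download_list(); status lives in $results
    dl <- rgbif::occ_download_list()$results
    n_active <- sum(dl$status %in% c("RUNNING", "PREPARING"))
    message("Active downloads: ", n_active)
    if (n_active < max_active) break
    Sys.sleep(poll_seconds)
  }
}

# Illustrative use before submitting one of the per-rank cube queries
wait_for_download_slot()
# download_key <- rgbif::occ_download_sql(query)  # 'query' built earlier in the flow
```
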
SanderDevisscher commented 2 weeks ago

This works: (attached diagram: aspbo flow)

SanderDevisscher commented 2 weeks ago

The flow works, but there is something wrong with the cube preprocessing.

SanderDevisscher commented 1 week ago

@damianooldoni I think there is something wrong with either the query or the rgbif function. I'll try to explain using Monomorium pharaonis (1314773) as an example. This species is present in all steps of the flow (queries, download & compiled cube), however none of its grid cells are in Belgium, while the species has at least 53 occurrences in Belgium according to GBIF. This means it gets erroneously flagged as having no occurrences in Belgium. I can't figure out what goes wrong 🤔 (a quick check from R is sketched below)
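
A quick way to double-check the raw GBIF data for that taxon from R, as a minimal sketch (the taxon key comes from the comment above; occ_count() and occ_search() are standard rgbif functions):

```r
library(rgbif)

# Monomorium pharaonis, GBIF taxonKey 1314773
key <- 1314773

# How many occurrences does GBIF report for this taxon in Belgium?
occ_count(taxonKey = key, country = "BE")

# Inspect a few Belgian records, e.g. their identificationVerificationStatus
# and coordinateUncertaintyInMeters, to see why they might drop out of the cube
res <- occ_search(taxonKey = key, country = "BE", limit = 10)
res$data
```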

damianooldoni commented 1 week ago

Common issues:

  • the occurrences in Belgium are filtered out, e.g. they are unverified. You can see in the FILTER section of the SQL query which occurrences are filtered out at the moment.
  • coordinate uncertainty of the Belgian occurrences is so high that occurrences, randomly assigned within the uncertainty circle, fall in cells outside Belgium. I would add a filter on the maximum allowed coordinate uncertainty, something in the range of 10 km.

SanderDevisscher commented 1 week ago

Common issues:

  • the occurrences in Belgium are filtered out, e.g. they are unverified. You can see in the FILTER section of the SQL query which occurrences are filtered out at the moment.

I've noticed most of the eligible occurrences lack a verification status altogether, therefore I've added OR identificationVerificationStatus IS NULL to the verification status filter. So the final filter looks something like:

```sql
AND (LOWER(identificationVerificationStatus) NOT IN (
      'unverified',
      'unvalidated',
      'not validated',
      'under validation',
      'not able to validate',
      'control could not be conclusive due to insufficient knowledge',
      'uncertain',
      'unconfirmed',
      'unconfirmed - not reviewed',
      'validation requested'
      ) OR identificationVerificationStatus IS NULL)
```

In other words, I want all occurrences whose verification status, if provided, is not in that list, plus those that have no verification status at all.

  • coordinate uncertainty of the Belgian occurrences is so high that occurrences, randomly assigned within the uncertainty circle, fall in cells outside Belgium. I would add a filter on the maximum allowed coordinate uncertainty, something in the range of 10 km.

I've also added AND coordinateUncertaintyInMeters <= 10000 to the filter.

🤞

SanderDevisscher commented 1 week ago

@damianooldoni adding/modifying these filters still does not remove the species from the list of species without occurrences in Belgium, but it did increase the number of "infected" grid cells from 15,558,095 to 15,563,674.

SanderDevisscher commented 1 week ago

@damianooldoni the base queries can be found here: .data/input/.... Every rank has its own base query in which the nubKeys are gsubbed (see the sketch below). The queries used in the last run, with a verification filter for all species, can be found here: https://github.com/inbo/aspbo/actions/runs/11890162263/artifacts/2200658319.
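
For readers unfamiliar with that step, a minimal sketch of gsub-based query building, assuming the base query contains a placeholder token for the keys; the placeholder name, column casing and key value are illustrative, not the actual content of the files in .data/input/:

```r
# Hypothetical base query for one rank, with a placeholder for the nubKeys
base_query <- "
  SELECT \"year\", specieskey, COUNT(*) AS occurrences
  FROM occurrence
  WHERE countrycode = 'BE' AND specieskey IN (NUBKEYS_PLACEHOLDER)
  GROUP BY \"year\", specieskey
"

# nubKeys for this rank, taken from the GRIIS checklist (illustrative value)
nub_keys <- c(1314773)

# Substitute the placeholder with a comma-separated list of keys
query <- gsub(
  "NUBKEYS_PLACEHOLDER",
  paste(nub_keys, collapse = ", "),
  base_query,
  fixed = TRUE
)

cat(query)
```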

SanderDevisscher commented 1 week ago

@damianooldoni an update: passing the species that were omitted under the verification status filter through a query without that filter made most of them come through. However, some issues remain (for example Epuraea imperialis (Reitter, 1877), which does have some eligible occurrences). Also, this approach might be too loose and therefore not preferred. Do you have any notion of limits on the occ_download_sql() function, like the number of occurrences used or the number of species provided?

damianooldoni commented 1 week ago

Thanks @SanderDevisscher. Your clear description of the issue and of your attempts will help me a lot. I plan to tackle this issue on Monday! I hope to come back here with a solution, an elegant one 😄

SanderDevisscher commented 11 hours ago

@damianooldoni did you find the cause of the issue? And a solution?

damianooldoni commented 10 hours ago

I worked on it on Monday and still didn't find it. Monday evening I did some more attempts, but making cubes via GBIF took hours, so I couldn't get much further. I will book some time tomorrow. Sorry for the delay. One thing is sure: adding the identificationVerificationStatus IS NULL condition helps to get far more occurrences included. Can you please send me the links of the artifacts with and without identificationVerificationStatus IS NULL in the SQL query? Thanks.

SanderDevisscher commented 10 hours ago

@damianooldoni thanks for trying 😄. The artifacts can be found at the bottom of these pages; "artifact" contains the queries, the other artifacts contain the respective cube parts or the compiled cube. With the identificationVerificationStatus IS NULL filter: https://github.com/inbo/aspbo/actions/runs/11889150140 Without the identificationVerificationStatus IS NULL filter: https://github.com/inbo/aspbo/actions/runs/11743019380

damianooldoni commented 10 hours ago

Thanks a lot.

Ah, I forgot to mention what I think about some possible issues. I don't think there is any problem with the length of the query: if it were too long it would be truncated and would not pass validation (= no download).