SanderDevisscher opened 1 year ago
@damianooldoni what is the status on downloading the species cubes directly from GBIF?
Hi @SanderDevisscher: unfortunately this is not planned for 2023 yet, my mistake. But they can create cubes on demand if you need them.
Too bad, it would have been nice to write it as it will eventually become.
I will ask for an occurrence cube. Are the specifications below within your requirements?
CONSTRAINTS:
- taxonomic: everything
- time: everything
- space: Belgium
GRANULARITY:
- taxonomic: taxa as mentioned in the GRIIS checklist (some are species, some subspecies, some genera...)
- time: year
- space: 1x1km EEA grid
I think so. I essentially need an update of the alienSpecies cubes be_alientaxa_cube.csv & be_classes_cube.csv (new species due to updates of the GRIIS checklist & new observations due to the passing of time), as done in trias-project/occ-cube-alien.
If they will not be able to provide it (very unlikely), then I will produce it as I did the last years.
Thanks
FYI, you can follow my request to GBIF here: https://github.com/gbif/occurrence-cube/issues/3
It took a bit longer than expected, but I think it was worth it 🙂 GitHub page: https://damianooldoni.github.io/b3cubes-sql-examples/ GitHub repo: https://github.com/damianooldoni/b3cubes-sql-examples
As mentioned on the homepage itself, the idea is to add this documentation to the official https://docs.b-cubed.eu/. I have opened an issue about this (#44).
Enjoy cubing!
posted by @damianooldoni on 2024-09-11 in google chat
Update: there is an rgbif function to trigger the SQL API: occ_download_sql()
🥳
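For reference, a minimal sketch of calling it (assuming GBIF credentials are available as the environment variables GBIF_USER, GBIF_PWD and GBIF_EMAIL, as rgbif expects; the query is illustrative, not the one used in this flow):

```r
library(rgbif)

# Illustrative SQL query: yearly occurrence counts per species in Belgium
query <- "SELECT specieskey, year, COUNT(*) AS n
          FROM occurrence
          WHERE countrycode = 'BE'
          GROUP BY specieskey, year"

d <- occ_download_sql(query)  # trigger the download via the SQL API
occ_download_wait(d)          # poll GBIF until the download has finished
occ_download_get(d)           # fetch the resulting zip file
```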
I would not change the written tutorials. I would add a new, short one, I suppose.
Thanks for the update 😄
To pass parameters from a GitHub Actions runner to an R Markdown (.Rmd) file, you can utilize the params feature in R Markdown along with the rmarkdown::render() function in your GitHub Actions workflow. Here’s how to set it up:
Step-by-Step Setup
1. Modify Your R Markdown Document
First, ensure your R Markdown file is set up to accept parameters. You can define parameters in the YAML header of your .Rmd file like this:
```yaml
---
title: "Parameterized Report"
output: html_document
params:
  my_param: NULL
---
```

```{r}
# Use the parameter in your analysis
cat("The parameter value is:", params$my_param, "\n")
```
2. Create or Edit Your GitHub Actions Workflow
Next, create or edit a workflow YAML file in your repository (e.g., `.github/workflows/render-rmd.yml`). Below is an example of how to set up the workflow to pass parameters to the R Markdown document:
```yaml
name: Render R Markdown
on:
  push:
    branches:
      - main
jobs:
  render:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2
      - name: Set up R
        uses: r-lib/actions/setup-r@v2
      - name: Install dependencies
        run: |
          R -e 'install.packages("rmarkdown")'
      - name: Render R Markdown with parameters
        run: |
          Rscript -e 'rmarkdown::render("path/to/your_file.Rmd", params = list(my_param = "value1"))'
```
3. Explanation of Workflow Steps
- Checkout repository: checks out your repository so that the workflow can access your .Rmd file.
- Set up R: installs R on the runner.
- Install dependencies: installs any necessary packages, including rmarkdown.
- Render R Markdown with parameters: runs an R command that renders the .Rmd file and passes the parameter my_param with a specific value (e.g., "value1").
4. Using Different Parameters
If you want to run the same script with different parameters in parallel, you can use a job matrix. Here’s how to modify the workflow:
```yaml
name: Render R Markdown with Parameters
on:
  push:
    branches:
      - main
jobs:
  render:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        param_value: ["value1", "value2", "value3"] # Different parameter values
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2
      - name: Set up R
        uses: r-lib/actions/setup-r@v2
      - name: Install dependencies
        run: |
          R -e 'install.packages("rmarkdown")'
      - name: Render R Markdown with parameter
        run: |
          Rscript -e 'rmarkdown::render("path/to/your_file.Rmd", params = list(my_param = "${{ matrix.param_value }}"))'
```
Summary
By following these steps, you can effectively pass parameters from a GitHub Actions runner to an R Markdown document. This setup allows for dynamic report generation based on different input values, enhancing automation and flexibility in your workflows.
In Workflow A (Upload Artifact):
Create a job that generates the .csv file and uploads it as an artifact.
```yaml
name: Workflow A
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Generate CSV
        run: |
          echo "Column1,Column2" > output.csv
          echo "Data1,Data2" >> output.csv
      - name: Upload CSV as artifact
        # v4 is needed so the artifact can be downloaded from another workflow run
        uses: actions/upload-artifact@v4
        with:
          name: my-csv-file
          path: output.csv
```
In Workflow B (Download Artifact):
Set up a workflow that triggers on the completion of Workflow A and downloads the artifact.
```yaml
name: Workflow B
on:
  workflow_run:
    workflows: ["Workflow A"]
    types:
      - completed
jobs:
  download:
    runs-on: ubuntu-latest
    steps:
      - name: Download CSV artifact
        # v4 supports downloading an artifact from another workflow run,
        # given the run id and a token
        uses: actions/download-artifact@v4
        with:
          name: my-csv-file
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - name: Use CSV file
        run: |
          cat output.csv # Replace with your processing logic.
```
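One caveat worth noting: `workflow_run` triggers only fire for workflow files on the repository's default branch, so Workflow B must live on the default branch to be triggered by the completion of Workflow A.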
This is where I want to end up:
@damianooldoni I'm not an invited test user 😠 I get `Error: Currently limited to invited test users` when using `occ_download_sql`
@damianooldoni should we look into publishing it on Zenodo as well?
@damianooldoni is there a reason to use `continent = 'EUROPE'` instead of `countrycode = 'BE'` in the used query?
Yes, I think so. Do you apply anything extra beyond binding rows and renaming some column names along the cubes? If yes, it seems at first glance like overshooting. However, if we don't, we cannot:
- provide a unique DOI for reference
- describe the steps to unify the cubes

Currently this flow only downloads the cubes and unifies them into one, i.e. the GitHub Actions flow consists of 3 jobs: (1) build queries based on the GRIIS checklist, (2) download the cubes in parallel and finally (3) compile the cubes into one. Upon completion, a second GitHub Action is triggered to join the expanded grid info (commune, province, isFlanders, isWallonia, isBrussels), upload the modified cube to the S3 bucket of the alienspeciesportal and create the timeseries.
I could create a third GitHub Action, triggered upon completion of the first, to upload the unified cube to Zenodo.
I am just wondering if I could merge the SQL queries to have one big cube... I could make an attempt in November.
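A minimal sketch (hypothetical workflow and job names) of the 3-job structure described above; the job dependencies are expressed with `needs:` and the parallel downloads with a matrix:

```yaml
name: get_occ_cube            # hypothetical workflow name
on: workflow_dispatch
jobs:
  build_queries:              # (1) build SQL queries from the GRIIS checklist
    runs-on: ubuntu-latest
    steps:
      - run: echo "build queries"
  download_cubes:             # (2) download the cubes in parallel
    needs: build_queries
    runs-on: ubuntu-latest
    strategy:
      matrix:
        rank: [species, subspecies, genus]   # hypothetical matrix values
    steps:
      - run: echo "download cube for rank ${{ matrix.rank }}"
  compile_cube:               # (3) compile the cube parts into one
    needs: download_cubes
    runs-on: ubuntu-latest
    steps:
      - run: echo "compile cubes"
```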
I've written my code to download the cubes in parallel in the hope of remaining within the limits of GitHub Actions.
`continent = 'EUROPE'` or `countrycode = 'BE'`, that's the question?
I think I left `continent = 'EUROPE'` instead of `countrycode = 'BE'` because during TrIAS there was a need to run species modelling for a subset of species at the EU level (done by Amy Davis, postdoc at UGent and working with @DiederikStrubbe). At the time I did this by creating two different workflows. I actually have no idea if she ever used those cubes at continental level. I think you have to discuss this with @soriadelva who is/was working on Amy's scripts (I think so), @DiederikStrubbe and @timadriaens.
So, the choice is yours. Notice that you can leave the query at continent level and then filter for Belgium. This cube could be useful for other researchers within or outside INBO and the B-Cubed project. Of course, it is many times bigger.
Ok, I'll keep it for now and do some testing; if it doesn't cause too much of a delay I'll leave it as is.
Note to self, still to do:
- cube_preprocessing.Rmd
tests are ongoing, see https://github.com/inbo/aspbo/actions/workflows/get_occ_cube.yaml
dammit:
```
Error:
! A download limitation is exceeded:
User *** has too many simultaneous downloads; the limit is 3.
Please wait for some to complete, or cancel any unwanted downloads. See your user page.
```
Idea to fix the issue:
```r
library(rgbif)

num_active_downloads <- 3
while (num_active_downloads >= 3) {
  # Get list of all downloads
  downloads <- occ_download_list()
  # Filter for active downloads (status is RUNNING or PREPARING)
  active_downloads <- downloads[downloads$status %in% c("RUNNING", "PREPARING"), ]
  # Get the count of active downloads
  num_active_downloads <- nrow(active_downloads)
  print(paste("Number of active downloads:", num_active_downloads))
  Sys.sleep(60)
}
```
This works: aspbo flow
the flow works but there is something wrong with the cube preprocessing
@damianooldoni I think there is something wrong with either the query or the rgbif function. I'll try to explain using Monomorium pharaonis (1314773) as an example. This species is present in all steps of the flow (queries, download & compiled cube), however none of the grid cells are in Belgium, while the species has at least 53 occurrences in Belgium according to GBIF. This means it gets erroneously flagged as having no occurrences in Belgium. I can't figure out what goes wrong 🤔
Common issues:
- the occurrences in Belgium are filtered out, e.g. they are unverified. You can see in the `FILTER` section of the SQL query which occurrences are filtered out at the moment.
I've noticed most of the eligible occurrences lack a verification status altogether, therefore I've added `OR identificationVerificationStatus IS NULL` to the verification status filter. So the final filter looks something like:
```sql
AND (
  LOWER(identificationVerificationStatus) NOT IN (
    'unverified',
    'unvalidated',
    'not validated',
    'under validation',
    'not able to validate',
    'control could not be conclusive due to insufficient knowledge',
    'uncertain',
    'unconfirmed',
    'unconfirmed - not reviewed',
    'validation requested'
  )
  OR identificationVerificationStatus IS NULL
)
```
Meaning I want all occurrences whose verification status, if provided, is not in the list, plus those that have no verification status at all.
- coordinate uncertainty of the Belgian occurrences is so high that occurrences, randomly assigned within the uncertainty circle, fall in cells outside Belgium. I would add a filter on the maximum allowed coordinate uncertainty, something in the range of 10 km.
I've also added `AND coordinateUncertaintyInMeters <= 10000` to the filter 🤞
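Taken together, the tail of the `WHERE` clause would then look roughly like this (a sketch combining the two additions above, not the full query):

```sql
AND (
  LOWER(identificationVerificationStatus) NOT IN (
    'unverified',
    -- ... (remaining statuses as listed above)
    'validation requested'
  )
  OR identificationVerificationStatus IS NULL
)
AND coordinateUncertaintyInMeters <= 10000
```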
@damianooldoni adding/modifying these filters still does not remove the species from the list of species without occurrences, but it did result in an increase of "infected" grid cells from 15,558,095 to 15,563,674.
@damianooldoni the base queries can be found here .data/input/.... Every rank has its own base query, in which the nubKeys are gsubbed. The queries used in the last run, with a verification filter for all species, can be found here: https://github.com/inbo/aspbo/actions/runs/11890162263/artifacts/2200658319.
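A minimal sketch of that substitution step (the file path and the "NUBKEYS" placeholder token are hypothetical, not the actual ones used in aspbo):

```r
# Read the base query for a given rank and substitute the taxon keys
base_query <- paste(readLines(".data/input/base_query_species.sql"),
                    collapse = "\n")
nub_keys <- c(1314773, 2498252)  # example GBIF taxon keys
query <- gsub("NUBKEYS", paste(nub_keys, collapse = ", "), base_query)
cat(query)
```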
@damianooldoni an update:
Passing the species that were omitted with the verification status filter through a query without a verification status filter made most of these species pass through. However, some issues still remain (example: Epuraea imperialis (Reitter, 1877), which has some eligible occurrences). Also, this approach might be too loose and thus not preferred. Do you have any notion of limits for the `occ_download_sql()` function? Like the number of occurrences used or species provided?
Thanks @SanderDevisscher. Your nice description of the issue and of your attempts will help me a lot. I plan to solve this issue on Monday! I hope to come back here with a solution, an elegant one 😄
@damianooldoni did you find the cause of the issue? And a solution?
I worked on it on Monday and still didn't find it. Monday evening I did some other attempts, but making cubes via GBIF took hours and so I couldn't move much further. I will book some time tomorrow. Sorry for the delay. One thing is sure: adding the condition about `identificationVerificationStatus IS NULL` helps to get way more occurrences included.
Can you please send me the links of the artifacts with and without `identificationVerificationStatus IS NULL` in the SQL query? Thanks.
@damianooldoni thanks for trying 😄.
The artifacts can be found at the bottom of these pages. "artifact" contains the queries; the other artifacts contain the respective cube parts or the compiled cube.
With the `identificationVerificationStatus IS NULL` filter: https://github.com/inbo/aspbo/actions/runs/11889150140
Without the `identificationVerificationStatus IS NULL` filter: https://github.com/inbo/aspbo/actions/runs/11743019380
Thanks a lot.
Ah, I forgot to mention what I think about some of the possible issues. I don't think there is any problem with the length of the query: if it were too long it would be truncated and would not pass validation (= no download).
- create full_timeseries.csv
- workflow to trigger when the download workflow was successful, see trigger code:
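A minimal sketch of such a trigger, mirroring the `workflow_run` example earlier in this thread (the workflow name is hypothetical):

```yaml
on:
  workflow_run:
    workflows: ["get_occ_cube"]  # hypothetical name of the download workflow
    types:
      - completed
```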