PhyloNext is an automated pipeline for the analysis of phylogenetic diversity using GBIF occurrence data, species phylogenies from the Open Tree of Life, and the Biodiverse software.
The pipeline brings together two critical research data infrastructures, the Global Biodiversity Information Facility (GBIF) and the Open Tree of Life (OToL), to make them more accessible to non-experts.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.
The pipeline can be launched in a cloud environment (e.g., Microsoft Azure, Amazon Web Services (AWS), or Google Cloud).
An example command to run the pipeline:
nextflow run vmikk/phylonext -r main \
--input "/mnt/GBIF/Parquet/2022-01-01/occurrence.parquet/" \
--classis "Mammalia" --family "Felidae,Canidae" \
--country "DE,PL,CZ" \
--minyear 2000 \
--dbscan true \
--phytree $(realpath "${HOME}/.nextflow/assets/vmikk/phylonext/test_data/phy_trees/Mammals.nwk") \
--iterations 100 \
-resume
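When several runs differ only in a few filters, it can help to assemble the command in a script before launching it. The following is a minimal sketch under stated assumptions, not part of the pipeline itself: it replaces the country filter with a bounding box using the `--latmin`, `--latmax`, `--lonmin`, and `--lonmax` options from the help message (with the example values shown there), checks that the box is not inverted, and only prints the resulting command. The variable names and the `results` output directory are hypothetical.

```shell
#!/bin/sh
# Sketch: assemble a PhyloNext command and print it (dry run).
# Variable names and the output directory are illustrative assumptions.

input="/mnt/GBIF/Parquet/2022-01-01/occurrence.parquet/"
outdir="results"   # hypothetical output directory

# Bounding box in decimal degrees (example values from the help message)
latmin=5.1
latmax=15.5
lonmin=47.0
lonmax=55.5

# Refuse an inverted bounding box before anything is launched
if ! awk -v a="$latmin" -v b="$latmax" -v c="$lonmin" -v d="$lonmax" \
     'BEGIN { exit !(a < b && c < d) }'; then
  echo "Invalid bounding box" >&2
  exit 1
fi

# Store the command as positional parameters so each argument stays intact
set -- nextflow run vmikk/phylonext -r main \
  --input "$input" --outdir "$outdir" \
  --classis "Mammalia" --family "Felidae,Canidae" \
  --latmin "$latmin" --latmax "$latmax" \
  --lonmin "$lonmin" --lonmax "$lonmax" \
  --minyear 2000 \
  -resume

# Print the command for inspection; replace this line with "$@" to run it
echo "$*"
```

Printing first makes it easy to review the filters before committing compute time to a run.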
To facilitate easy and efficient navigation for exploring the PhyloNext pipeline, a user-friendly, web-based graphical user interface (GUI) has been developed by Thomas Stjernegaard Jeppesen.
The GUI is available at https://phylonext.gbif.org/.
NB! To access the GUI, users must have a GBIF user account. To register an account, please visit https://www.gbif.org/.
The PhyloNext pipeline comes with documentation about the pipeline usage at https://phylonext.github.io/.
The main pipeline parameters and outputs are described in the documentation.
To show a help message, run `nextflow run vmikk/phylonext -r main --help`:
=====================================================================
PhyloNext: GBIF phylogenetic diversity pipeline : Version 1.4.0
=====================================================================
Pipeline Usage:
To run the pipeline, enter the following in the command line:
nextflow run vmikk/phylonext -r main --input ... --outdir ...
Options:
REQUIRED:
--input Path to the directory with parquet files (GBIF occurrence dump)
--outdir The output directory where the results will be saved
OPTIONAL:
--phylum Phylum to analyze (multiple comma-separated values allowed); e.g., "Chordata"
--classis Class to analyze (multiple comma-separated values allowed); e.g., "Mammalia"
--order Order to analyze (multiple comma-separated values allowed); e.g., "Carnivora"
--family Family to analyze (multiple comma-separated values allowed); e.g., "Felidae,Canidae"
--genus Genus to analyze (multiple comma-separated values allowed); e.g., "Felis,Canis,Lynx"
--specieskeys Custom list of GBIF specieskeys (file with a single column, with header)
--phytree Custom phylogenetic tree
--taxgroup Specific taxonomy group in Open Tree of Life (default, "All_life")
--phylabels Type of tip labels on a phylogenetic tree ("OTT" or "Latin")
--maxage Manually assign root age for a tree obtained from Open Tree of Life; e.g., 127
--phyloonly Prune Open Tree tips for which there are no phylogenetic inputs; logical, default, false
--country Country code, ISO 3166 (multiple comma-separated values allowed); e.g., "DE,PL,CZ"
--latmin Minimum latitude of species occurrences (decimal degrees); e.g., 5.1
--latmax Maximum latitude of species occurrences (decimal degrees); e.g., 15.5
--lonmin Minimum longitude of species occurrences (decimal degrees); e.g., 47.0
--lonmax Maximum longitude of species occurrences (decimal degrees); e.g., 55.5
--minyear Minimum year of record's occurrences; default, 1945
--maxyear Maximum year of record's occurrences; default, none
--coordprecision Coordinate precision threshold (less than maximum allowed value; default, 0.1)
--coorduncertainty Maximum allowed coordinate uncertainty, meters (default, 10000)
--coorduncertaintyexclude Black list of coordinate uncertainty values (default, "301,3036,999,9999")
--basisofrecordinclude Basis of record to include from the data; e.g., "PRESERVED_SPECIMEN"
--basisofrecordexclude Basis of record to exclude from the data; e.g., "FOSSIL_SPECIMEN,LIVING_SPECIMEN"
--polygon Custom area of interest (a file with polygons in GeoPackage format)
--wgsrpd Polygons of World Geographical Regions; e.g., "pipeline_data/WGSRPD.RData"
--regions Names of World Geographical Regions; e.g., "L1_EUROPE,L1_ASIA_TEMPERATE"
--noextinct File with extinct species specieskeys for their removal (file with a single column, with header)
--excludehuman Logical, exclude genus "Homo" from occurrence data (default, true)
--roundcoords Numeric, round spatial coordinates to N decimal places, to reduce the dataset size (default, 2; set to negative to disable rounding)
--h3resolution Spatial resolution of the H3 geospatial indexing system; e.g., 4
--dbscan Logical, remove spatial outliers with density-based clustering; e.g., "false"
--dbscannoccurrences Minimum species occurrence to perform DBSCAN; e.g., 30
--dbscanepsilon DBSCAN parameter epsilon, km; e.g., "700"
--dbscanminpts DBSCAN min number of points; e.g., "3"
--terrestrial Land polygon for removal of non-terrestrial occurrences; e.g., "pipeline_data/Land_Buffered_025_dgr.RData"
--rmcountrycentroids Polygons with country and province centroids; e.g., "pipeline_data/CC_CountryCentroids_buf_1000m.RData"
--rmcountrycapitals Polygons with country capitals; e.g., "pipeline_data/CC_Capitals_buf_10000m.RData"
--rminstitutions Polygons with biological institutions and museums; e.g., "pipeline_data/CC_Institutions_buf_100m.RData"
--rmurban Polygons with urban areas; e.g., "pipeline_data/CC_Urban.RData"
--deriveddataset Prepare a list of DOIs for the datasets used (default, true)
--indices Comma-separated list of diversity and endemism indices; e.g., "calc_richness,calc_pd,calc_pe"
--randname Randomisation scheme type; e.g., "rand_structured"
--iterations Number of randomisation iterations; e.g., 1000
--biodiversethreads Number of Biodiverse threads; e.g., 10
--randconstrain Polygons to perform spatially constrained randomization (GeoPackage format)
Leaflet interactive visualization:
--leaflet_var Variables to plot; e.g., "RICHNESS_ALL,PD,SES_PD,PD_P,ENDW_WE,SES_ENDW_WE,PE_WE,SES_PE_WE,CANAPE,Redundancy"
--leaflet_canapesuper Include the `superendemism` class in CANAPE results (default, false)
--leaflet_color Color scheme for continuous variables (default, "RdYlBu")
--leaflet_palette Color palette for continuous variables (default, "quantile")
--leaflet_bins Number of color bins for continuous variables (default, 5)
--leaflet_sescolor Color scheme for standardized effect sizes, SES (default, "threat"; alternative, "hotspots")
--leaflet_redundancy Redundancy threshold for hiding the grid cells with low number of records (default, 0 = display all grid cells)
Static visualization:
--plotvar Variables to plot (multiple comma-separated values allowed); e.g., "RICHNESS_ALL,PD,PD_P"
--plottype Plot type
--plotformat Plot format (jpg,pdf,png)
--plotwidth Plot width (default, 18 inches)
--plotheight Plot height (default, 18 inches)
--plotunits Plot size units (in,cm)
--world World basemap
NEXTFLOW-SPECIFIC:
-qs Queue size (max number of processes that can be executed in parallel); e.g., 8
-w Path to the working directory to store intermediate results (default, "./work")
-resume Execute the pipeline using the cached results. Useful to continue executions that were stopped by an error
-profile Configuration profile; e.g., "docker"
-params-file Parameter file in YAML or JSON format (e.g., "Mammals.yaml")
-c / -C Configuration file (`-C` ignores all default values) (default, "nextflow.config")
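Long option lists are easier to keep under version control as a parameter file passed with `-params-file`. The sketch below writes a small YAML file whose keys mirror the option names from the help message, without the leading dashes, which is how Nextflow maps parameter-file entries to pipeline parameters. The file name `Mammals.yaml` is the example from the help text; the `results` output directory and the particular combination of values are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: write a parameter file for PhyloNext.
# Keys correspond to the pipeline options listed in the help message;
# the values are illustrative and should be adapted to the analysis.

cat > Mammals.yaml <<'EOF'
input: "/mnt/GBIF/Parquet/2022-01-01/occurrence.parquet/"
outdir: "results"
classis: "Mammalia"
family: "Felidae,Canidae"
country: "DE,PL,CZ"
minyear: 2000
h3resolution: 4
dbscan: true
indices: "calc_richness,calc_pd,calc_pe"
randname: "rand_structured"
iterations: 100
EOF

# Show what was written
cat Mammals.yaml

# The pipeline could then be launched with, for example:
#   nextflow run vmikk/phylonext -r main -params-file Mammals.yaml -resume
```

Options given on the command line take precedence over values in the parameter file, so a single file can serve as a baseline for several runs.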
Source code for the documentation can be found at https://github.com/PhyloNext/phylonext.github.io.
The PhyloNext pipeline was developed by Vladimir Mikryukov and Kessy Abarenkov.
The Biodiverse program and the Perl scripts accompanying PhyloNext were written by Shawn Laffan (Laffan et al., 2010).
Scripts for getting an induced subtree from the Open Tree of Life were developed by Emily Jane McTavish.
We thank the following people for their extensive assistance in the development of this pipeline: Joe Miller, Shawn Laffan, Tim Robertson, Emily Jane McTavish, John Waller, Thomas Stjernegaard Jeppesen, and Matthew Blissett.
We are also very grateful to Manuele Simi and the nf-core community for helpful advice on the development of this pipeline.
For more details, please see the Acknowledgments section in the docs.
The work is supported by a grant “PD (Phylogenetic Diversity) in the Cloud” to GBIF Supplemental funds from the GEO-Microsoft Planetary Computer Programme.
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to file an issue on GitHub.
If you use the PhyloNext pipeline for your analysis, please cite it as:
Mikryukov V, Abarenkov K, Laffan S, Robertson T, McTavish EJ, Jeppesen TS, Waller J, Blissett M, Kõljalg U, Miller JT (2024). PhyloNext: A pipeline for phylogenetic diversity analysis of GBIF-mediated data. BMC Ecology and Evolution, 24(1), 76. DOI:10.1186/s12862-024-02256-9
Laffan SW, Lubarsky E, Rosauer DF (2010). Biodiverse, a tool for the spatial analysis of biological and related diversity. Ecography, 33, 643-647. DOI:10.1111/j.1600-0587.2010.06237.x
An extensive list of references for the tools used by the pipeline can be found in the Citations section in the documentation.