meringlab / MicrobeAtlasWebsite

Coordinates in samples.env.info table do not match what is shown on the website #12

Closed ddlawton closed 1 year ago

ddlawton commented 1 year ago

The coordinates provided in the samples.env.info file, along with the code I used to map them, are shown below:

library(tidyverse)
library(sf)
library(rnaturalearth)

url <- "https://microbeatlas.org/downloads/samples/samples.env.info"

download.file(url,
              destfile = basename(url), method="curl", extra="-k")

dat <- read_delim("samples.env.info",col_names=FALSE) %>%
  select(X1,X9) %>%
  rename(sample_id = "X1",coordinates = "X9") %>%
  drop_na(coordinates) %>%
  separate(coordinates,into=c('latitude','longitude'),sep=" ") %>%
  mutate(across(c(latitude,longitude),as.numeric)) %>% # separate() returns character columns; between() needs numbers
  filter(between(longitude,-180,180),between(latitude,-90,90)) %>% # drop out-of-range values, potentially UTM coordinates (~200 points)
  st_as_sf(coords=c("longitude","latitude"), crs=4326) # Assuming WGS84 -- create simple features object

world <- ne_countries(returnclass = "sf") %>% select(geometry) # Getting outline of continents

dat %>%
  ggplot(aes(geometry=geometry)) +
  geom_sf(data=world,aes(geometry=geometry)) +
  geom_sf(pch=21,size=0.2,alpha=.6) +
  theme_void() 

[Image: map of the sample coordinates plotted by the code above]

There are some odd patterns in the point distribution, such as the entire western United States being underrepresented.

These patterns do not match what I can see on the website. For example, the website's distribution map for Micrococcales clearly shows more points in the western United States.

Are these points intentionally unavailable, or can I access them in a different way?

Thanks!

ddlawton commented 1 year ago

Looking at the original file, many of the samples are there (e.g. https://microbeatlas.org/index.html?action=sample_detail&sid=SRS1056245&rid=SRR2233313); however, the coordinates column is left blank.

library(tidyverse)

url <- "https://microbeatlas.org/downloads/samples/samples.env.info"

download.file(url,
              destfile = basename(url), method="curl", extra="-k")

dat <- read_delim("samples.env.info",col_names=FALSE) %>%
  select(X1,X9) %>%
  rename(sample_id = "X1",coordinates = "X9") 

dat %>% filter(str_detect(sample_id, "SRS1056245")) # str_detect(string, pattern): column first, then the ID to look for

grexor commented 1 year ago

Dear Douglas,

Thanks for reaching out! I took a quick look at the parsing code, and the lat/lon is actually extracted from the samples.info file in a, how to say, non-trivial way, using regex etc.

@jfmrod @MCDanaila perhaps it would be possible to provide a separate file with (sample, geo) info on the download page?

Hope this helps, Gregor
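
For readers wondering what a regex-based extraction of this kind might look like, here is a minimal sketch in base R. It is not the MicrobeAtlas parsing code, which is not shown in this thread. It assumes the raw metadata holds strings in the "38.98 N 77.11 W" style used by the NCBI BioSample lat_lon attribute; the function name parse_lat_lon and the example strings are hypothetical.

```r
# Hypothetical helper (not the MicrobeAtlas parser): convert strings such as
# "38.98 N 77.11 W" into signed decimal degrees. Anything that does not fit
# the assumed pattern becomes NA instead of raising an error.
parse_lat_lon <- function(x) {
  pattern <- "^\\s*([0-9]+\\.?[0-9]*)\\s*([NS])\\s+([0-9]+\\.?[0-9]*)\\s*([EW])\\s*$"
  matches <- regmatches(x, regexec(pattern, x, ignore.case = TRUE))
  coords <- vapply(matches, function(g) {
    # A successful match has 5 elements: the full match plus 4 capture groups
    if (length(g) != 5) return(c(NA_real_, NA_real_))
    lat_sign <- if (toupper(g[3]) == "S") -1 else 1  # south is negative
    lon_sign <- if (toupper(g[5]) == "W") -1 else 1  # west is negative
    c(lat_sign * as.numeric(g[2]), lon_sign * as.numeric(g[4]))
  }, numeric(2))
  data.frame(latitude = coords[1, ], longitude = coords[2, ])
}

# Made-up example values, for illustration only
examples <- c("38.98 N 77.11 W", "33.5 S 151.2 E", "not provided")
parsed <- parse_lat_lon(examples)
print(parsed)
```

Returning NA for unparseable entries keeps every sample row in place, so missing coordinates can be counted rather than silently dropped.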

jfmrod commented 1 year ago

Hi, thanks for pointing that out. There was indeed a mismatch between the data on the website and the data available for download. I've fixed the discrepancy, so the data available for download should now match what is shown on the website.

ddlawton commented 1 year ago

Thanks!