popphylotools / mvMapper

mvMapper (Multivariate Mapper) is an interactive data exploration tool for multivariate analyses with associated geographic location information.
https://popphylotools.github.io/mvMapper/
BSD 2-Clause "Simplified" License
9 stars 1 forks source link

spca example not displaying #74

Closed thibautjombart closed 6 years ago

thibautjombart commented 7 years ago

Not sure where it comes from, but the spca example from export_to_mvmapper does not seem to work:

    data(rupica)

     spca1 <- spca(rupica, type=5, d1 = 0, d2 = 2300,
                   plot = FALSE, scannf = FALSE,
                   nfposi = 2,nfnega = 0)

     info <- data.frame(key = indNames(rupica),
                        lat = rupica$other$xy[,2],
                        lon = rupica$other$xy[,1])

     out <- export_to_mvmapper(spca1, info, out_file = "test.csv")

Importing this file into mvmapper seems to work but no output is visible. test.csv.zip

jrdupuis commented 7 years ago

The lat/long values aren't in WGS84, so I think it's just breaking. That would be a good place for an error message! Will work on pulling together another example for spca (unless you happen to have an old analysis/dataset laying around Thibaut).

jrdupuis commented 7 years ago

Actually, for that matter, the dapc example using sim2pop isn't displaying either. Seems like a few of the lat/longs in that dataset are maybe too extreme (e.g. 96.55281517, 4.958245344 doesn't show up on a google map search). Maybe it would be easier to just have a mvmapper-specific canned dataset to use for this? I'm not sure if that's more trouble than it's worth. Thoughts?

thibautjombart commented 7 years ago

sim2pop is different, it is simulated and not meant to reflect existing locations; the sPCA example using the rupica dataset should ideally work, as these are genuine locations. Is there a way to test if coords are WGS84 (I sense the answer is no) and convert on the fly, or generate a warning?

jrdupuis commented 7 years ago

I'm not aware of a way to do that conversion on the fly in R (and doubt that it's possible, at least on the fly). But if we knew the CRS for the rupica dataset, we could reproject the coordinates and get them into lat long easily in the R example.

Do you happen to know what coordinate reference system is used in the rupica data? I did some digging around but cannot figure out what those supposed "easting" and "northing" values are. They are not normal UTM coordinates (don't match what the values would be in Bauges), and don't look like any other coordinate system I'm familiar with (granted, I don't know the European systems if they happen to have some special ones, and I'm really no GIS expert). If we can figure out the CRS for that dataset, then we can just add a few lines to the examples using the rupica data. e.g.

library(adegenet)
library(sp)
library(raster)
data(rupica)
coords <- rupica@other$xy
coords2 <-as.data.frame(coords)
coordinates(coords2) <- c("x","y")
crs1 <- "+proj=utm +zone=31 +ellps=WGS84 +datum=WGS84 +units=m +no_defs" # this would be the crs of the rupica dataset
proj4string(coords2) <- crs1
crs2 <- "+proj=longlat +zone=31 +ellps=WGS84 +datum=WGS84 +units=m +no_defs" # and we change it to this
coords3 <- spTransform(coords2,crs(crs2))
spca1 <- spca(rupica, type=5, d1 = 0, d2 = 2300, plot = FALSE, scannf = FALSE, nfposi = 2,nfnega = 0)
info <- data.frame(key = indNames(rupica), lat = coordinates(coords3)[,1], lon = coordinates(coords3)[,2])
out <- export_to_mvmapper(spca1, info, write_file = T)

That code actually works fine, and can be uploaded to mvmapper. Except the CRS is off, so it gets plotted in Africa rather than France. If we wanted to live with that, then we could. But it'd be nice if we could get the rupica dataset to play with lat/long. Let me know if you have any other ideas about the rupica coordinates, or is there another toy dataset in R that you know of with genetic and geographic data?

thibautjombart commented 7 years ago

Thanks a lot for looking into this. I had assumed the coords were regular longitude / latitude until now. After looking in the data, I have no idea what projection system was used. This collaboration was a hot mess and I am not even sure who provided the GIS layer, but both potential suspects left science a while ago. Best give up on this example maybe? Given these are real data, plotting things in a different place is not an option.

jrdupuis commented 7 years ago

I think giving up on that example sounds good. Do you have another toy dataset in mind that has lat long and genotypes?

I'm not sure what the process is for creating an R-accessible dataset like that, but we should be able to use either of the datasets that I used for figures in the paper (one is a big human microsatellite dataset, and the other is one of my swallowtail butterfly microsatellite datasets). Do you just upload an R object to the library? I think you would be better at pulling that all into R and doing so in the proper way, but I can pull together the raw inputs (genotypes, lat longs, etc.) and send those to you if that works.

thibautjombart commented 7 years ago

Good, let's give up on this example and find a better one then. We can do either:

  1. import all the data in R in the right format, save it as a .RData indata/ - it will be available via data(...) in adegenet
  2. store the raw files in adegenet and then show how to import data in the doc

They're not mutually exclusive. We have plenty of 1), and from the feedback I got on a course last week people want more example of 2). What format is the microsat data in?

In any case, we'll need to document the new data by adding roxygen2 info to: https://github.com/thibautjombart/adegenet/blob/master/R/datasets.R

Let me know if you can PR this or if you need help.

jrdupuis commented 7 years ago

I think I like the second option, which would allow us to use the example code from the paper for the most part. I have both inputs in structure format now, with the lat longs in a separate file, so will be easy to work with.

I'll pull together the files and code in the next several days, and send them to you. If you can then fill in the R-code-style components for the roxygen2 info, that would be great.

thibautjombart commented 7 years ago

Sounds good. The workflow would be:

jrdupuis commented 6 years ago

Hi Thibaut, Here's the .str and .csv for the Rosenberg dataset. Rosenberg.zip I'm not sure if I'm right about the system.file command, so could you double check that part?

# An example using the microsatellite dataset of Rosenberg et al. 2005 (1,048 individuals, 783 loci, doi: 10.1371/journal.pgen.0010070)
# Reading input file from adegenet
system.file("files", "Rosenberg_783msats.str", package="adegenet")
Rosenberg <- read.structure("Rosenberg_783msats.str", n.ind=1048, n.loc=783,  onerowperind=F, col.lab=1, col.pop=2, row.marknames=NULL, NA.char="-9", ask=F, quiet=F)

# DAPC (n.pca determined using xvalDapc, see ??xvalDapc)
dapc1 <- dapc(Rosenberg, n.pca=20, n.da=200)

# read in localities.csv, which contains "key", "lat", and "lon" columns with column headers (this example contains a fourth column "population" which is a text-based population name based on geography)
system.file("files", "Rosenberg_localities.csv", package="adegenet")
localities <- read.csv(file="Rosenberg_localities.csv", header=T)

# generate mvmapper input file, automatically write the output to a csv, and name the output csv "rosenbergData.csv" 
out <- export_to_mvmapper(dapc1, localities, write_file=TRUE, out_file="RosenbergData.csv")

I think having a single example should be sufficient, but if we want to include an sPCA, that would be great. I am not very familiar with that method, so if you think it'd be good to include an sPCA example, could you look at that dataset and try to come up with something that is statistically sound? Alternatively, the other dataset used as an example in the paper is a butterfly one that is at a finer geographic scale--I attached those datafiles here: Dupuis.zip

Let me know what you think!

thibautjombart commented 6 years ago

Hi Julian sorry, too many things on the stove atm. Can you poke me on this after the 22nd December? Re examples: only one should be good enough, but I'll keep itin mind re sPCA.

jrdupuis commented 6 years ago

Sounds good--I'll give you a poke after the 22nd!

thibautjombart commented 6 years ago

Hi Julian, thanks for the poke. I've looked into the examples, and Rosenberg's is going to be far too large a file to include in the package. The example with the butterfly data should be just fine. Your code looks fine except for system.file. This function returns the path to files stored in the /inst/ folder of the package. For instance, to find the local path to AFLP.txt you need:

my_file <- system.file("files/AFLP.txt", package = "adegenet")
x <- read.table(my_file) # in your case that will be read.structure(...)

If you can PR this that'd be brilliant.

jrdupuis commented 6 years ago

Hi Thibaut,

Thanks for the help with the system.file code. I think this block of code should be good to go for dealing with the butterfly dataset (still attached as Dupuis.zip above):

# An example using the microsatellite dataset of Dupuis et al. 2016 (781 individuals, 10 loci, doi: 10.1111/jeb.12931)
# Reading input file from adegenet
input_data <- system.file("files/Dupuis_10msats.str", package="adegenet")
butterflies <- read.structure(input_data, n.ind=781, n.loc=10,  onerowperind=F, col.lab=1, col.pop=2, row.marknames=NULL, NA.char="-9", ask=F, quiet=F)

# DAPC (n.pca determined using xvalDapc, see ??xvalDapc)
dapc1 <- dapc(butterflies, n.pca=40, n.da=200)

# read in localities.csv, which contains "key", "lat", and "lon" columns with column headers (this example contains additional columns containing species identifications, locality descriptions, and COI haplotype clades)
input_localities <- system.file("files/Dupuis_localities.csv", package="adegenet")
localities <- read.csv(input_localities, header=T)

# generate mvmapper input file, automatically write the output to a csv, and name the output csv "mvMapper_Data.csv" 
out <- export_to_mvmapper(dapc1, localities, write_file=TRUE, out_file="mvMapper_Data.csv")

Let me know if that all looks good, or if any other tweaks are needed (I assume "PR" just means "Please Revise"?). Thanks again!

thibautjombart commented 6 years ago

Hi Julian, looks good, but PR actually means 'pull request', which is the best mechanism for contributing to a github project. Basically, you need to:

  1. go the the adegenet github page, click on 'fork'; that will create a copy of the adegenet repo in your own github account
  2. clone that repo, make needed changes; test that the new code works after installing the new version (devtools::install())
  3. commit, push
  4. then go back to the adegenet github page, click on 'pull request' -> new pull request and voilà!

That will also run current integration tests automatically.

jrdupuis commented 6 years ago

Hey Thibaut, Ahh yeah of course. Thanks for putting up with my github naivete!

I've done a pull request to your master branch (not sure if you have a different, in-development branch for tweaks like this), and have tested the new files and code on my side, and all seems good. I did tweak some of the file names and code in an effort to match the other example datasets that you have in adegenet/data. Feel free to tweak them further if you'd like.

There is one thing that I can't get to fully work, which is getting the example code to show up through ??export_to_mvmapper or other "help" methods. I have the feeling that's due to my inexperience with writing R code, and that you'll be able to remedy it quickly. I've added this chunk of example code to the export_to_mvmapper.R:

#' @export
#' @rdname export_to_mvmapper
#' @examples
#' 
#' # An example using the microsatellite dataset of Dupuis et al. 2016 (781 individuals, 10 loci, doi: 10.1111/jeb.12931)
#' # Reading input file from adegenet
#' input_data <- system.file("data/swallowtails.rda", package="adegenet")
#' data(swallowtails)
#' 
#' # conducting a DAPC (n.pca determined using xvalDapc, see ??xvalDapc)
#' dapc1 <- dapc(butterflies, n.pca=40, n.da=200)
#' 
#' # read in swallowtails_loc.csv, which contains "key", "lat", and "lon" columns with column headers (this example contains additional columns containing species identifications, locality descriptions, and COI haplotype clades)
#' input_locs <- system.file("data/swallowtails_loc.csv", package="adegenet")
#' loc <- read.csv(input_locs, header=T)
#' 
#' # generate mvmapper input file, automatically write the output to a csv, and name the output csv "mvMapper_Data.csv" 
#' out <- export_to_mvmapper(dapc1, loc, write_file=TRUE, out_file="mvMapper_Data.csv")

Is another step needed to get it to be incorporated into adegenet/man/export_to_mvmapper.rd? Or maybe I didn't get the syntax right in the above code, to get it recognized as an "example"?

Anyhow, let me know if there's anything else to do on my end of things. Thanks again!

thibautjombart commented 6 years ago

This line: #' dapc1 <- dapc(butterflies, n.pca=40, n.da=200) caused an error, most lkely a typo - I assume it should read swallowtails After changing that the example works and I can use this in mvmapper. The doc is visible through ?export_to_mvmapper and example(export_to_mvmapper), but needed a devtools::document() to update the Rd file.

The csv file needs to be in inst/ and is causing an error when in data/. Fixing this now.

jrdupuis commented 6 years ago

Ahh yeah, a typo. Thanks for those fixes, and getting the files in the right places!

jrdupuis commented 6 years ago

All works well from a recent install of the dev version, so I'm going to close this issue. Thanks again Thibaut!