Closed: willpearse closed this 6 years ago
Don't just click merge without reading the above; otherwise I have a horrible feeling I'll just mess up what you've already done.
I had to do the GBIF data prep outside the remake workflow because the giant cached objects were causing me to overflow my storage limit on the cluster. It was also really awkward handling the paths correctly given the cluster weirdness. All I did after your steps was filter to only rows that matched TPL accepted names...
Will have a look
Right, that makes sense. I think we ought to drop remake for this, to be honest, as the files are so huge. rake is basically doing the same thing anyway.
I like what you did with the if(linux)-esque bits. If you can make it load from where the file "should" be when it's not on Linux, then I'm pretty sure I can hack up a way around the problem.
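Something like this sketch is what I have in mind; both paths here are placeholders for illustration, not the real ones in the repo:

```r
# Sketch of the if(linux)-esque lookup: use the cluster scratch location
# on Linux, otherwise fall back to where the file "should" be locally.
# Both paths are made-up placeholders.
gbif_path <- function() {
  if (Sys.info()[["sysname"]] == "Linux") {
    "/srv/scratch/overlap-data/raw_data/cooked.csv"  # hypothetical cluster path
  } else {
    file.path("raw_data", "cooked.csv")              # hypothetical local path
  }
}
```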
As I said on email, if this isn't helpful to do, then honestly just tell me to sod off. I would love to get this done, and I don't want to be giving the appearance of being helpful while I'm really just irritating you :D
Cool. I think I understand your rake thing but it's pretty new to me.
Maybe if you get it working on your machine on a branch, then I can merge it back in once it's fully working for you?
Right. The problem I'm having is I don't know what your files map onto, which makes it tricky for me to get it working.
Anyway, I'll plug away a bit more and see what I can figure out, then I'll ping you again. Thanks Will!
Ok, there are still some obsolete functions in there from the previous iterations. I am traveling today but can do some organizing tomorrow...
OK. Dug back into this. Basically this got a bit messy because I can use rake on my local machine fine, but I don't yet have the correct permissions/modules installed on the cluster. I can sort that out if we go completely over to that workflow for reproducibility... certainly it would reduce the number of necessary hacks.
As for processing "../clean_data/gbif_spp_clean.txt" to produce "../../../srv/scratch/z3484779/overlap-data/raw_data/cooked.csv": all I did was stick the genus and species names together and filter to only accepted species names.
For filtering to accepted species, it looks like I was using a file I made from Beth's original scraping a few years ago, but if we want to switch to a canonical source, this might be the easiest?
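In sketch form, the cooked.csv step boils down to this (the column names are guesses at the GBIF dump's headers, so adjust to the real ones):

```r
# Sketch of the cooked.csv prep: paste genus and species into a binomial,
# then keep only rows whose name is in the accepted-names list.
# Column names (genus, specificepithet) are guesses, not the real headers.
make_cooked <- function(gbif, accepted) {
  gbif$species <- paste(gbif$genus, gbif$specificepithet)
  gbif[gbif$species %in% accepted, ]
}
```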
Reproducibility on this is so tricky because all the intermediate files from GBIF are so giant, but I can sort it out...
Hope that helps. Let me know if it's useful to Skype and work through these issues faster.
This really did help, thank you. I knew I was missing something, and this was it.
I'm now running the data prep step from scratch, having got the individual bits to work. Once I've done that, I'll see about plugging in your analysis scripts.
I've had trouble with the file sizes as well, but fingers crossed this is getting there. I think I will make the thing check, or at least warn the user, that they need to have 100Gb of hard drive space minimum to run the damn thing...
I've now run what, I hope, is enough to get this merged together.
The things to note here are:
rake stuff. Rakefile contains what I think should happen in order to run your analyses. The general pattern is to spin up R, load all your R files, and then run whatever function I think it is that you were using. Hopefully this helps? Again, if not, let me know.
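For the record, the pattern in the Rakefile is along these lines; the task name, file paths, and entry-point function here are placeholders, not the real ones:

```ruby
# Sketch of the Rakefile pattern (all names are placeholders): the task
# depends on the prepped data, then spins up R, sources every R file,
# and calls whatever the entry-point function turns out to be.
task :gam_analysis => ["raw_data/cooked.csv"] do
  sh %(Rscript -e 'for (f in list.files("R", full.names=TRUE)) source(f); run_gam_analysis()')
end
```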
It looks like you're using the output from a run of Dylan's code to find species mismatches using regexp.
That's what we did last time, and there is still code to do that in the repo, but it could be removed now (e.g. the function gbif_tpl). (It doesn't even work with the new and bigger GBIF file.)
If you look at the get_gbif function, it just uses the synonymy file to get a list of unique accepted species and directly filters the GBIF file without doing any synonym replacement. I think this makes sense for the GAM analysis, but I could be convinced otherwise. If we don't do it, we conflate the latitudinal distribution of bad names in GBIF with the latitudinal distribution of trait and genetic data among good names. Is that reasonable?
I'm working on getting rake to work on the cluster, and will try to remove all the old R code so we only have what is currently in use.
Hooray for a merge! :D
I'm starting another issue to talk about synonymy...
I'm trying to get the data download stuff I did in the other repo here. The reason for this is I feel like we want to do it eventually (right?) and I realised I could be making the raster plots I discussed in #3 using totally different data.
Is this something you want to do? If you can tell me where the TPL synonymy lists and GBIF dumps you're working with in your code come from (I'm guessing Dylan's list and what I grabbed, respectively) then I'm happy to make all the changes and you can review everything.