traitecoevo / sTEP_overlap


Merging data download with data manipulation #5

Closed by willpearse 6 years ago

willpearse commented 6 years ago

I'm trying to get the data download stuff I did in the other repo here. The reason for this is I feel like we want to do it eventually (right?) and I realised I could be making the raster plots I discussed in #3 using totally different data.

Is this something you want to do? If you can tell me where the TPL synonymy lists and GBIF dumps you're working with in your code come from (I'm guessing Dylan's list and what I grabbed, respectively) then I'm happy to make all the changes and you can review everything.

willpearse commented 6 years ago

Don't just click merge without reading the above, as otherwise I have a horrible feeling I'll just mess up what you've already done.

wcornwell commented 6 years ago

I had to do the gbif data prep out of the remake workflow because the giant cached objects were causing me to overflow my storage limit on the cluster. It was also really awkward handling the paths correctly given the cluster weirdness. All I did after your steps was to filter to only rows that matched TPL accepted names... Will have a look

willpearse commented 6 years ago

Right, that makes sense. I think we ought to drop remake for this, to be honest, as the files are so huge. rake is basically doing the same thing anyway.

I like what you did with the if(linux)-esque bits. If you can make it load where the file "should" be if it's not on linux, then I'm pretty sure I can hack up a way around the problem.

As I said on email, if this isn't helpful to do, then honestly just tell me to sod off. I would love to get this done, and I don't want to be giving the appearance of being helpful while I'm really just irritating you :D

wcornwell commented 6 years ago

Cool. I think I understand your rake thing but it's pretty new to me.

Maybe if you get it working on your machine on a branch, then I can merge it back in once it's fully working for you?

willpearse commented 6 years ago

Right. The problem I'm having is I don't know what your files map onto, which makes it tricky for me to get it working.

Anyway, I'll plug away a bit more and see what I can figure out, then I'll ping you again. Thanks Will!

wcornwell commented 6 years ago

Ok, there are still some obsolete functions in there from the previous iterations. I am traveling today but can do some organizing tomorrow...

wcornwell commented 6 years ago

OK. Dug back into this. Basically this got a bit messy because I can use rake on my local machine fine, but I don't yet have the correct permissions/modules installed on the cluster. I can sort that out if we go completely to that workflow for reproducibility... certainly it would reduce the number of necessary hacks.

As for processing "../clean_data/gbif_spp_clean.txt" to produce "../../../srv/scratch/z3484779/overlap-data/raw_data/cooked.csv", all I did was stick the genus and species names together and filter to only accepted species names.
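In code terms, the step above amounts to something like this (a base-R sketch with toy data; the real column names and files may well differ):

```r
# Sketch of the "cooking" step: paste genus and species into a
# binomial, then keep only rows whose binomial is an accepted name.
# Toy data and column names are placeholders for the real files.
gbif <- data.frame(genus   = c("Quercus", "Fakeus"),
                   species = c("robur", "nonexistens"),
                   stringsAsFactors = FALSE)
accepted_names <- c("Quercus robur")  # would come from the synonymy list

gbif$binomial <- paste(gbif$genus, gbif$species)
cooked <- gbif[gbif$binomial %in% accepted_names, ]
# cooked now holds only the accepted-name rows, ready to write out
```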

For filtering to accepted species, it looks like I was using a file I made from Beth's original scraping a few years ago, but if we want to switch to a canonical source, this might be the easiest?

Reproducibility on this is so tricky because all the intermediate files from GBIF are so giant, but I can sort it out...

Hope that helps. Let me know if it's useful to Skype and work through these issues faster.

willpearse commented 6 years ago

This really did help, thank you. I knew I was missing something, and this was it.

I'm now running the data prep step from scratch, having got the individual bits to work. Once I've done that, I'll see about plugging in your analysis scripts.

I've had trouble with the file sizes as well, but fingers crossed this is getting there. I think I will make the thing check, or at least warn the user, that they need to have 100Gb of hard drive space minimum to run the damn thing...
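Something like this would do for the check (hypothetical, not in the repo yet, and unix-only since it parses `df` output):

```r
# Hypothetical pre-flight check: warn if the drive holding `path` has
# less than `required_gb` free. Parses POSIX `df -Pk`, so unix-only.
check_disk_space <- function(path = ".", required_gb = 100) {
  out <- system(paste("df -Pk", shQuote(path)), intern = TRUE)
  # df -Pk fields: filesystem, 1024-blocks, used, available, ...
  free_kb <- as.numeric(strsplit(trimws(out[2]), "\\s+")[[1]][4])
  if (is.na(free_kb) || free_kb / 2^20 < required_gb)
    warning("Need at least ", required_gb,
            "Gb free to run the GBIF prep; found less (or couldn't tell).")
  invisible(free_kb)
}
```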

willpearse commented 6 years ago

I've now run what, I hope, is enough to get this merged together.

The things to note here are:

  1. Throughout your code I've tried to replace references to things with references to what I think are the same files in the output from the rake stuff.
  2. The end of the Rakefile contains what I think should happen in order to run your analyses. The general pattern is to spin up R, load all your R files, and then run whatever function I think it is that you were using.
  3. It looks like you're using the output from a run of Dylan's code to find species mismatches using regexp. I feel like that's not what we agreed to do (i.e., it's name checking), but then again I have now made a list of things in GBIF and TPL, so if you want it then let me know and I'll add the matching step into the code too.
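To make point 2 concrete, each of those rake targets boils down to R doing roughly this (`run_analysis` is a placeholder, not necessarily the real entry point):

```r
# Roughly what each rake task asks R to do: source every R file in the
# R/ directory, then call an entry-point function.
r_files <- list.files("R", pattern = "\\.[Rr]$", full.names = TRUE)
for (f in r_files) source(f)
if (exists("run_analysis")) run_analysis()  # guard so the sketch runs anywhere
```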

Hopefully this helps? Again, if not, let me know.

wcornwell commented 6 years ago

> It looks like you're using the output from a run of Dylan's code to find species mismatches using regexp.

That's what we did last time, and there is still code to do that in the repo, but it could be removed now (e.g. the gbif_tpl function). (It doesn't even work with the new and bigger GBIF file.)

If you look at the get_gbif function, it just uses the synonymy file to get a list of unique accepted species and directly filters the GBIF file without doing any synonym replacement. I think this makes sense for the GAM analysis, but I could be convinced otherwise. If we didn't do it, we'd conflate the latitudinal distribution of bad names in GBIF with the latitudinal distribution of trait and genetic data among good names. Is that reasonable?

I'm working on getting rake to work on the cluster, and will try to remove all the old R code so we only have what is currently in use.

willpearse commented 6 years ago

Hooray for a merge! :D

I'm starting another issue to talk about synonymy...