Where to go next and default columns

AshleyWoods commented 6 years ago

Now that the file download is running all the functionality discussed in our last meeting has been implemented. That leaves the question of what to do next.

I also wanted to ask about the choice of default columns. I picked the ones I thought may be the most useful, but I am unsure if I added ones that should be left off or left off ones that should be added. Could you look at the list of columns and the list of ones I picked to be the defaults? (They're all typed out in the code and I REALLY don't want to have to type them all again.) There are 235 columns total and I have picked out 29.

dmcglinn commented 6 years ago

hey @AshleyWoods here is a small set of polishing things todo on the app:

[x] downloaded file needs .csv on the end of its file name.
[x] a progress bar or at least a "the app is thinking indicator" would be lovely - I think these have been integrated in other shiny apps that exist online. Maybe even better the app could say waiting for query to complete, downloading or importing depending on what stage it was in.
[x] add ability for user to specify download key rather than creating a having to wait for a new query, see examples in ?occ_download_import
[x] move the column id specification out of the code and into a .csv file which will be easier to maintain and update. Just use R to read in the table and then allow uses to still specify their choice. You could decide to add a metadata column that describes each field as well. I bet GBIF has something like this we could borrow. Here is a first attempt at a table: https://www.dropbox.com/s/h3sbzpsd5nw0sbu/gbif_fields.csv?dl=0
[x] see documentation of occ_get which provides a minimal set of column ids to return. A small set is a good idea so that datasets don't get too huge.
[ ] rename repo something like rgbif_app because SOAR is a bit to generic for what our app actually does.
[x] add the rgbif documentation to the app on how to specify your GBIF username and password in your computer's enviornmental variable - this will be better for user security. See ?occurrence_download for instructions on this.
[x] when rendering the data only render the small subset of column ids the user selects.

On a broader scope we need to determine where we want to go next with this work. We have discussed a few different options here that I've listed below:

[ ] provide a suite of tools for down-sampling data dense areas so that more representative examinations of species occurrences across the globe can be carried out. This would effectively be allowing the user to specify a grid cell size and then a minimum cell density of occurrences to standardize the results to. The idea here is to correct for sampling biases and also to try to get closer to true species absences in the dataset.
[ ] carry out an example regional scale analysis on a few taxonomic groups and compare how gbif compares to local DNR datasets - this would be like a proof of concept of the feasibility of this approach for local DNR managers.
[ ] the shiny app wallace provides a means of carrying out species distribution modeling using similar data but not biodiveristy analyses. Your app could target biodiversity analyses which focus on numbers of species and number of total occurrences.

AshleyWoods commented 6 years ago

I'm down to just moving the specifications of the columns over to a .csv file and changing the name of the repo(on the list of small things). I am unsure of how to change the name of the repo as when I attempted it the app broke because it no longer had the right path to any of its files. I'm not sure I did it correctly though.

AshleyWoods commented 6 years ago

I am actually unsure of how to move column specification over to a .csv file either. However, I have picked the "broader scope" direction I'd like to pursue. I think tools for downsampling would be a good thing to have.

dmcglinn commented 6 years ago

Hey @AshleyWoods were you able to download the csv file of the field names I created? Essentially my idea was that you would simply add columns to that csv file for different sets of default sets of fields. A given set would specify which fields to include as those fields with 1's rather than 0's. Also a useful field to add to this csv file would be metadata describing what each field represents. This has to exist somewhere on the GBIF website or you should email them to find out where these are specified. Can you please describe exactly what the problem is with the csv file approach?

No worries on renaming the repo. I will do that with you when I get back to town. With regards to breaking the app or not - have we discussed how to use branches in git? Branches can be very helpful for giving you the freedom you need to make changes to the code but still ensure the app is working on the master branch. Many repos maintain a master and dev branch for this purpose, slowly merging dev to master when they are sure features are working.

Very cool on your interest to try to tackle geographic sampling bias. Please start combing the literature for this topic in the field of species-distribution modeling where I think the most has been written. I'll also try to send you a few key papers on this topic. Then we can start to brainstorm how to best do this.

AshleyWoods commented 6 years ago

That approach to the csv makes much more sense than what I was trying to do. Thank you! And as for the down-sampling, I found this package: https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/downSample I think it would be good to include this as a "quick down-sample" option and have a second place for input where people could specify grid cell size and minimum occurrences as more of a custom option.

dmcglinn commented 6 years ago

The nature of that function I think is kind of what we want but we need something to apply to a landscape of coordinates.

Here are a few papers, I'll update this list as I encounter more - please move these to the wiki in time: Concepts papers

Graham et al. (2004)
Boakes et al. (2010)
Meyer et al. (2016)

Methodological papers

Fourcade et al. (2014)
Boria et al. (2014)

AshleyWoods commented 6 years ago

I've added them to the wiki and I'll read over them. How do you create a branch? I'm trying to implement the .csv code and I don't want to break anything.

AshleyWoods commented 6 years ago

Nevermind, I believe I just found it.

AshleyWoods commented 6 years ago

I finished implementing it, but I am unsure of how to merge the branch back into the master branch.

dmcglinn commented 6 years ago

So the ideal way this would be done is for you to push that branch to GitHub and then us GitHubs pull request mechanism to provide a public review of your merge. This gives me an opportunity to provide code review. This may have been such a small change you don’t want a review so in that case you will just merge the branch into master locally and then push master to github.

Pull request option:

While on your branch

git push origin my_new_branch

Then go to github and click on your new branch there should be a pull request option then

Local merge option:

git checkout master git merge my_new_branch git push origin master

Dan

On Jun 27, 2018, at 11:08 PM, AshleyWoods notifications@github.com wrote:

I finished implementing it, but I am unsure of how to merge the branch back into the master branch.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

dmcglinn commented 6 years ago

One thing to note about the pull request option described above is that if you take this approach and we merge your new_branch into the remote master branch that your local master branch will be out of sync with the remote branch. You can resync them by doing the following:

git pull origin master

For completeness a work flow that includes forking and pull requests (the typical collaborative setup for using github) can be carried out this way (these instructions are for the repo MoBiodiv/mobr but they can be generalized to any repo):

1) Fork the repo to your local GitHub account

2) Clone your forked version of the repo to your machine

git clone git@github.com:your_user_name/mobr.git

3) Link your local repo back to the master on MoBiodiv

git remote add upstream git@github.com:MoBiodiv/mobr.git

4) Create a branch for your changes

git branch new_function

5) Checkout your branch

git checkout new_function

6) Make your commits on that branch and when you are done push it to your forked copy of the repo

git push origin new_function

7) Submit a pull request on the GitHub website by going to your forked copy of the repo and clicking on the pull request button

8) After your changes are merged with master you'll want to merge that update to master with your copies as well.

git pull upstream master
git push origin master
# delete your branch as its no longer needed
git branch -d new_function

Before your start work on the project in the future you'll want to repeat step 8 so that your version of the repo does not become out-of-sync with the main repository.

dmcglinn commented 6 years ago

I was thinking more about our vision for moving the app forward and it does seem like the literature has quite a bit of methods described for how to detect bias in the datasets. Maybe the simplest first step is to provide the user tools to detect bias in the chunk of data they extracted. Then the burden can be on them to decide what to do about it. Let me know what you think about this idea.

AshleyWoods commented 6 years ago

I like that idea. It also allows us to avoid altering the data in a way that the user doesn't want or like. (like you mentioned with the p values some programs give)

AshleyWoods commented 6 years ago

Oops, didn't mean to close the issue.

AshleyWoods commented 6 years ago

I am having issues finding more literature on the subject of correcting bias (the papers we have linked seem to have the most common/useful methods) and cannot seem to find any at all for detecting bias. All that comes up are papers that say "we need an easy way to detect bias" but offer no real solution.

mcglinnlab / soar

Where to go next and default columns #14