inbo / tutorials

A collection of technical tutorials for INBO (and anyone who's interested)
https://tutorials.inbo.be/
Creative Commons Attribution 4.0 International
22 stars 9 forks source link

Tutorials needed on connecting to and using INBO data from R #37

Open florisvdh opened 6 years ago

florisvdh commented 6 years ago

INBO hosts many databases, so tutorials are needed for scientific and scientific support staff on how to use them from within R.

These tutorials should focus on:

There is an urgent need for the following INBO databases:

Further, the need is also urgent for these databases:

This list is most probably incomplete. (So having an overview of data is another, related need.)

stijnvanhoey commented 6 years ago

With respect to GIS-data, the available dbase of mercator-vonet should be in scope ass well. I cite here the information provided by Carine Wils:

INBO netwerk connecteren naar de nieuwe Read only databank van MercatorNet (= PostGIS databank). Daarin zitten een groot deel van de (vector)lagen van AIV (vroegere AGIV), maar ook van ANB – INBO – dep OMG (ex-dLNE+dRV) – OVAM – MOW – Erfgoed – VMM, ed. Het gros van deze lagen is ook beschikbaar op onze S-schijf, maar dan in de vorm van compressed FileGDB's, dus niet zo handig voor niet-ArcGIS gebruikers.

Toegang tot deze databank is mogelijk via je milieuinfo account (diegene waarmee je inlogt op JIRA), maar je moet daarvoor wel een mail sturen naar help@milieuinfo.be met vraag om toegang tot de RO databank Mercator.

Als extra ondersteuning voor de gebruikers zijn er van elke dataset op de nieuwe databank connectie/legende bestanden aangemaakt (LYR en QLR). Deze zijn te vinden op de MercatorNet website: https://www.mercator.vonet.be/intern-ontsloten-data-lyr-qlr en zijn hier: https://github.com/MercatorNet/MercatorUpload in één keer te downloaden.

Meer info op deze website: https://www.mercator.vonet.be.

Voor de geïnteresseerden, de connectieparameters zijn als volgt:

• Database: mercator_ro_productie
• User: <milieuinfo user productie>
• Password: <eigen paswoord>
• Instance: 5613
• Hostname: db-pr.mercator.vonet.be
peterdesmet commented 6 years ago

Just tested DBI/odbc and rodbc for a project:

Best for INBO would be to provide examples that can be added to the connections pane in RStudio. Or bundle it into something that can easily be installed for everyone?

florisvdh commented 6 years ago

With respect to odbc and DBI, some information from two talks at useR!2017 can be found here. (google doc; will jump to the odbc talk; DBI is next)

stijnvanhoey commented 5 years ago

at https://inbo.github.io/tutorials/tutorials/r_database_access/ a introduction is provided about R access to databases

florisvdh commented 5 years ago

Woohaa! This rocks. I'll certainly test this further, for now I'll propose a few tweaks here or there.

ElsLommelen commented 5 years ago

Sorry for a late response, but I only now noticed this issue.

With respect to RODBC and DBI, our styleguide recommends to use RODBC to connect to SQL Server and Access, and DBI to connect to other databases. For me as a Windows user, this seems the right guideline to follow: the odbc-driver that is used with DBI for SQL Server on Windows causes a lot of difficulties with special characters. (Apparently Linux machines have a better driver and do not have these problems.)

As @peterdesmet already mentions: the documentation in RODBC is better, and to be honest: I think the error handling in DBI is terrible. Often I have to paste my query in SQL Server Management Studio to find out what is wrong.

For databases without any special characters, it could be ok to suggest to use DBI (e.g. advantage to use dbplyr), but otherwise I would recommend to test first in a Windows environment if all characters can be imported well. The majority of the INBO users has Windows, so I think choices should mainly be based on this OS.

hansvancalster commented 5 years ago

A way to circumvent problems with special characters, yet still use DBI is given in this function. The input parameter scient_name in that function often contains special characters. This parameter is passed on to glue_sql with {} and afterwards converted to the correct encoding with iconv before passing it to dbGetQuery(). The problem was solved by @Jo-Loos.

ElsLommelen commented 5 years ago

As you know, I knew this solution, and it works indeed for some special characters, but unfortunately not for all (e.g. characters for male, female, <= in one character,...). (The examples indeed have an alternative and can be replaced in the database, but there will probably pop up others without alternative that give the same problem.)

But what actually worries me, is that this all makes it more complicated for starting or occasional R programmers to use DBI. The whole process of connecting to a database and running a query is already complicated, and then they have to add another statement to translate these characters that are not presenting well,... Furthermore, we already searched for weeks to find this (partial) solution, they might already search for some time before they figure out it is not their mistake (which can be demotivating) and ask for a solution.

So my main concern was not if there is a solution, but: if we know there are some difficulties with DBI that will make things complicated for less experienced users and there is an alternative (RODBC), will we still recommend DBI for using with SQL Server? What is actually the advantage of using DBI over RODBC? Except for the presence of a connection pane in RStudio, I don't see any arguments to choose for DBI in the discussion of this issue.

So please convince me with some good arguments why DBI is prefered over RODBC.

(I don't know why @ThierryO mentioned RODBC as a recommended package for SQL Server in the styleguide. It seems he wanted to promote DBI (as he did for most database types), but for some reason he made an exception for SQL Server and Access.)

hansvancalster commented 5 years ago

I was just pointing to that solution because it matters a lot where in your code the conversion to the correct encoding is done. For instance, encoding the input parameter and then passing it to glue_sql will not work.

I don't have strong opinions, about DBI or RODBC, but I find exploring the database via the connection pane a big plus and it is considered the "best in class" package by rstudio.

florisvdh commented 5 years ago

Something else to consider here is that the tidyverse package dbplyr is designed to work with DBI connections (see next coding club).

florisvdh commented 4 years ago

Fast forward to 2020: we now have the inbodb package to address some challenges! It would be good to have a pointer to inbodb in the tutorials website, both in the database-R tutorial and in the overview article of INBO software. @ElsLommelen can you give your opinion? (you're best placed to implement this as well)

Also the INBOVEG tutorial will need some updating @ElsDeBie

Beside that, I think we should further investigate / specify the need for functionality for specific databases, i.e. database-specific queries and preprocessing through R functions. This relates to the second sub-part of this issue, i.e. "best practices using each specific database".

ElsLommelen commented 4 years ago

Some ideas for the database-R tutorial:

INBO software: I'll add inbodb and a short description

I'll start working on a PR with the first idea for the database tutorial and the software and I'll wait on some more opinions for the other ideas. Maybe it would be better if someone with experience on it would add something on dbplyr (if necessary)?

Maybe it is a good idea to involve @hansvancalster in this discussion as well? (For the tutorials on Florabank and INBOVEG)

ElsLommelen commented 4 years ago

Considering the INBOVEG tutorial: maybe it better fits as a vignet in inbodb?

florisvdh commented 4 years ago

Fine ideas @ElsLommelen !

maybe change the url https://inbo.github.io/tutorials/tutorials/r_database_access/

perhaps r_database_sqlserver ? MS Access better goes into its own tutorial.

is there a tutorial on dbplyr, or should this be added here?

You could simply refer to https://dbplyr.tidyverse.org/index.html

split it in 2 (or 3) tutorials: one on a connection and one on queries

If you would do that, I'd prefer two tutorials rather than three, and which should clearly refer to each other at the beginning. Otherwise, the interconnection gets lost (the website doesn't provide tutorial-hierarchy) and in such case I think it's better to keep it together as one.

ElsLommelen commented 4 years ago

perhaps r_database_sqlserver? MS Access better goes into its own tutorial.

Because the general use is similar for all database types, I am not sure the best option is to give each database type its own tutorial.

Also, maybe (part of) the database tutorial should better be part of a vignet in inbodb... I feel the whole idea should have some more considerations. For now I will make some minor changes to direct people to the right package, and I will have a closer look at it later on.

florisvdh commented 4 years ago

I am not sure the best option is to give each database type its own tutorial

Sure, common parts better aren't duplicated.

Regarding the tutorials-website vs. vignette, I leave such preference to you, it is not that obvious to me what is best. It's certainly a good thing to have self-contained help within the package. If the tutorial (or parts of it) were to move to the inbodb package, I suggest to still leave a 'tutorial' here that directs to the inbodb package for more information.

ElsLommelen commented 4 years ago

Part of the tutorial describes methods not using inbodb, so this will stay on the tutorial website anyway. Other parts can directly link to the relevant inbodb vignet.

ElsLommelen commented 4 years ago

Another issue with inbodb: where does it fit best on the INBO software website? It certainly belongs to retrieve data, but it covers partly the technical aspect (connecting to databases) and it partly queries databases. For now it only queries databases with biological data (and is put in this section), but as part of the INBO databases contain environmental data, the package inbodb may in future contain queries on environmental data as well. Considering all this, where does it fit best?

hansvancalster commented 4 years ago

Maybe in the table add a row retrieve data: general, or do not make that distinction in the table and just lump them in one row retrieve data? In the remainder of the article, the distinction is useful (and I guess also in the table, so I am in favor of adding a row).

florisvdh commented 4 years ago

I agree with Hans to add a row Retrieve data: general, above the current ones on that topic. You can then add a new subtitle in the text, under 'Retrieve data' and put inbodb there.

ElsDeBie commented 4 years ago

Are there any good sites about 'how to write a vignette'? I quickly searched and find this: https://kbroman.org/pkg_primer/pages/vignettes.html http://r-pkgs.had.co.nz/vignettes.html For tutorials, we have our: https://inbo.github.io/tutorials/create_tutorial/. I can imagine that we have to write more vignettes, perhaps it's also useful to make a tutorial for this? Or are there generel R rules?

florisvdh commented 4 years ago

Hi @ElsDeBie, I found Hadley Wickham's book very helpful; see the vignettes chapter: http://r-pkgs.had.co.nz/vignettes.html

ElsLommelen commented 4 years ago

Euh, no idea. Easiest to generate a vignette is using the function usethis::use_vignette() and specify a name for your vignette. (This function deals with all settings and the heading, you only have to push these changes to github, and of course add content to your vignette.) That's about it for the technical part, I think.

And for the content, to me it seems the same as any other text: focus on a (potential) user group for your package and write down what they should really know to use your package. I generally start with writing the goal of the vignette (and the focal user group), and try to use a lot of examples.

florisvdh commented 4 years ago

Hm, I should have looked better, you've listed r-pkgs already :wink: Anyway, thanks for the tutorial of Broman, seems interesting - didn't know it.

ElsLommelen commented 4 years ago

@ElsDeBie In the link by Hadley Wickham, just replace package name devtools with usethis if the function is unknown: some functions have moved to usethis some years ago, and for some months these functions are effectively removed from devtools. Unfortunately a lot of documentation (even the develop cheatsheet) is not yet updated with these changes.

florisvdh commented 4 years ago

There is a 2nd edition of the book under development at https://r-pkgs.org. It seems that the authors did already make updates to the vignettes chapter at https://r-pkgs.org/vignettes.html.

ElsDeBie commented 4 years ago

Els Lommelen: > Another issue with inbodb: where does it fit best on the INBO software website? It certainly belongs to retrieve data, but it covers partly the technical aspect (connecting to databases) and it partly queries databases. For now it only queries databases with biological data (and is put in this section), but as part of the INBO databases contain environmental data, the package inbodb may in future contain queries on environmental data as well. Considering all this, where does it fit best?

--> Is it an idea to put between brackets the databases mentioned in this package? Otherwise users can be lost if they don't no where to search for ..."Retrieve data: general | inbodb (technical support, Inboveg, FLoradatabank)". Or maybe beter to put a third column with 'concerning databases'

https://inbo.github.io/tutorials/articles/inbo_software/

florisvdh commented 4 years ago

@ElsDeBie Adding the explicit databases has both pros (more informative) and cons (needs maintenance). @ElsLommelen what do you prefer?

Adding the explicit databases (in the description) was actually her first approach; I suggested generalization :thinking: (see https://github.com/inbo/tutorials/pull/171#discussion_r447543421).

Regarding the form: I would not insert package-specific information in the table, as the table intends to give a very global overview. Package-specific information is given in the text below. @ElsDeBie Maybe you can propose a better category name than 'Retrieve data: general', that would be a better fit ?

Still another possibility could be to repeat the package under more than 1 category in the table.

ElsLommelen commented 4 years ago

@ElsDeBie I agree with @florisvdh and I would rather prefer to not mention specific databases for maintenance reasons. When adding new databases to inbodb, it is easily added in for instance the readme of inbodb itself, but keeping the tutorials website up-to-date as well (and especially remembering to do so whenever we make changes in inbodb), will require an additional effort. On the other hand it would be nice to name all these specific databases (inboveg and florabank for now) in the readme of inbodb, so visitors will immediately have this information while using the link in the tutorials website.