ropensci / neotoma

Programmatic R interface to the Neotoma Paleoecological Database.
https://docs.ropensci.org/neotoma

Using get_download() comment #139

Closed: andydawson closed this issue 10 years ago

andydawson commented 10 years ago

I find it a bit confusing that when there is only a single site returned, you still need to index into the first element. For example, if I do

foo <- get_download(datasetid=1665)

Then when I do names(foo) I get NULL. Also, when I just try to see the object by typing foo, I don't see anything. Not a huge deal, but I use Neotoma and R all the time and was a bit confused by this, so I thought I'd comment. Maybe, if there is something in the list but printing is being suppressed, we could print a message instead? Something like

Printing suppressed due to size of object

Just a thought.
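
A minimal sketch of that suggestion (a hypothetical helper, not the package's print method): report that printing was suppressed instead of showing nothing.

print_with_notice <- function(x) {
  ## x is assumed to be the list returned by get_download()
  cat("Printing suppressed due to size of object;",
      "the list contains", length(x), "download(s).\n")
  invisible(x)
}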

SimonGoring commented 10 years ago

Yes, thanks for making this an issue. I know it's a problem, but I'm not entirely sure how to fix it.

gavinsimpson commented 10 years ago

There is a bug here in the single download case:

> foo[[1]]$metadata$site.data$sitename
[1] "Lost Lake (US:Michigan)"

but the print method doesn't show a site name:

> foo
A single download object for site 
Accessed 2014-09-16 19:56h.

This is because the print method accesses the site name using CamelCase:

x[[1]]$metadata$site.data$SiteName

Perhaps it is time to harmonise the naming of components. The call above illustrates three different approaches to naming: a single lowercase word (metadata), lowercase words separated by a period (site.data), and CamelCase (SiteName).

@andydawson What would you like to see printed? The print method seems to work reasonably similarly for both single and multiple downloads. Something is printed in the example you give. We could easily add an id or similar as the names attribute of the download list for example. Anything else?
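
A one-line sketch of the names idea, for illustration only (the field used to build the names, sitename, is just one option):

names(foo) <- sapply(foo, function(x) x$metadata$site.data$sitename)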

SimonGoring commented 10 years ago

I agree with standardizing the object names, although metadata is properly one word, so the period-separation convention shouldn't be a problem for it. I prefer all lowercase, with words separated by periods, so I suppose that's the next find & destroy.

I think the problem that @andydawson is indicating is the fact that:

onesite <- get_download(7203)
onesite

returns a site, and the print method shows:

A single download object for site 
Accessed 2014-09-16 21:12h. 

but

> onesite[[1]]
Download from Marion Landfill.
Accessed on 2014-09-16 21:12h.

and to actually get at the data for the single site in onesite you have to index the first list element, even though there's only one download in the list.

gavinsimpson commented 10 years ago

We could add a drop = FALSE argument to get_download(). The rationale behind having a single download act like a multiple-download object of length 1 is practicality: one can write code around such an object and assume it is a list with each element a download object. I see this as a feature. In vegan, for example, we had the scores() methods return a list of scores if you asked for multiple types of scores; if you asked for one type you got a data frame, not a list containing that data frame. This always caught me out, and I'm one of the maintainers of that package(!), because it was inconsistent.

However, I see the utility in making it easier, or at least more user-friendly. Hence drop: if set to drop = TRUE we'd return the single download itself, not a list of downloads. Would that make more sense?
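
For illustration, a sketch of how drop could behave, written here as a hypothetical wrapper rather than the change to get_download() itself:

get_download_drop <- function(..., drop = FALSE) {
  res <- get_download(...)
  if (drop && length(res) == 1L) {
    ## unwrap the single download from the enclosing list
    res <- res[[1]]
  }
  res
}

With drop = TRUE and a single dataset id, this would hand back the download object directly rather than a list of length one.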

SimonGoring commented 10 years ago

Right, that sounds good. I actually just tried to fix the problem by pulling the single download out from the list and realized it causes a whole bunch of problems in the other methods, for exactly the reason you describe. It's nice that my lazy coding is a feature.

So the trade-off is that it's not entirely intuitive when a single site is returned, but the code behaves the same whether one or multiple sites are returned, without a bunch of exception catching.

@andydawson, what do you think?

andydawson commented 10 years ago

Yes @SimonGoring, that is what I was referring to. I do get the printed statement indicating that something was downloaded, and am fine with that (except for the missing sitename, which @gavinsimpson pointed out). I wonder if it would help to only suppress printing when the object is really big. If I could see the object, then there would be no issue. Or maybe there is a flag that I missed somewhere?

I agree that standardization would be good. I like all lower case.

gavinsimpson commented 10 years ago

@SimonGoring I notice another inconsistency here:

> class(foo)
[1] "download" "list"
> class(foo[[1]])
[1] "download" "list"

foo really shouldn't be the same class as the objects it holds. How about calling the outer object, the list containing the downloads, class download_list? We'd need to add a new print method and check this doesn't break other code down the line.
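
A sketch of that idea (not the package implementation): reclass the outer list and give it its own print method.

as_download_list <- function(x) {
  ## x is the list of download objects returned by get_download()
  class(x) <- c("download_list", "list")
  x
}

print.download_list <- function(x, ...) {
  cat("A download_list containing", length(x), "download objects\n")
  invisible(x)
}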

gavinsimpson commented 10 years ago

@andydawson There's a problem with special-casing the printing of a single download: the object is big, even in cases where there aren't many proxies. Try print(unclass(foo[[1]])) for example.

So the question is, what data should we display? I'm talking here about the single download. We could also provide a str() method that limits the printed information to the top level only:

> str(foo[[1]], max = 1)
List of 6
 $ metadata    :List of 5
 $ sample.meta :'data.frame':   33 obs. of  11 variables:
 $ taxon.list  :'data.frame':   46 obs. of  5 variables:
 $ counts      :'data.frame':   33 obs. of  46 variables:
 $ lab.data    : logi NA
 $ chronologies:List of 1
 - attr(*, "class")= chr [1:2] "download" "list"

We could also provide summary() methods if more detail is required beyond the print method.

It would be good to scope out exactly what info would be desirable in each case so we can think about how to display this to the user.
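
For instance, a hypothetical summary() method along these lines, using the component names visible in the str() output above (the fields shown are illustrative only):

summary.download <- function(object, ...) {
  cat("Site:", object$metadata$site.data$sitename, "\n")
  cat("Samples:", nrow(object$sample.meta),
      " Taxa:", nrow(object$taxon.list), "\n")
  invisible(object)
}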

andydawson commented 10 years ago

OK, the objects are pretty big :). Printing an entire object is not helpful after all...

I'm not sure additional methods are necessary; do you think people would use them? If it's too much trouble to make the object such that new_foo would be equivalent to the current foo[[1]] when only a single site is downloaded, then I can live with it as is (with the sitename bug fixed eventually).

gavinsimpson commented 10 years ago

@andydawson

If it's too much trouble to make the object such that new_foo would be equivalent to the current foo[[1]]

We could do this, but the only way I'm aware of would be to attach an extra class to the object when it contains only a single download. Then we'd need to write methods for [, [[ and $ (at least) which fake the fact that you are working with foo[[1]], not foo, when you subset. This is a chunk more work, but probably nothing more than

`[.download_single` <- function(x, i) {
    class(x) <- class(x)[-1] ## assuming c("download_single", "download", "list")
    x[i]
}

and so on. We would also need [<- and $<- methods if we expect/allow people to assign into these objects.
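
For illustration, a hypothetical $ method in the same spirit would forward extraction to the single download held in the list:

`$.download_single` <- function(x, name) {
  ## drop the classes so default list extraction applies, then reach
  ## into the first (and only) download
  unclass(x)[[1]][[name]]
}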

However, this is still inconsistent for the user; whilst the object quacks like a duck it isn't one.

I wonder if the above is worth it when we can just instruct users to do

foo1 <- foo[[1]]
foon <- foo[[n]] 

to get the 1st or nth download object.

andydawson commented 10 years ago

I think it's better to be consistent and instruct users :). And the code would be less hacky because there wouldn't be special cases.

SimonGoring commented 10 years ago

Yes, @gavinsimpson, that's sort of where I was going when thinking about how to implement it, and I was getting mighty sad. I think for now it's best just to instruct users, as @andydawson says. It's not as elegant from a user's point of view, but it's consistent.

SimonGoring commented 10 years ago

Okay, at this point I think we've largely cleared this up, except for issue #141 that @gavinsimpson raised, which is great.

Right now the addition of separate download and download_list classes splits the printing into two methods, so that print.download_list returns:

> aa <- get_download(100:103)
API call was successful. Returned record for Site 30 (Swain unpublished)
API call was successful. Returned record for Site 31 (Swain unpublished)
API call was successful. Returned record for Site 32 (Swain unpublished)
API call was successful. Returned record for Site 33 (Swain unpublished)
> aa
A download_list containing 4 objects:
Accessed from 2014-09-27 20:33h to 2014-09-27 20:33h. 
Datasets:
 dataset.id                   site.name   long   lat age.younger age.older
        100 Site 30 (Swain unpublished) -91.35 46.35          NA        NA
        101 Site 31 (Swain unpublished) -91.72 46.05          NA        NA
        102 Site 32 (Swain unpublished) -91.80 46.22          NA        NA
        103 Site 33 (Swain unpublished) -91.58 45.63          NA        NA

and the implementation of print.download returns:

> aa[[1]]
A download object for Site 30 (Swain unpublished)
Accessed 2014-09-27 20:33h. 

For print.download_list we get the dataset.id and site.name, along with age ranges for the samples.

This brings us to the point about an extractor for the metadata, which is moot (I think) now that the naming has been standardized. The metadata is really a dataset, so the extractor for it is get_dataset, which you can use like this:

> aa
A download_list containing 4 objects:
Accessed from 2014-09-27 20:33h to 2014-09-27 20:33h. 
Datasets:
 dataset.id                   site.name   long   lat age.younger age.older
        100 Site 30 (Swain unpublished) -91.35 46.35          NA        NA
        101 Site 31 (Swain unpublished) -91.72 46.05          NA        NA
        102 Site 32 (Swain unpublished) -91.80 46.22          NA        NA
        103 Site 33 (Swain unpublished) -91.58 45.63          NA        NA
> get_dataset(aa[['102']])

Of course, in doing that I realize that get_dataset doesn't have a good print method but, having said that, it works. You could also get the site data in the same way, using get_site.
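
For example (illustrative, following the same pattern as the get_dataset call above):

get_site(aa[['102']])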

So this was a big issue, but to summarize the resolution:

- a single download stays wrapped in a list of length one, and users are instructed to use foo[[1]] rather than the package adding special-case methods;
- the outer list now has its own download_list class with its own print method, distinct from download;
- component names have been standardized to lowercase separated by periods, which also fixes the missing site name in the printed output;
- dataset and site information are pulled out with the existing extractors, get_dataset and get_site, rather than a new metadata extractor;
- the remaining point raised by @gavinsimpson is tracked in #141.

I don't think I missed anything, so I'm closing this. @andydawson @gavinsimpson?