Some comments

Pakillo commented 10 years ago

Hi Nick and Dave,

I see the proto-ms has evolved a lot in the last week and is almost ready for submission ;-). It's also becoming quite technical and I can't contribute much. But, as promised, I'll leave some comments just in case they are useful.

1. It now seems clear that prevalence, and thus absolute probability of presence, is hardly identifiable from presence-only data unless we have rather large sample sizes and make rather strong (and mostly unjustified) assumptions (e.g. Phillips & Elith 2013; Hastie & Fithian 2013). I think that in many SDM applications obtaining the absolute probability of presence is not that relevant, hence it may be fine, and even better, to give up and proceed with just relative probabilities or 'suitabilities'.
2. Recent papers by Warton and others have shown that point process models are the way to go with presence-only data. Actually, trying to fit these data in a presence-absence/pseudoabsence/background framework (i.e. logistic regression and the like) may have taken us in the wrong direction (Chakraborty et al. 2011), with many studies trying to decide how best to select these pseudoabsence/background points while we know they can strongly affect model results. Chakraborty et al. (2011) go as far as to say that "presence-only data are not inferior to presence-absence data. In fact, it is the converse; presence-only data offer a complete census whereas presence-absence data, since confined to a specified set of sampling sites, contain less information".
3. Regarding the selection of pseudoabsence/background points, you propose in the 'Incorporating prevalence' section to use a number of background points according to the expected prevalence of the species. However, I understood from Warton et al. (2010), Barbet-Massin et al. (2012) and others that we should use a really large number of background or pseudoabsence points (even though weighted to account for prevalence). Could you clarify this point a bit more?
4. Even if Maxent is equivalent to a point process model, fitting models in the latter framework (i.e. GLMs and their many derivatives) brings many advantages (such as dealing with spatial effects, uncertainty propagation, etc.; see e.g. Chakraborty et al. 2011) that are not present in Maxent. Also, Maxent logistic output is problematic and using the model's raw probabilities may be preferred in many cases.
5. I haven't thought about the mathematical details, but couldn't we estimate probability of presence from the estimated intensity of a point process model? That is, if we have an idea of the expected number of occurrences per unit area, can't we estimate the probability of finding at least one presence in a cell of given area? For instance, Chakraborty et al. (2011) were able to estimate species richness per grid cell from estimated intensities. Could something similar work for the probability of presence of a single species? I suspect there is some caveat in between... perhaps it would be interesting to discuss it in the ms.

Nothing else for now... Hope this helps somehow, and good luck with the ms!

Cheers,

Paco

davharris commented 10 years ago

These are great comments. Sorry it took me a while to get to them.

1. I agree.
2. I disagree strongly. Any sampling method is going to be confined to the sites that were explored. With presence-absence data, we know what sites were explored. The only difference with presence-only data is that we "forget" what sites were explored. We can assume that sites were explored randomly, but that's going to be approximate at best. If presence-only data were better, we could always achieve it by collecting presence-absence data and then dropping the absences.
3. I prefer using a large number of background points as well. I'd be up for removing this section if Nick is okay with it.
4. I agree that the flexibility to use the software and models of one's choice is a major advantage of point process models over the MaxEnt software. The specific reasons you mentioned are good examples of the advantages, and there will probably be many more.
5. I don't think this would work without extremely strong assumptions (like in MaxLike) or prevalence information. I haven't seen Chakraborty's paper, though.

goldingn commented 10 years ago

Hi guys, I can't believe it took me almost a month to get to this - I'm really sorry.

First, Paco - thanks so much for sharing your thoughts!

1. I mostly agree. In my area the absolute probability of presence does seem to be useful for a number of things, not least for determining the thresholds and decision criteria which public health policy makers often want. That said, I don't really know how useful it could be in broader SDM applications. I wonder what proportion of the 54% of MAXENT users who interpreted the output as probability of presence in that Yackulic study went on to use it in that respect or just reported it incorrectly? Possibly something we could quantify.
2. a) I agree that point processes are probably going to be the best likelihoods for many applications of POSDM. However, they don't (automatically) get around the observation bias issue, which seems to me to be the single biggest problem with current POSDM practice. I.e. a point process which treats all non-presences as equal will be just as bad in that respect as a naive logistic regression-type model with random placement of background points. Thinned point processes seem the way to go, though the thinning is as subjective as pseudo-absence placement, so it isn't a panacea.

   b) I think the discussion over which type of data is best isn't particularly helpful; I've never been in a situation where there was any choice in the matter. I'm actually working on some disease mapping at the moment where I have data from actively-collected disease prevalence (i.e. planned surveys but only in areas where the disease is suspected, analogous to presence/absence) and the locations of passively-reported cases (i.e. subject to variable reporting rates but everywhere, analogous to presence-only data). Neither is particularly useful alone, but modelling them jointly means I can quantify and model those variable reporting rates at the same time as the disease rates, which is really useful in this situation. Obviously doing active surveys in wider areas would be optimal, but like I say, I have no say in that.

3. Yup, as you increase background points you get more information about the background and asymptotically approach some optimum, so lots of points is good. Fortunately, you can do the naive correction thing as well as using loads of background points by using regression weights. You just take your 10,000 or so background points and (down)weight each so that the weights sum to the expected number of absences. That said, I'd be happy to remove advocacy for this approach if you think it should be avoided, though I think it would be helpful to make clear the effect of this ratio of presence/background on the resulting prediction - I've not seen a good, clear reference for this.
4. Agreed! We definitely need more development in this area.
5. Yup, the probability of observing at least one should just be one minus the CDF of the predictive distribution (e.g. Poisson with estimated parameter lambda) evaluated at zero (i.e. in R: 1 - ppois(0, lambda)). This would be a good estimate in a situation where you observe all (or almost all) of the occurrences (e.g. your training data is a complete survey of tree locations), but if you observe only a fraction of the individuals then you don't stand a chance of estimating the true, absolute probability of presence. You would need a very good model of detection probability to make up for it, and that would require additional data.
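
For what it's worth, the background-point weighting and the "at least one presence" calculation above both come down to a couple of lines of arithmetic. Here is a rough sketch in Python (standard library only). The function names and all the example numbers are made up for illustration. The expected-absence formula assumes that presences make up a fraction `prevalence` of all sites, which is one reading of "expected number of absences" rather than anything fixed in the manuscript. The Poisson part assumes counts per cell are Poisson with mean `lam`, so P(N >= 1) = 1 - exp(-lam), the same quantity as `1 - ppois(0, lambda)` in R.

```python
import math


def background_weight(n_presence, prevalence, n_background):
    """Weight per background point so that the weights sum to the expected absences.

    Assumption (illustrative only): presences are a fraction `prevalence`
    of all sites, so expected absences = n_presence * (1 - prevalence) / prevalence.
    """
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be strictly between 0 and 1")
    expected_absences = n_presence * (1 - prevalence) / prevalence
    return expected_absences / n_background


def prob_at_least_one(lam):
    """P(N >= 1) for N ~ Poisson(lam), i.e. 1 - P(N = 0)."""
    return 1.0 - math.exp(-lam)


# Hypothetical numbers: 200 presences, 10% prevalence, 10,000 background points.
w = background_weight(n_presence=200, prevalence=0.1, n_background=10_000)
total_background_weight = w * 10_000  # sums to the expected number of absences

# Hypothetical cell with 2 expected occurrences, of which only 10% get recorded.
p_true = prob_at_least_one(2.0)                # based on the true intensity
p_from_records = prob_at_least_one(2.0 * 0.1)  # based on the thinned (observed) intensity
```

With these made-up numbers each background point gets a weight of about 0.18 (presences would keep weight 1 in the regression), and the probability computed from the recorded intensity comes out far below the one computed from the true intensity, which is the detection problem described above.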

goldingn / POSDM_review

Some comments #5