Reproducing Results - Githubissues

WibbletheDuck commented 7 years ago

Hello,

I'm working with Nextjournal, whose goals include the establishment of a platform for open and long-term reproducible research. The open nature of your paper led us to attempt porting it over as a case study: in how difficult it would be to get a pretty recent data processing and presentation code structure into working order anywhere, in the work required to bring such a body of code onto the Nextjournal platform in particular, and in accurately reproducing both the data and graphical outputs.

We've made significant progress, which you can take a look at here if you wish: https://nextjournal.com/mpd/how-much-of-the-world-is-woody/. However, I've recently noticed that my distribution results for the simulations (starting at the 'Distributions' code cell) are slightly off compared to the original article, both in shape and in absolute position on the x axis. The first example is just the first distribution plot: screen shot 2017-07-21 at 3 58 15 am comparing to the original article's plot at http://richfitz.github.io/wood/figure/distribution_raw.png mine appears to be shifted to the right by over 1% and also slightly less 'peaky'. I've checked as much of the obvious stuff as I can think of: the processing code appears identical (and in many cases is simply copy-pasted), the input data I'm using comes direct from the cache files in your repo, I've tried with the 'production' values of nrep=1000, etc. Everything I can think of seems to be identical, yet this result is off, and the later 'fraction by X' plots are also different in their details.

We've finally decided I should contact you about this, to see if you have any ideas on what could be wrong. The most obvious possibility I can think of is that the cache files in the repo might have slightly more up-to-date data from theplantlist and dryad than what was used in publication, but I welcome any other ideas you might have on possible sources of such variance.

Regards, Micah Dombrowski

richfitz commented 7 years ago

ping @mwpennell @wcornwell - any spare time to look at this at all?

wcornwell commented 7 years ago

Hmmmm interesting one: I just re-did it on a couple of machines and I'm getting the same distributions as was in the original paper:

I'll dig a bit deeper, but need more clues. Can you post the whole wood-ms-supporting.pdf doc that you're reproducing?

WibbletheDuck commented 7 years ago

@wcornwell Apologies for the slow response—some platform issues were being worked out over the past week.

We do not generate a PDF in our version of the article—a Nextjournal article aims to be a single entity that includes and preserves input data, perpetually runnable code, results, and expository text. The majority of the plots (all but the two that use diversitree) have been converted to use interactive Plotly functionality; however, I have checked that our issues aren't some sort of display problem: the raw output data shows differences.

The two most obvious differences are in the plot above (generated by the Distributions code cell) and the 'Fraction By' plots like those resulting from the Fraction by Order cell.

Can you confirm what raw input data you're using? I've tried both the dryad-cache.tar.gz and theplantlist-cache.tar.gz files from the github release and freshly downloaded data, and while they differ from each other neither looks like your results.

...Actually, whatever data you're using, could you possibly send me your generated RDS files (dat.g.rds, woodiness.rds, etc)? Comparing them to mine would help me narrow down where exactly things are diverging.

wcornwell commented 7 years ago

Cool! BTW I like the Nextjournal project a lot--very cool thing to do.

Anyway, that one above I used freshly downloaded data. I'm pretty confident the problem isn't with the underlying data.

The dryad data didn't (can't) change. And I'm 95% sure The Plant List v1.1 hasn't changed--they've stated they are not updating it. And I haven't found any changes for a related project that relies on those data pretty heavily. On the other hand, there very well might be issues related to R library versions and/or namespace issues. We already had one issue with changes in dplyr breaking things.

Anyway, here are the RDS files from my latest run.

From looking through your version, the replication of the simulations was reduced from 1000 to 100, but you were probably aware of that change already.

wcornwell commented 7 years ago

what I just got from another simulation:

wcornwell commented 7 years ago

> kable(summary(dat.g))
[1] "|   |   genus         |   family        |   order         |      W         |      V          |      H         |      N         |      K         |      p       |"
[2] "|:--|:----------------|:----------------|:----------------|:---------------|:----------------|:---------------|:---------------|:---------------|:-------------|"
[3] "|   |Length:15252     |Length:15252     |Length:15252     |Min.   :  0.000 |Min.   : 0.00000 |Min.   :  0.000 |Min.   :   1.00 |Min.   :  0.000 |Min.   :0.000 |"
[4] "|   |Class :character |Class :character |Class :character |1st Qu.:  0.000 |1st Qu.: 0.00000 |1st Qu.:  0.000 |1st Qu.:   1.00 |1st Qu.:  0.000 |1st Qu.:0.000 |"
[5] "|   |Mode  :character |Mode  :character |Mode  :character |Median :  0.000 |Median : 0.00000 |Median :  0.000 |Median :   3.00 |Median :  0.000 |Median :1.000 |"
[6] "|   |NA               |NA               |NA               |Mean   :  1.498 |Mean   : 0.03311 |Mean   :  1.046 |Mean   :  20.73 |Mean   :  2.545 |Mean   :0.556 |"
[7] "|   |NA               |NA               |NA               |3rd Qu.:  1.000 |3rd Qu.: 0.00000 |3rd Qu.:  0.000 |3rd Qu.:  11.00 |3rd Qu.:  1.000 |3rd Qu.:1.000 |"
[8] "|   |NA               |NA               |NA               |Max.   :622.000 |Max.   :43.00000 |Max.   :497.000 |Max.   :2455.00 |Max.   :622.000 |Max.   :1.000 |"
[9] "|   |NA               |NA               |NA               |NA              |NA               |NA              |NA              |NA              |NA's   :7878  |"

WibbletheDuck commented 7 years ago

Great, thanks! I have checked the effect of the simulation count as well. I'll let you know once I track down the problem.

WibbletheDuck commented 7 years ago

Welp, found it. My modified load.genus.order.lookup() was pointing at the wrong genus_order_lookup.csv (the original from /zae, not the one modified by output-genus_order_lookup.csv.R). Blah.

Thanks again for your help! If you're interested, Nextjournal plans to launch near the end of September. I hope to have a completed and polished port of the article by then, and the site should be open and ready for new users if you'd like to play around with it yourselves.

wcornwell commented 7 years ago

Glad you found it! That makes sense--taxonomy is always a nightmare. We found some bugs in the original lookup. That taxonomy nightmare led us to create the versioned dataset system in taxonlookup, which should avoid that problem going forward.

Let me know when you're ready to launch and I'll tweet about Nextjournal!

richfitz / wood

Reproducing Results #20