Closed WibbletheDuck closed 7 years ago
ping @mwpennell @wcornwell - any spare time to look at this at all?
Hmmmm interesting one: I just re-did it on a couple of machines and I'm getting the same distributions as was in the original paper:
I'll dig a bit deeper, but need more clues. Can you post the whole wood-ms-supporting.pdf doc that you're reproducing?
@wcornwell Apologies for the slow response—some platform issues were being worked out over the past week.
We do not generate a PDF in our version of the article—a Nextjournal article aims to be a single entity that includes and preserves input data, perpetually runnable code, results, and expository text. The majority of the plots (all but the two that use diversitree) have been converted to use interactive Plotly functionality; however, I have checked that our issues aren't some sort of display problem: the raw output data shows differences.
The two most obvious differences are in the plot above (generated by the Distributions
code cell) and the 'Fraction By' plots like those resulting from the Fraction by Order
cell.
Can you confirm what raw input data you're using? I've tried both the dryad-cache.tar.gz
and theplantlist-cache.tar.gz
files from the github release and freshly downloaded data, and while they differ from each other neither looks like your results.
...Actually, whatever data you're using, could you possibly send me your generated RDS files (dat.g.rds, woodiness.rds, etc)? Comparing them to mine would help me narrow down where exactly things are diverging.
Cool! BTW I like the Nextjournal project a lot--very cool thing to do.
Anyway, that one above I used freshly downloaded data. I'm pretty confident the problem isn't with the underlying data.
The dryad data didn't (can't) change. And I'm 95% sure The Plant List v1.1 hasn't changed--they've stated they are not updating it. And I haven't found any changes for a related project that relies on those data pretty heavily. On the other hand, there very well might be issues related to R library versions and/or namespace issues. We already had one issue with changes in dplyr breaking things.
Anyway, here are the RDS files from my latest run.
From looking through your version, the replication of the simulations was reduced from 1000 to 100, but you were probably aware of that change already.
what I just got from another simulation:
> kable(summary(dat.g))
[1] "| | genus | family | order | W | V | H | N | K | p |"
[2] "|:--|:----------------|:----------------|:----------------|:---------------|:----------------|:---------------|:---------------|:---------------|:-------------|"
[3] "| |Length:15252 |Length:15252 |Length:15252 |Min. : 0.000 |Min. : 0.00000 |Min. : 0.000 |Min. : 1.00 |Min. : 0.000 |Min. :0.000 |"
[4] "| |Class :character |Class :character |Class :character |1st Qu.: 0.000 |1st Qu.: 0.00000 |1st Qu.: 0.000 |1st Qu.: 1.00 |1st Qu.: 0.000 |1st Qu.:0.000 |"
[5] "| |Mode :character |Mode :character |Mode :character |Median : 0.000 |Median : 0.00000 |Median : 0.000 |Median : 3.00 |Median : 0.000 |Median :1.000 |"
[6] "| |NA |NA |NA |Mean : 1.498 |Mean : 0.03311 |Mean : 1.046 |Mean : 20.73 |Mean : 2.545 |Mean :0.556 |"
[7] "| |NA |NA |NA |3rd Qu.: 1.000 |3rd Qu.: 0.00000 |3rd Qu.: 0.000 |3rd Qu.: 11.00 |3rd Qu.: 1.000 |3rd Qu.:1.000 |"
[8] "| |NA |NA |NA |Max. :622.000 |Max. :43.00000 |Max. :497.000 |Max. :2455.00 |Max. :622.000 |Max. :1.000 |"
[9] "| |NA |NA |NA |NA |NA |NA |NA |NA |NA's :7878 |"
Great, thanks! I have checked the effect of the simulation count as well. I'll let you know once I track down the problem.
Welp, found it. My modified load.genus.order.lookup()
was pointing at the wrong genus_order_lookup.csv
(the original from /zae, not the one modified by output-genus_order_lookup.csv.R). Blah.
Thanks again for your help! If you're interested, Nextjournal plans to launch near the end of September. I hope to have a completed and polished port of the article by then, and the site should be open and ready for new users if you'd like to play around with it yourselves.
Glad you found it! That makes sense--taxonomy is always a nightmare. We found some bugs in the original lookup. That taxonomy nightmare led us to create the versioned dataset system in taxonlookup
, which should avoid that problem going forward.
Let me know when you're ready to launch and I'll tweet about Nextjournal!
Hello,
I'm working with Nextjournal, whose goals include the establishment of a platform for open and long-term reproducible research. The open nature of your paper led us to attempt porting it over as a case study: in how difficult it would be to get a pretty recent data processing and presentation code structure into working order anywhere, in the work required to bring such a body of code onto the Nextjournal platform in particular, and in accurately reproducing both the data and graphical outputs.
We've made significant progress, which you can take a look at here if you wish: https://nextjournal.com/mpd/how-much-of-the-world-is-woody/. However, I've recently noticed that my distribution results for the simulations (starting at the 'Distributions' code cell) are slightly off compared to the original article, both in shape and in absolute position on the x axis. The first example is just the first distribution plot: comparing to the original article's plot at http://richfitz.github.io/wood/figure/distribution_raw.png mine appears to be shifted to the right by over 1% and also slightly less 'peaky'. I've checked as much of the obvious stuff as I can think of: the processing code appears identical (and in many cases is simply copy-pasted), the input data I'm using comes direct from the cache files in your repo, I've tried with the 'production' values of nrep=1000, etc. Everything I can think of seems to be identical, yet this result is off, and the later 'fraction by X' plots are also different in their details.
We've finally decided I should contact you about this, to see if you have any ideas on what could be wrong. The most obvious possibility I can think of is that the cache files in the repo might have slightly more up-to-date data from theplantlist and dryad than what was used in publication, but I welcome any other ideas you might have on possible sources of such variance.
Regards, Micah Dombrowski