Closed hpiwowar closed 13 years ago
I've found the problem; the xml metrics data are missing for these three articles. Possible solutions:
thoughts?
Another solution:
I'm leaning towards #4, since that is what NA is there for. Thoughts? To implement this, maybe put "NA" or "?" or something else in the raw data? (or leave the cell blank, but that can be misleading)
Then you have different numbers of articles for different metrics. For instance, "number of Mendeley-read articles in 2008" can't be directly compared to "number of CiteULike-bookmarked articles in 2008", because they have different denominators.
I suggest we include them in our raw counts, then state that "three articles were missing instrumental PLoS data, so were removed from further analysis".
Sure, I'm happy to throw these out to keep life simple.
fwiw, I think that R handles the denominator issue well; missing data is a big part of life in stats. R knows about "NA" and treats it in special ways when calculating mean() etc. Admittedly the comparison couldn't be "number of Mendeley-read articles in 2008" but rather a comparison of percentages of Mendeley-read articles, given all (applicable) articles, but R would take care of the appropriate calcs.
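A minimal sketch of the NA behaviour described above, using made-up counts for five articles (the vector names and values are hypothetical, not taken from the dataset). It shows that `mean()` propagates NA unless `na.rm = TRUE` is set, and how a percentage over the applicable articles gives each metric its own denominator:

```r
# Hypothetical counts for five articles; NA marks a missing value.
mendeley_readers    <- c(3, 0, NA, 12, 5)
citeulike_bookmarks <- c(1, NA, NA, 0, 2)

# mean() returns NA if any value is missing, unless na.rm = TRUE.
mean_with_na    <- mean(mendeley_readers)
mean_without_na <- mean(mendeley_readers, na.rm = TRUE)

# Share of articles with at least one event, computed only over the
# articles that have data for that metric (different denominators).
pct_mendeley  <- mean(mendeley_readers > 0, na.rm = TRUE)     # 3 of 4 articles
pct_citeulike <- mean(citeulike_bookmarks > 0, na.rm = TRUE)  # 2 of 3 articles
```

The two percentages are comparable even though the underlying counts of articles are not.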
Anyway. For the sake of making the raw data robust, do you want to strip the articles out of the data files, or include them and just sub the missing values with NA or ? or something? I'd suggest the latter if it isn't too much work; then I can drop the articles in analysis preprocessing.
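A rough sketch of what that preprocessing could look like, assuming the raw file marks missing values with "NA", "?" or a blank cell. The inline CSV, the `doi` values and the column names are all hypothetical stand-ins for the real data files:

```r
# Hypothetical raw data: a2, a3 and a4 each lack at least one download stat.
raw_csv <- "doi,htmlDownloadsCount,pdfDownloadsCount,xmlDownloadsCount
a1,120,30,4
a2,?,?,?
a3,85,NA,2
a4,40,10,
a5,10,5,1
"

# Treat "NA", "?" and empty cells as missing when reading the raw data.
articles <- read.csv(text = raw_csv,
                     na.strings = c("NA", "?", ""),
                     stringsAsFactors = FALSE)

# Drop articles with any missing download stat in preprocessing,
# leaving the raw data itself untouched.
download_cols <- c("htmlDownloadsCount", "pdfDownloadsCount", "xmlDownloadsCount")
complete_articles <- articles[complete.cases(articles[, download_cols]), ]

n_dropped <- nrow(articles) - nrow(complete_articles)
```

Declaring the markers in `na.strings` keeps the columns numeric, so a stray "?" does not turn a whole column into text.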
I figure you are the stats head chef. If you still like #4, you should go for it, as long as you don't mind adding a whole bunch of [!is.na(x.pdfDownloadsCount)] to save a few data points. Or whichever flavour of #3 you prefer (save actually changing the raw dataset, which I agree would be a mistake).
Going with #4 for now and closing the ticket. Will reopen if it ends up not being workable.
It looks like there may be three articles with missing HTML/PDF/XML download stats. A quick look at the metrics tab in PLoS suggests some of these shouldn't be zero: