hpiwowar / alt-metrics_stats

Stats for PLoS alt-metrics project

missing download counts? #4

Closed hpiwowar closed 13 years ago

hpiwowar commented 13 years ago

It looks like there may be three articles with missing html/pdf/xml download stats. A quick look at the metrics tab in PLoS suggests some of these shouldn't be zero:

> t(dat.raw.eventcounts[dat.raw.eventcounts$xmlDownloadsCount == 0,])
doi                       "10.1371/journal.pbio.0020069" "10.1371/journal.pbio.0020303" "10.1371/journal.pbio.0040286"
pubDate                   "2004-03-16T00:00:00+00:00"    "2004-09-21T00:00:00+00:00"    "2006-08-29T00:00:00+00:00"
journal                   "pbio"                         "pbio"                         "pbio"
f1000Factor               "false"                        "false"                        "false"
backtweetsCount           "0"                            "0"                            "0"
deliciousCount            "0"                            "0"                            "0"
facebookShareCount        "0"                            "0"                            "0"
facebookLikeCount         "0"                            "0"                            "0"
facebookCommentCount      "0"                            "0"                            "0"
facebookClickCount        "0"                            "0"                            "0"
mendeleyReadersCount      "24"                           " 7"                           "33"
almBlogsCount             "0"                            "0"                            "0"
pdfDownloadsCount         "0"                            "0"                            "0"
xmlDownloadsCount         "0"                            "0"                            "0"
htmlDownloadsCount        "0"                            "0"                            "0"
almCiteULikeCount         "0"                            "0"                            "0"
almScopusCount            "0"                            "0"                            "0"
almPubMedCount            "0"                            "0"                            "0"
almCrossRefCount          "0"                            "0"                            "0"
plosCommentCount          "0"                            "0"                            "0"
plosCommentResponsesCount "0"                            "0"                            "0"
wikipediaCites            "2"                            "2"                            "1"

jasonpriem commented 13 years ago

I've found the problem; the xml metrics data are missing for these three articles. Possible solutions:

  1. alert PLoS, get them to fix this (will likely take a long time)
  2. get data from another source, enter it manually (don't like this, as it makes work harder to duplicate in future)
  3. throw them out (don't like this either, as we lose a lot of tasty events)

thoughts?

hpiwowar commented 13 years ago

Another solution:

  4. Treat the missing data as NA in the stats analysis. This will result in throwing the articles out for many types of analyses, but not all.

I'm leaning towards #4, since this is exactly what NA is for. Thoughts? To implement it, maybe put "NA" or "?" or something else in the raw data? (Leaving the cells blank would also work, but blanks can be misleading.)
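A minimal sketch of how option 4 could look on read-in, assuming the raw data is a CSV with "NA" or "?" in the missing cells (the excerpt below is hypothetical; the second DOI is a placeholder):

```r
# Hypothetical two-row excerpt of the raw data file; the second DOI is a
# placeholder. read.csv's na.strings turns the marked cells into R's NA.
raw <- 'doi,pdfDownloadsCount,xmlDownloadsCount
"10.1371/journal.pbio.0020069",NA,NA
"example.doi.placeholder",2500,130'
dat <- read.csv(textConnection(raw), na.strings = c("NA", "?"))

is.na(dat$xmlDownloadsCount)  # TRUE for the affected article, FALSE otherwise
```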

jasonpriem commented 13 years ago

4 is good, but it makes the analysis harder; we've got to remember to throw these three out of the denominator for a lot of metrics (everything with "alm" in front of the name). For example, you'd have to remove these three from the total article count when you figure crossref citations / article.

Then you have different numbers of articles for different metrics. For instance, "number of Mendeley-read articles in 2008" can't be directly compared to "number of CiteULike-bookmarked articles in 2008", because they have different denominators.

I suggest we report them in our raw counts, then note that "three articles were missing instrumental PLoS data, so were removed from further analysis."

hpiwowar commented 13 years ago

Sure, I'm happy to throw these out to keep life simple.

fwiw, I think R handles the denominator issue well; missing data is a big part of life in stats. R knows about NA and treats it specially when calculating mean() etc. Admittedly the comparison couldn't then be "number of Mendeley-read articles in 2008", but rather a comparison of the percentage of Mendeley-read articles among all applicable articles; R would take care of the appropriate calcs.
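To illustrate the NA handling mentioned above (toy numbers, not from the dataset): mean() propagates NA by default, and na.rm = TRUE drops the missing articles from the denominator automatically, which is what makes the percentage comparison work.

```r
xml <- c(130, NA, 87, 210, NA)       # toy download counts; NA = missing data
mean(xml)                            # NA: missingness propagates by default
mean(xml, na.rm = TRUE)              # missing articles dropped from denominator

# Comparing percentages rather than raw counts keeps denominators honest:
read.in.mendeley <- c(TRUE, NA, TRUE, FALSE, NA)
mean(read.in.mendeley, na.rm = TRUE) # proportion read, among applicable articles
```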

Anyway. For the sake of making the raw data robust, do you want to strip the articles out of the data files, or include them and just sub the missing values with NA or ? or something? I'd suggest the latter if it isn't too much work... then I can drop the articles in analysis preprocessing.

jasonpriem commented 13 years ago

I figure you are the stats head chef. If you still like #4, you should go for it, as long as you don't mind adding a whole bunch of [!is.na(x.pdfDownloadsCount)] to save a few data points. Or whichever flavour of #3 you prefer (save actually changing the raw dataset, which I agree would be a mistake).
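One way to avoid scattering [!is.na(...)] guards is a single drop in preprocessing, as suggested above. A sketch, where the column names follow the thread but the data values are invented:

```r
# Toy frame standing in for dat.raw.eventcounts; values are invented
dat <- data.frame(doi               = c("doi.a", "doi.b", "doi.c"),
                  pdfDownloadsCount = c(500, NA, 320),
                  almScopusCount    = c(4, 2, 7),
                  stringsAsFactors  = FALSE)

# Drop rows with any NA once, up front, instead of guarding
# every later calculation with [!is.na(...)]
dat.complete <- dat[complete.cases(dat), ]
nrow(dat.complete)  # the article with the missing download count is gone
```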

hpiwowar commented 13 years ago

Going with #4 for now and closing the ticket. Will reopen if it ends up not being workable.