osulp / Scholars-Archive

ScholarsArchive@OSU, institutional repository for Oregon State University
https://ir.library.oregonstate.edu/

Analytics accuracy check #2045

Open simholt opened 4 years ago

simholt commented 4 years ago

Descriptive summary

@clarallebot noticed high download numbers for three recent deposits and wonders how accurate they are. The number of downloads per day is similar or identical across the three (85 for each on 2/20), while the pageviews for the corresponding records are in single digits.

Analytics are important to creators; our numbers need to be as accurate as possible.

I looked at some other Public items that were recently deposited, and they also have very high download counts: 1144 downloads but only 12 pageviews for this thesis, and 928 downloads with 15 pageviews for another item.

Expected behavior

Analytics should exclude bots and crawlers.

Actual behavior

  1. Peer Review of Research Data Submissions (v692td402)
     Went live 2/18/20
     Analytics: 2/18: 43 downloads; 2/19: 86; 2/20: 85
     Record pageviews: 2

  2. Remediation Data Management Plans (vm40xz548)
     Went live 2/18/20
     Analytics: 2/18: 47 downloads; 2/19: 86; 2/20: 85
     Record pageviews: 4

  3. Give Them What They Want (9593v2274)
     Went live 2/19/20, mid-day
     Analytics: 2/19: 52 downloads; 2/20: 85
     Record pageviews: 4
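To make the mismatch concrete, here is a minimal sketch that computes downloads per pageview for the three records above. The counts are the ones reported in this issue; the threshold `SUSPICIOUS_RATIO` is a hypothetical cutoff chosen only for illustration, not a value used by ScholarsArchive or Google Analytics.

```python
# Daily download counts and record pageviews as reported in this issue.
items = {
    "v692td402": {"downloads": [43, 86, 85], "pageviews": 2},
    "vm40xz548": {"downloads": [47, 86, 85], "pageviews": 4},
    "9593v2274": {"downloads": [52, 85], "pageviews": 4},
}

# Hypothetical cutoff (an assumption): flag records whose downloads
# exceed pageviews by more than this factor.
SUSPICIOUS_RATIO = 10.0


def downloads_per_pageview(item):
    """Total downloads divided by pageviews (guarding against zero views)."""
    total_downloads = sum(item["downloads"])
    return total_downloads / max(item["pageviews"], 1)


ratios = {work_id: downloads_per_pageview(item) for work_id, item in items.items()}
flagged = [work_id for work_id, ratio in ratios.items() if ratio > SUSPICIOUS_RATIO]

for work_id, ratio in ratios.items():
    print(f"{work_id}: {ratio:.2f} downloads per pageview")
```

All three records come out far above one download per pageview, which is what prompted the question about bots and crawlers.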

decimalator commented 4 years ago

From @KennaW

We maintain a dynamic exclusion list of known robots and crawlers at https://github.com/atmire/COUNTER-Robots. All COUNTER compliant entities use this list to eliminate bots and crawlers. I hope it helps.

Do let us know if you find any bots or crawlers not on this list, our Robots and Crawlers working group will review and update the list accordingly.
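As a rough illustration of how an exclusion list like this could be applied, the sketch below matches user-agent strings against regular-expression patterns. The patterns in `SAMPLE_PATTERNS` are illustrative stand-ins, not copied from the COUNTER-Robots list; in practice they would be loaded from that repository. Case-insensitive matching is also an assumption here.

```python
import re

# Illustrative stand-in patterns (an assumption). A real setup would load
# the patterns published in the COUNTER-Robots repository instead.
SAMPLE_PATTERNS = ["bot", "crawl", "spider", "^curl"]


def compile_patterns(patterns):
    """Compile each pattern once; case-insensitive matching is assumed."""
    return [re.compile(pattern, re.IGNORECASE) for pattern in patterns]


def is_robot(user_agent, compiled):
    """Return True if any exclusion pattern matches the user-agent string."""
    return any(rx.search(user_agent) for rx in compiled)


def count_human_hits(user_agents, compiled):
    """Count hits whose user agent matches none of the exclusion patterns."""
    return sum(1 for ua in user_agents if not is_robot(ua, compiled))


COMPILED = compile_patterns(SAMPLE_PATTERNS)

hits = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0",
    "curl/8.0.1",
]
print(count_human_hits(hits, COMPILED))
```

A list-based filter only catches clients that identify themselves honestly, so it sets a lower bound on bot traffic rather than removing all of it.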

KennaW commented 4 years ago

The university's Google Analytics contact (Kelly Holcomb :) ) recommended trying to route the 'real' traffic through a custom URL campaign: https://support.google.com/analytics/answer/1033863?hl=en

https://ga-dev-tools.appspot.com/campaign-url-builder/
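For reference, a campaign URL is an ordinary link with `utm_*` query parameters appended. The sketch below builds one with Python's standard library. The parameter names `utm_source`, `utm_medium`, and `utm_campaign` are the standard Google Analytics campaign parameters; the values (`newsletter`, `email`, `spring_deposits`) are hypothetical and only for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_campaign_params(url, source, medium, campaign):
    """Append Google Analytics campaign parameters to a URL,
    keeping any query parameters the URL already has."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [
        ("utm_source", source),
        ("utm_medium", medium),
        ("utm_campaign", campaign),
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical campaign values, chosen only for this example.
tagged = add_campaign_params(
    "https://ir.library.oregonstate.edu/",
    source="newsletter",
    medium="email",
    campaign="spring_deposits",
)
print(tagged)
```

Traffic arriving through tagged links can then be segmented in Google Analytics, though this only identifies visits that start from links we control.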

carakey commented 1 year ago

With the recent change from Google Analytics 3 to GA 4, we've been looking at the views and downloads for SA again. Reliable usage statistics are still important to creators. "Reliable" and "accurate" mean real humans viewing and downloading SA content.

Some artificially high counts may be caused by counting thumbnail hits, which would be resolved with #1889. The main culprit seems to be bot traffic, and we should leverage GA4 improvements to bot filtering.

@KennaW and @CGillen did some exploratory work and likely have more to say.

CGillen commented 8 months ago

After a little more exploring, it looks like for 1/7/2024 Google Analytics reports 49437 Downloads, while our logs show about 45558. The log figure is a rough estimate, since our log parsing utility doesn't quite have the right tool set to run this type of analysis. For both figures, thumbnail downloads were excluded.

This seems within reason of being accurate for raw download visits. Not sure on 'reliability.'
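As a quick check of how close the two download figures are, this small sketch computes the relative difference between the GA4 count and the log estimate, using the numbers quoted above.

```python
def relative_difference(reported, baseline):
    """Fractional difference of `reported` relative to `baseline`."""
    return (reported - baseline) / baseline


ga4_downloads = 49437  # GA4 Downloads for 1/7/2024, from this comment
log_downloads = 45558  # rough log-based estimate for the same day

diff = relative_difference(ga4_downloads, log_downloads)
print(f"GA4 is {diff:.1%} above the log estimate")  # roughly 8.5%
```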

Regular page visits are way off. Again, log parsing is imperfect and is likely over-reporting because of clear bot traffic, but after excluding downloads (and thumbnails), admin/dashboard, and edit/new interfaces, we got around 1m page visits for 1/7/2024, while GA4 reports 1053. Unfortunately, GA4 automatically applies whatever bot detection and filtering it wants without any transparency to us. It's impossible to tell whether all our traffic looks like bot traffic to them for some reason or whether we're not reporting correctly.
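The exclusions described above can be sketched as a simple path classifier. This is not the log parsing utility mentioned in this comment; the URL shapes are assumptions made for illustration (downloads under `/downloads/`, thumbnails requested with `file=thumbnail`, admin pages under `/dashboard` or `/admin`, and edit/new forms ending in `/edit` or `/new`).

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def classify(request_path):
    """Bucket a request path as download, thumbnail, excluded, or page.

    The path conventions checked here are assumptions for this sketch.
    """
    parts = urlsplit(request_path)
    path = parts.path
    if path.startswith("/downloads/"):
        # Assumed convention: thumbnails are downloads with file=thumbnail.
        if parse_qs(parts.query).get("file") == ["thumbnail"]:
            return "thumbnail"
        return "download"
    if path.startswith(("/dashboard", "/admin")):
        return "excluded"
    if path.endswith(("/edit", "/new")):
        return "excluded"
    return "page"


def tally(paths):
    """Count requests per bucket."""
    return Counter(classify(p) for p in paths)


# Hypothetical request paths, for illustration only.
sample = [
    "/downloads/v692td402",
    "/downloads/v692td402?file=thumbnail",
    "/dashboard/works",
    "/concern/articles/v692td402/edit",
    "/concern/articles/v692td402",
]
print(tally(sample))
```

Combining this with a user-agent filter would bring the log-based page count closer to human traffic, but it would still not explain what GA4 filters on its side.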

CGillen commented 8 months ago

Still not sure why page_view is not as high as it was previously. Investigation continues. @carakey For future clarity, do you want to remove page_view tracking on download? This would make page_view reflect actual page traffic, and Download would remain download traffic only. As it is, page_view includes all of Download.

carakey commented 8 months ago

> @carakey For future clarity do you want to remove page_view tracking on download? This would make page_view reflect actual page traffic and Download will remain only download traffic. As it is page_view includes all of Download.

Yes, I think that would improve understandability of our stats. Thanks!

CGillen commented 6 months ago

Ok, I'm seeing analytics in this kind of breakdown:

GA4:
  page_views: 1k - 2.2k for regular days, peaking at 5.4k
  Downloads: 6k - 12k for regular days, peaking at 27k - 33k

Previous GA:
  page_views: 1.1k - 2k for regular days, peaking at 3.5k
  Downloads: 8k - 13k for regular days, peaking at 35k - 111k
  (Hugely anomalous over 20 months compared to 3 months for GA4)

To me, this seems reasonably accurate now. @carakey?

carakey commented 6 months ago

@CGillen I think there's been solid improvement. I agree these numbers seem reasonable, or at least I don't have any data to say otherwise.

I think Clara's original concern for this ticket was about download numbers being much higher than page views. We're still seeing that at this macro level, with 2K page views vs 10K downloads daily. This doesn't quite agree with how library folks expect users to navigate to works: search, arrive at landing page (+1 page view), and then decide to download (+1 download), or sometimes decide not to download, which would result in more views than downloads overall. I think at least one of these things is happening:

  1. The assumption is wrong, and the majority of users get to SA with a download link from Google/Scholar or other referring source;
  2. What GA calls "page_views" and "Downloads" aren't the same as how humans/librarians understand these words;
  3. The original suspicion, that tons of bot traffic is racking up downloads while bypassing views.

...Or is it something else entirely? Do we have any way to know?
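One way to probe the first and third possibilities, assuming server logs record a referrer for each download request, is to see where downloads come from. The sketch below is only a starting point: the host name is taken from the repository URL, the sample referrers are hypothetical, and a missing referrer is ambiguous (it could be a bot, a direct link, or a privacy setting), so the result would be suggestive rather than conclusive.

```python
from collections import Counter
from urllib.parse import urlsplit

SITE_HOST = "ir.library.oregonstate.edu"


def referrer_kind(referrer):
    """Classify a download's referrer as internal, external, or none."""
    if not referrer or referrer == "-":
        return "none"
    host = urlsplit(referrer).hostname
    if host is None:
        return "none"
    return "internal" if host == SITE_HOST else "external"


def summarize(download_referrers):
    """Return the share of downloads per referrer kind."""
    counts = Counter(referrer_kind(r) for r in download_referrers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: counts[kind] / total for kind in ("internal", "external", "none")}


# Hypothetical referrers for four download requests.
sample_referrers = [
    "https://ir.library.oregonstate.edu/concern/articles/v692td402",
    "https://scholar.google.com/scholar?q=peer+review",
    "-",
    "",
]
print(summarize(sample_referrers))
```

A large "external" share would support the idea that users arrive directly at download links from search engines, while a large "none" share combined with robot-like user agents would point toward bot traffic.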