Open simholt opened 4 years ago
From @KennaW
We maintain a dynamic exclusion list of known robots and crawlers at https://github.com/atmire/COUNTER-Robots. All COUNTER compliant entities use this list to eliminate bots and crawlers. I hope it helps.
Do let us know if you find any bots or crawlers not on this list, our Robots and Crawlers working group will review and update the list accordingly.
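As a rough illustration of how an exclusion list like this is usually applied, the sketch below matches user-agent strings against a few regular-expression patterns. It assumes the list is consumed as case-insensitive regular expressions. The patterns shown are made-up stand-ins, not the actual contents of the COUNTER-Robots list, and `compile_patterns` and `is_robot` are hypothetical helper names.

```python
import re

# Illustrative stand-in patterns only (assumption); the maintained list
# lives in the COUNTER-Robots repository linked above.
SAMPLE_PATTERNS = ["bot", "crawl", "spider", "^curl"]


def compile_patterns(patterns):
    """Compile each pattern once, ignoring case."""
    return [re.compile(p, re.IGNORECASE) for p in patterns]


def is_robot(user_agent, compiled):
    """Return True if any exclusion pattern matches the user-agent string."""
    return any(rx.search(user_agent) for rx in compiled)


compiled = compile_patterns(SAMPLE_PATTERNS)
hits = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0",
]
# Keep only the hits that do not look like robots.
human_hits = [ua for ua in hits if not is_robot(ua, compiled)]
print(len(human_hits))
```

A filter like this only catches bots that identify themselves honestly in the user-agent header, so it is a floor on bot traffic rather than a complete answer.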
After a conversation with the university's Google Analytics contact (Kelly Holcomb :) ), she recommended trying to route the 'real' traffic through a custom URL campaign: https://support.google.com/analytics/answer/1033863?hl=en
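For reference, a custom campaign URL is an ordinary link with `utm_source`, `utm_medium`, and `utm_campaign` query parameters appended. A minimal sketch of tagging a link that way follows. The helper name `add_campaign`, the example URL, and the campaign values are all hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_campaign(url, source, medium, campaign):
    """Append UTM campaign parameters, keeping any existing query string."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [
        ("utm_source", source),
        ("utm_medium", medium),
        ("utm_campaign", campaign),
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical work URL and campaign values, for illustration only.
tagged = add_campaign(
    "https://example.org/concern/works/abc?locale=en",
    source="catalog",
    medium="referral",
    campaign="real_traffic",
)
print(tagged)
```

Only links we control can be tagged this way, so traffic arriving from search engines or direct links would still be untagged.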
With the recent change from Google Analytics 3 to GA4, we've been looking at the views and downloads for SA again. Reliable usage statistics are still important to creators. "Reliable" and "accurate" mean real humans viewing and downloading SA content.
Some artificially high counts may be caused by counting thumbnail hits, which would be resolved with #1889. The main culprit seems to be bot traffic, and we should leverage GA4 improvements to bot filtering.
@KennaW and @CGillen did some exploratory work and likely have more to say.
After doing a little more exploring, it looks like for 1/7/2024:
Google analytics reports 49437 Downloads.
Logs read about 45558. Logs are a rough estimate, since our log parsing utility doesn't quite have the right tool set to run this type of analysis.
For both of these, thumbnail downloads were excluded
This seems reasonably accurate for raw download visits. Not sure about 'reliability.'
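To put a number on "within reason", the gap between the two download counts can be expressed relative to the log estimate. This is only arithmetic on the two figures quoted above; `relative_gap` is a hypothetical helper name.

```python
ga_downloads = 49437   # Google Analytics figure for 1/7/2024
log_downloads = 45558  # rough log-based estimate for the same day


def relative_gap(a, b):
    """Return how much larger a is than b, as a fraction of b."""
    return (a - b) / b


gap = relative_gap(ga_downloads, log_downloads)
print(f"GA reports {ga_downloads - log_downloads} more downloads ({gap:.1%} above the logs)")
```

That works out to a difference of 3879 downloads, or about 8.5% above the log estimate.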
Regular page visits are way off. Again, log parsing is imperfect and is likely over-reporting because of clear bot traffic, but after excluding downloads (and thumbnails), admin/dashboard, and edit/new interfaces, we got around 1 million page visits for 1/7/2024.
GA4 reports 1053.
Unfortunately, GA4 automatically applies whatever bot detection and filtering it wants, without any transparency to us. It's impossible to tell whether all our traffic looks like bot traffic to them for some reason or whether we're not reporting correctly.
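The log-side exclusions described above (downloads and thumbnails, admin/dashboard, edit/new) could be sketched roughly as follows. This is not our actual log parsing utility. The path conventions (`/downloads/...`, `?file=thumbnail`, `/dashboard`, `/admin`, trailing `/edit` or `/new`) are assumptions for illustration and would need to be checked against the real routes.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed path conventions (hypothetical); adjust to the application's routes.
EXCLUDED_FIRST_SEGMENTS = {"downloads", "dashboard", "admin"}
EXCLUDED_LAST_SEGMENTS = {"edit", "new"}


def is_page_visit(request_target):
    """Return True if a request path should count as a regular page visit."""
    parts = urlsplit(request_target)
    # Thumbnail requests are assumed to carry a file=thumbnail query parameter.
    if "thumbnail" in parse_qs(parts.query).get("file", []):
        return False
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return True  # the home page
    if segments[0] in EXCLUDED_FIRST_SEGMENTS:
        return False
    return segments[-1] not in EXCLUDED_LAST_SEGMENTS


def count_page_visits(request_targets):
    """Count the request paths that survive the exclusions."""
    return sum(1 for target in request_targets if is_page_visit(target))


sample = [
    "/",
    "/concern/theses/abc123",
    "/downloads/abc123",
    "/downloads/abc123?file=thumbnail",
    "/concern/theses/abc123/edit",
    "/dashboard/works",
    "/concern/theses/new",
]
count = count_page_visits(sample)
print(count)
```

Path filtering alone does nothing about bots requesting ordinary pages, which is why a log count built this way will still run high.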
Still not sure why page_view is not as high as it was previously. Investigation continues.
@carakey For future clarity, do you want to remove page_view tracking on download? This would make page_view reflect actual page traffic, and Download will remain only download traffic. As it is, page_view includes all of Download.
Yes, I think that would improve understandability of our stats. Thanks!
Ok, I'm seeing analytics in this kind of break down:
GA4:
page_views: 1k - 2.2k for regular days and peaked on 5.4k
Downloads: 6k - 12k for regular days and peaked on 27k - 33k

Previous GA:
page_views: 1.1k - 2k for regular days and peaked on 3.5k
Downloads: 8k - 13k for regular days and peaked on 35k - 111k (Hugely anomalous over 20 months compared to 3 months for GA4)
To me, this seems reasonably accurate now. @carakey?
@CGillen I think there's been solid improvement. I agree these numbers seem reasonable, or at least I don't have any data to say otherwise.
I think Clara's original concern for this ticket was about download numbers being much higher than page views, which we're still seeing at this macro level with 2K page views vs 10K downloads daily, and this sort of doesn't agree with how library folks expect users to navigate to works -- search, arrive at landing page (+1 page view), and then decide to download (+1 download) -- or sometimes decide not to download, which would result in overall more views than downloads. I think at least one of these things is happening:
...Or is it something else entirely? Do we have any way to know?
Descriptive summary
@clarallebot noticed some high download numbers for three recent deposits and wonders how accurate they are. The number of downloads per day is similar or identical (85 for each on 2/20). The pageviews for the corresponding records are in single digits.
Analytics are important to creators; our numbers need to be as accurate as possible.
I looked at some other Public items that were recently deposited and they have very high downloads:
1144 downloads for this thesis but only 12 pageviews
928 downloads with 15 pageviews
Expected behavior
Analytics to exclude bots and crawlers.
Actual behavior
Peer Review of Research Data Submissions (v692td402)
Went live 2/18/20
Analytics: 2/18: 43 downloads; 2/19: 86; 2/20: 85
Record pageviews: 2

Remediation Data Management Plans (vm40xz548)
Went live 2/18/20
Analytics: 2/18: 47 downloads; 2/19: 86; 2/20: 85
Record pageviews: 4

Give Them What They Want (9593v2274)
Went live 2/19/20, mid-day
Analytics: 2/19: 52 downloads; 2/20: 85
Record pageviews: 4
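One way to make the mismatch concrete is to compute downloads per pageview for the three records above. The sketch below uses only the numbers reported here. The threshold `SUSPICIOUS_RATIO` is an arbitrary illustrative value, not an agreed cutoff, and `downloads_per_pageview` is a hypothetical helper.

```python
# Daily download counts and record pageviews as reported above.
records = {
    "v692td402": {"downloads": [43, 86, 85], "pageviews": 2},
    "vm40xz548": {"downloads": [47, 86, 85], "pageviews": 4},
    "9593v2274": {"downloads": [52, 85], "pageviews": 4},
}

# Hypothetical threshold: flag works with more than 10 downloads per pageview.
SUSPICIOUS_RATIO = 10


def downloads_per_pageview(record):
    """Total downloads divided by pageviews (infinite if there were no views)."""
    total = sum(record["downloads"])
    views = record["pageviews"]
    return total / views if views else float("inf")


for work_id, record in records.items():
    ratio = downloads_per_pageview(record)
    flag = "suspicious" if ratio > SUSPICIOUS_RATIO else "ok"
    print(f"{work_id}: {sum(record['downloads'])} downloads, "
          f"{record['pageviews']} pageviews, ratio {ratio:.2f} ({flag})")
```

All three records come out well above the threshold (107, 54.5, and 34.25 downloads per pageview), which is hard to square with humans arriving at a landing page before downloading.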