publiclab / plots2

a collaborative knowledge-exchange platform in Rails; we welcome first-time contributors! :balloon:
https://publiclab.org
GNU General Public License v3.0

Explainable discrepancy in tag stats CSV download counts and tab labels on /tag/air-quality #8642

Open jywarren opened 4 years ago

jywarren commented 4 years ago

Jeanette from the PL staff noted a discrepancy: when downloading a CSV and summing notes, questions, and wikis, the totals she got were:

From /stats: notes = 206; questions = 97; wikis = 42

However, this was for the date range 01-01-2010 to 14-10-2020: https://publiclab.org/tag/air-quality/stats?utf8=%E2%9C%93&start=01-01-2010&end=14-10-2020&commit=Go

These don't match the tab totals shown at https://publiclab.org/tag/air-quality, of:

247 notes, 140 questions, 53 wikis (one more question has appeared since Jeanette's screenshot, which showed 139)


Exact discrepancy

A full date range CSV I downloaded showed:

303 notes | 97 questions | 42 wikis

That means we are showing discrepancies of:

-56 notes | 42 questions | 11 wikis (positive numbers mean the /tag page shows that many MORE than the stats CSV; negative means fewer)
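
To make the arithmetic explicit, here is a minimal Ruby sketch that recomputes the discrepancies from the totals quoted in this thread. The method and variable names are illustrative only and are not part of plots2. The question total uses 139 (the count at the time of the screenshot) rather than the current 140.

```ruby
# Totals copied from this thread; nothing here queries the database.
# Positive results mean the /tag page shows more than the stats CSV.
def discrepancies(tag_page, csv)
  tag_page.to_h { |type, count| [type, count - csv.fetch(type)] }
end

tag_page_totals = { notes: 247, questions: 139, wikis: 53 }
csv_totals      = { notes: 303, questions: 97,  wikis: 42 }

diffs = discrepancies(tag_page_totals, csv_totals)
puts diffs.map { |type, diff| "#{diff} #{type}" }.join(' | ')
# prints "-56 notes | 42 questions | 11 wikis"
```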

Known sources of discrepancy

First, some of the questions shown on the /tag page are notes tagged with question:air-quality that lack the air-quality tag itself. This accounts for some or all of the 139 - 97 = 42 question discrepancy.
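
As a rough illustration of that first source, the sketch below picks out nodes that carry the question:-prefixed tag but not the base tag. It works on plain hashes with made-up nids, not on the plots2 Node model, and the method name is hypothetical.

```ruby
# Select nodes tagged "question:<tag>" that do not also carry "<tag>".
# These would appear under the questions tab on the /tag page but would
# not be matched by a query on the base tag name alone.
def question_tag_only(nodes, tag)
  nodes.select do |node|
    node[:tags].include?("question:#{tag}") && !node[:tags].include?(tag)
  end
end

# Made-up sample data for illustration.
sample_nodes = [
  { nid: 1, tags: ['question:air-quality', 'air-quality'] },
  { nid: 2, tags: ['question:air-quality'] },
  { nid: 3, tags: ['air-quality'] }
]

missed = question_tag_only(sample_nodes, 'air-quality')
puts missed.map { |node| node[:nid] }.inspect
```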

Second, the stats pages do not count notes, questions, or wikis whose tags have air-quality as a parent tag (a system we are trying to phase out). The last line of this section of code shows those extra nodes getting included for the /tag/air-quality page.

I was able to find 61 notes and 11 wikis that bear a child tag of air-quality, which has affected this count. That seems to account for the wikis discrepancy.

irb(main):034:0> Node.where(status: 1, type: 'note').includes(:revision, :tag).references(:term_data, :node_revisions).where('term_data.name = ?', 'air-quality').size
=> 304
irb(main):035:0> Node.where(status: 1, type: 'note').includes(:revision, :tag).references(:term_data, :node_revisions).where('term_data.name = ? OR term_data.parent = ?', 'air-quality', 'air-quality').size
=> 365

After accounting for the 61 extra notes, we actually have 61 + 56 = 117 notes in the CSV that are not shown on the /tag page.

But, according to these lines, we exclude all questions of any kind from this note count. Let's see how that affects the count:

irb(main):035:0> Node.where(status: 1, type: 'note').includes(:revision, :tag).references(:term_data, :node_revisions).where('term_data.name = ? OR term_data.parent = ?', 'air-quality', 'air-quality').where('node.nid NOT IN (?)', @qids).size
=> 247

So, that took us from 365 to 247, if we are including parent tags. That's the number shown on /tag/air-quality.

Without counting parent tags OR questions, we get 206 notes - that's vs. 303 in the CSV.
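
The steps above can be retraced with plain arithmetic. This is only a sketch using the console results and totals quoted in this thread; the variable names are illustrative, and the interpretation in the comments follows the reasoning here rather than anything verified against the database.

```ruby
# Counts copied from the console output and CSV totals in this thread.
base_tag_notes       = 304 # term_data.name = 'air-quality'
with_child_tag_notes = 365 # name OR parent = 'air-quality'
tag_page_notes       = 247 # as above, minus nids in @qids
csv_notes            = 303 # full date range CSV

# Notes matched only through a child tag of air-quality.
child_tag_only_notes = with_child_tag_notes - base_tag_notes

# Nodes dropped by the "NOT IN (@qids)" filter.
removed_as_questions = with_child_tag_notes - tag_page_notes

# CSV notes that the /tag page does not show, per the 61 + 56 reasoning.
csv_only_notes = child_tag_only_notes + (csv_notes - tag_page_notes)

puts "child-tag-only notes: #{child_tag_only_notes}"
puts "removed as questions: #{removed_as_questions}"
puts "CSV notes not on /tag page: #{csv_only_notes}"
```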

Let's look at where the CSV is being compiled:

https://github.com/publiclab/plots2/blob/27a3839154e0cec071860448f99697f9c831042c/app/models/tag.rb#L216-L239

This is a little convoluted, but I traced through it and it seems OK.

Running Tag.nodes_for_period() on the whole 10-year span returned 248, which is only 1 off from the 247 shown on the tag page:

irb(main):051:0> Tag.nodes_for_period('note',nids,(Time.now - 10.year).to_i, Time.now.to_i).size
=> 248

That's for the same nids collection as we got for the tags page - with parent tags, and excluding questions. Let's try running it without the parent tags, but leaving the questions in...

irb(main):060:0> Tag.nodes_for_period('note',nids,(Time.now - 10.year).to_i, Time.now.to_i).size
=> 305

OK, so the discrepancy seems to be (within an error of 2 notes) that the stats are excluding parent tags and including questions.
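
To illustrate why the two counts diverge, here is a toy in-memory model of the two counting rules. It is a sketch under stated assumptions, not the plots2 implementation: nodes are plain hashes, `parent_of` and `example-child-tag` are hypothetical stand-ins for the term_data.parent relationship, and the `:question` flag stands in for membership in @qids.

```ruby
# Does the node carry the tag, optionally counting child tags of it?
def tagged?(node, tag, parent_of, include_children:)
  node[:tags].any? do |t|
    t == tag || (include_children && parent_of[t] == tag)
  end
end

# /tag page rule as described above: base or child tag, questions excluded.
def tag_page_note_count(nodes, tag, parent_of)
  nodes.count do |n|
    n[:type] == 'note' && !n[:question] &&
      tagged?(n, tag, parent_of, include_children: true)
  end
end

# Stats CSV rule as described above: base tag only, questions included.
def stats_note_count(nodes, tag, parent_of)
  nodes.count do |n|
    n[:type] == 'note' && tagged?(n, tag, parent_of, include_children: false)
  end
end

# Hypothetical child => parent tag map and made-up nodes.
parent_of = { 'example-child-tag' => 'air-quality' }
toy_nodes = [
  { nid: 1, type: 'note', question: false, tags: ['air-quality'] },
  { nid: 2, type: 'note', question: true,  tags: ['air-quality', 'question:air-quality'] },
  { nid: 3, type: 'note', question: false, tags: ['example-child-tag'] },
  { nid: 4, type: 'note', question: false, tags: ['example-child-tag'] },
  { nid: 5, type: 'wiki', question: false, tags: ['air-quality'] }
]

tag_page_count = tag_page_note_count(toy_nodes, 'air-quality', parent_of)
stats_count    = stats_note_count(toy_nodes, 'air-quality', parent_of)
puts "tag page: #{tag_page_count}, stats CSV: #{stats_count}"
```

In this toy data the tag page counts nodes 1, 3, and 4 while the stats rule counts nodes 1 and 2, so the same set of nodes yields two different note totals without either query being wrong.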


Takeaway

I believe this means that we don't need to change any queries, but we should add some of these caveats to the stats pages for those wondering. I can make an FTO (first-timers-only issue) once we settle on explanatory text!

Linking this thread to this explanation of questions counts on tag pages: https://github.com/publiclab/plots2/issues/8246

jywarren commented 4 years ago

The explanatory text currently says:

The graphs above are stacked, and questions are counted both on their own as well as part of the tally for notes (because they are a form of note).

So the text could be expanded to:

The graphs above are stacked, and questions are counted both on their own as well as part of the tally for notes (because they are a form of note). Additional discrepancies may come from the tag page also listing questions tagged with "question:_____" but lacking the base tag, and also listing notes with only "child tags" of the base tag, in a system we are planning to slowly deprecate.

jywarren commented 4 years ago

Link to "deprecating tag aliasing" - https://github.com/publiclab/plots2/issues/6367