Add information to the tag bot

Malabarba commented 9 years ago

Two things I think the tag bot should do:

- [ ] Print the tags ordered by frequency.
- [ ] Mark somehow which tags are synonyms of which.

Malabarba commented 9 years ago

This is needed for #251

vermiculus commented 9 years ago

I'll see if I can't do this over the weekend. I've spent all week working with indexing massive datasets; this seems like an essentially similar problem.

My immediate thought is to just introduce one more layer across the board:

(("main-tag-A" "synonym-1" "synonym-2" "synonym-3")
 ("main-tag-B")
 ("main-tag-C" "synonym-a") …)

It would come at a relatively nominal cost of space, but I don't think it's going to get any better. Thoughts?

My only afterthought is whether it would be wise to use vectors, but I forget exactly how we're using this structure and whether vectors would work without much hassle. IIRC, vectors give much faster indexed access because their elements are stored contiguously.
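A minimal sketch of how the grouped structure proposed above could be queried. The names `demo-tag-groups` and `demo-main-tag` are hypothetical and only illustrate the shape of the data; they are not part of sx.el.

```elisp
(require 'cl-lib)

;; Hypothetical sample data in the proposed shape:
;; each entry is (MAIN-TAG SYNONYM...).
(defvar demo-tag-groups
  '(("main-tag-A" "synonym-1" "synonym-2" "synonym-3")
    ("main-tag-B")
    ("main-tag-C" "synonym-a")))

(defun demo-main-tag (tag groups)
  "Return the main tag of the group in GROUPS that contains TAG, or nil.
TAG may be a main tag itself or one of its synonyms."
  (car (cl-find-if (lambda (group) (member tag group)) groups)))

(defun demo-main-tags (groups)
  "Return only the main tags of GROUPS, in their original order."
  (mapcar #'car groups))
```

With this layout, `(demo-main-tag "synonym-a" demo-tag-groups)` resolves a synonym to its main tag, and the flat list of main tags is still one `mapcar` away.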

Malabarba commented 9 years ago

I think it's fine. While we're doing backwards incompatible changes, might as well try printing the strings as symbols to save a bit of space.

The only problem I see is that tags containing . would be printed with a \., but there's a variable that controls that.

vermiculus commented 9 years ago

I've tried

(print-escape-nonascii
 print-charset-text-property
 print-length
 print-level
 print-circle
 print-escape-multibyte
 print-continuous-numbering
 print-escape-newlines
 print-gensym
 print-quoted)

but I can't seem to find the variable you're talking about. (That's almost every variable that begins with print.)

There's always sed if we can't find the variable.

Malabarba commented 9 years ago

No, sorry. As long as we use princ we're fine. The whole point of princ is that it doesn't quote characters.

vermiculus commented 9 years ago

Yep:

(princ 'hi.there (current-buffer))
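To make the point above concrete, here is a small sketch that captures `princ` output as a string. The helper name `demo-princ-to-string` is hypothetical. `prin1` is left out on purpose: whether it escapes the `.` can depend on the Emacs version.

```elisp
(defun demo-princ-to-string (object)
  "Return the text `princ' would print for OBJECT."
  (with-output-to-string
    (princ object)))

;; A symbol containing a dot is printed without a backslash.
(demo-princ-to-string 'hi.there)

;; The same holds for symbols nested inside a list.
(demo-princ-to-string '(hi.there emacs))
```

The first call yields the plain text `hi.there`, so no post-processing with sed should be needed for this case.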

vermiculus commented 9 years ago

Also, I believe the tags are already returned to us in order of frequency.

Yes, popular is the default sort: http://api.stackexchange.com/docs/tags. But another thought: as a tag's popularity fluctuates, its position in the list will change as well, which will increase the diff size. I thought about including the count property as well, but this would change nearly every time we pull data, and the repo history would grow without bound.

Long story short, the sorting of the printed list is going to change from alphabetic to popularity, but expect a rise in repo growth over time (unless you have a better option). I'm starting to wonder if we really should be tracking this stuff – they aren't changes we're making.
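A rough sketch of the trade-off described above: order by popularity, but keep the counts out of the printed data so that only changes in order show up in the diff. The `(NAME . COUNT)` input shape and the function name `demo-tags-by-popularity` are assumptions for illustration, not what the bot actually receives or defines.

```elisp
(defun demo-tags-by-popularity (tags)
  "Return the names in TAGS, most popular first.
TAGS is a list of (NAME . COUNT) pairs.  The counts are used only
for ordering and are dropped from the result."
  (mapcar #'car
          ;; `sort' is destructive, so work on a copy of the list.
          (sort (copy-sequence tags)
                (lambda (a b) (> (cdr a) (cdr b))))))

(demo-tags-by-popularity
 '(("emacs" . 10) ("org-mode" . 30) ("elisp" . 20)))
```

For the sample input this returns `("org-mode" "elisp" "emacs")`; a tag whose count changes without overtaking a neighbor leaves the printed list untouched.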

Perhaps… perhaps we can do a git rewrite every time we push to data? It's late and I may not be thinking clearly – I don't know what adverse effects this may have on clones, but nobody should be making changes to the data branch anyways.

Malabarba commented 9 years ago

Hm, I like that idea. A few points:

I don't foresee us needing to know the exact count, do you? If not, we should probably not print them for now because it would significantly increase the size of the files (I would guess something like a 50% increase).
I'm ok with rewriting. Instead of committing, the bot can amend and then push --force. It would only mess with people who decide to do something on the bot branch (which I'm really not concerned with).
Important: changing the order of the tags won't break anything on current master, but restructuring the list to group tags and synonyms will. This means we're going to have to change the bot directory when we make this change (to something like bot-3.0/), so that people using sx-2.0 don't get confronted with errors.
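The amend-and-force-push idea could look roughly like the sketch below. Everything specific here is an assumption: the remote name `origin`, the branch name `data`, the `--all` and `--no-edit` flags, and the names `demo-bot-git-commands` and `demo-bot-publish`. It is not how the bot is actually implemented.

```elisp
(defconst demo-bot-git-commands
  '(("commit" "--all" "--amend" "--no-edit") ; fold new data into the last commit
    ("push" "--force" "origin" "data"))      ; overwrite the remote branch
  "Argument lists passed to git, in order (all values are assumptions).")

(defun demo-bot-publish (directory)
  "Run `demo-bot-git-commands' inside DIRECTORY, stopping on the first failure."
  (let ((default-directory (file-name-as-directory directory)))
    (dolist (args demo-bot-git-commands)
      ;; `call-process' returns the exit status; 0 means success.
      (unless (equal 0 (apply #'call-process "git" nil nil nil args))
        (error "git %s failed" (mapconcat #'identity args " "))))))
```

Because every run rewrites the same commit, the branch history stays at a constant size, at the cost of breaking any clone that has local work on that branch.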

vermiculus / sx.el

Add information to the tag bot #252