jywarren / plots2

The Public Lab website!
http://publiclab.org
GNU General Public License v3.0

de-duplicate DrupalTag and DrupalCommunityTag into just Tag #181

Closed jywarren closed 10 years ago

jywarren commented 11 years ago

remove spaced tags too!

jywarren commented 10 years ago

I have a draft of this in the consolidate-tags branch of my repo (linked to just above) but want to try this out on an up to date db copy first. Asking Dogi for a test server. This also echoes all deleted things to console so we can visually confirm things that go right/wrong. We keep old copies of the whole db, so at the end of the day we have an "oh $#!*" backup plan too.

The whole thing is also wrapped in a transaction, and I've added lines converting all active tables to InnoDB so that'll actually work.
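The engine conversion matters because MySQL's MyISAM engine does not support transactions, so a rollback would leave MyISAM changes in place. A minimal sketch of generating the conversion statements follows; apart from `term_node`, which is named later in this thread, the table names are placeholders, and `innodb_conversion_sql` is a hypothetical helper rather than code from the migration.

```ruby
# Hypothetical list of tables the migration touches; only `term_node`
# is named in this thread, the others are placeholders.
TABLES = %w[term_node term_data community_tags].freeze

# Build one ALTER TABLE statement per table. InnoDB tables honor
# ROLLBACK; MyISAM tables do not.
def innodb_conversion_sql(tables)
  tables.map { |t| "ALTER TABLE `#{t}` ENGINE = InnoDB;" }
end

puts innodb_conversion_sql(TABLES)
```

Note that `ALTER TABLE` causes an implicit commit in MySQL, so statements like these would need to run before the transactional part of the migration begins.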

jywarren commented 10 years ago

Haven't added a step removing spaces from tags, but we can do that too. Also once this is done, we can go in and remove a lot of extra lines where we had to query both DrupalNodeCommunityTag and DrupalNodeTag. Then we can do more efficient joins and lots of other fun things.

btbonval commented 10 years ago

For more github issue searching power: deduplicate deduplication duplicate duplication tags tag

http://publiclab.org/wiki/gsoc-2014 has two "gsoc-2014" tags. Jeff thinks this might be two distinct kinds of tag with the same name. I could go into the database to check or I could wait to see if Jeff's dedup fix takes care of the issue whenever it gets run. Probably the latter is more efficient.

jywarren commented 10 years ago

This seems to have run correctly on http://dev.publiclab.org; a log of deleted orphaned DrupalTags is stored in consolidated-tags.log in /home/warren of dev.publiclab.org

One exception: it did delete the DrupalTag for "all notes EVAR!1!" -- which I believe deletes the universal subscription.

What's the best way to confirm that this worked without destroying data?

btbonval commented 10 years ago

Yeah the "all notes EVAR!1!" tag is a special one. I believe it doesn't matter what it is called, so long as it is id=0. There shouldn't be a problem as long as some tag has id=0.

Did you change the model name or just the underlying table for the new Tag?

I imagine the best way to check it is to 1) click some known posts to see if their tags are still in order and displayed (compare dev to regular) 2) try to subscribe to a tag 3) create a new post with that tag

I suppose step 3 doesn't matter since email isn't gonna happen. Maybe rewrite the mailer to drop the email to a file?

jywarren commented 10 years ago

I didn't change the model or the name; just:

  1. consolidated everything into the DrupalNode > DrupalNodeCommunityTag > DrupalTag association rather than DrupalNode > DrupalNodeTag > DrupalTag
  2. edited all DrupalNodeCommunityTags to point to the same DrupalTag if they have the same DrupalTag.name
  3. deleted all orphaned DrupalTags
  4. ensured that adding new tags adds to an existing DrupalTag instead of creating a new one
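The four steps can be sketched in plain Ruby. This is an illustration under assumptions, not the migration itself: `Tag` and `NodeTag` are stand-in structs for DrupalTag and DrupalNodeCommunityTag rows, and `consolidate` and `find_or_create_tag` are hypothetical helper names.

```ruby
Tag     = Struct.new(:tid, :name)
NodeTag = Struct.new(:nid, :tid) # stands in for a DrupalNodeCommunityTag row

# Point every join row at one canonical tag per name, then split the
# tags into those still referenced and those left orphaned.
def consolidate(tags, node_tags)
  canonical = {} # name => first Tag seen with that name
  remap     = {} # old tid => canonical tid
  tags.each do |tag|
    canonical[tag.name] ||= tag
    remap[tag.tid] = canonical[tag.name].tid
  end
  node_tags.each { |nt| nt.tid = remap.fetch(nt.tid) }
  used = node_tags.map(&:tid).uniq
  tags.partition { |t| used.include?(t.tid) } # => [kept, orphaned]
end

# Step 4: reuse an existing tag with the same name instead of adding a copy.
def find_or_create_tag(tags, name)
  tags.find { |t| t.name == name } ||
    Tag.new(tags.map(&:tid).max.to_i + 1, name).tap { |t| tags << t }
end

tags      = [Tag.new(1, "groundwater"), Tag.new(2, "groundwater"), Tag.new(3, "kite")]
node_tags = [NodeTag.new(10, 1), NodeTag.new(11, 2), NodeTag.new(12, 3)]
kept, orphaned = consolidate(tags, node_tags)
puts "kept: #{kept.map(&:tid).inspect}, orphaned: #{orphaned.map(&:tid).inspect}"
```

In this toy data both "groundwater" links end up on tid 1, and the second "groundwater" tag is the only orphan.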

Overall, things check out. My worry is that some old DrupalNodeTags were not ported over, or that in the consolidation, we may have lost DrupalNodeCommunityTags. I just need to do this rigorously, checking a sampling of those, to see that they made it. Maybe also tally total # of each and be sure the #s still check out. But this seems to have gone well.

The next step would be to change the table/model names to just "Tag" and get rid of all redundant code, which should be fun as we'll get to be more optimal.

jywarren commented 10 years ago

All of these steps are annotated in the migration, so you can check if you have a moment: https://github.com/jywarren/plots2/blob/consolidate-tags/db/migrate/20140429190219_consolidate_tags.rb#L28

btbonval commented 10 years ago

Okay. Just making sure the subscription code won't break for ALL NOTES EVER?!! which hard codes DrupalTag check for id 0. Sad pandas were sad that day, b/c hard code.

When you say "orphaned" tag, do you mean tags that are not associated with any notes? Should a tag be considered orphaned if a user subscribed to the tag pre-emptively? Or in the case of ALL NOTES EVEAR!!1!, it was never associated with a note, but users definitely subscribe to it.

jywarren commented 10 years ago

yeah i'm def. going to fix the ALL NOTES exception before merging this into master

Oh, i had forgotten that we can now have tags with no nodes but that do have subscriptions. You're right, i'll have to remove those from the definition of "orphaned". That should solve the ALL NOTES thing too. Good catch!
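A sketch of the revised "orphaned" definition, assuming a tag counts as in use if either a node link or a subscription points at it. The struct and the `orphaned_tags` helper are hypothetical; the real check lives in the migration.

```ruby
Tag = Struct.new(:tid, :name)

# A tag is an orphan only if nothing references it: no node links
# and no subscriptions (tag selections).
def orphaned_tags(tags, linked_tids, subscribed_tids)
  tags.reject do |t|
    linked_tids.include?(t.tid) || subscribed_tids.include?(t.tid)
  end
end

tags = [Tag.new(0, "all notes EVAR!1!"), Tag.new(1, "kite"), Tag.new(2, "unused")]
# tid 0 has subscribers but no nodes, tid 1 has nodes, tid 2 has neither.
puts orphaned_tags(tags, [1], [0]).map(&:name).inspect
```

With this rule the id 0 tag survives because users subscribe to it, even though it was never attached to a note.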

btbonval commented 10 years ago

Yeah we're sorta talking past each other in parallel. You sent me the link to code while I asked the question which the code answered.

|| nt.tag.tag_selection.nil?

tag_selection belongs to drupal_tag, but I notice drupal_tag doesn't has_many tag_selection. so that might need to get fixed for the above to work? Not sure if ActiveRecord implicitly figures out forwards and backwards relationships the way all Ruby infers everything, or if one really does need to supply both belongs_to and has_many.

jywarren commented 10 years ago

i actually just added a has_many :tag_selection but yep. I'll take a break from coding now so we don't expend any more redundant brain power :-)

thanks bryan! btw, it's great peace of mind to have this dev.publiclab.org server to test things on; although now I have to reimport a fresh copy of the database to do it properly.

btbonval commented 10 years ago

Yup. I like virtual machine appliances running on something like VirtualBox. Once you get the thing up and running, you save the appliance (which can be less than 2 GiB even if the hard drive is 10 GiB large), destroy the machine when you mess up, reload the appliance (which is pretty quick), and then try again.

It's basically a video game save state for an entire system.

dev.publiclab.org might be a VM, but I don't see you having the nice ability to save and reload state the way I would normally do with VirtualBox. At least not without Dogi getting involved. -Bryan

btbonval commented 10 years ago

Downside to virtualbox style VMs: they don't run on my netbook at all.

Remote VMs like Dogi sets up are really nice for netbook dev work ;)

jywarren commented 10 years ago

Ah, we should ask him if that's possible. Even just saving a copy of the database to overwrite would get us most of the way there.

btbonval commented 10 years ago

Destroying the VM and rebooting it from an image will almost certainly be as fast or faster than destroying and loading the database. The image is just a binary bit-for-bit thing, whereas the database tries to process data as it goes, usually with text and numbers and junk.

Definitely worth asking Dogi if there's a save state or a machine export/import option to speed up dev system state recovery.

jywarren commented 10 years ago

well, the save/reinstate may have been responsible for the database error i got when first starting this one up. So we'd want to take the main site down gracefully before making a nice copy or something. Or we could nicely shut down a clean copy of dev.publiclab.org, and use the gracefully shut-down version as the template.

btbonval commented 10 years ago

I thought that table corruption error was present on the main public lab as well? Isn't that why new notes couldn't be posted?

Saving state shouldn't cause a problem for the current image; it might corrupt the machine running from a restored image though.

jywarren commented 10 years ago

sorry if you weren't following that issue - I saw a second instance of the error on the dev server. scary. But i thought it was probably related to the copy operation:

https://github.com/publiclab/plots2/issues/59

btbonval commented 10 years ago

I saw it on the dev site, but in my head, I was relating it more to being the same as the problem on the main site. I suspect whatever caused the main site to corrupt a table likely caused the same corruption on dev.

jywarren commented 10 years ago

also, almost forgot to re-assign all TagSelections to the new tids. Oops.
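Re-assigning subscriptions could look roughly like this, assuming the consolidation step leaves behind a map from each old tid to the surviving tid. `TagSelection` here is a stand-in struct and `reassign_selections` a hypothetical helper.

```ruby
TagSelection = Struct.new(:user_id, :tid)

# `remap` maps an old tid to the tid of the tag that survived
# consolidation; tids missing from the map are left unchanged.
def reassign_selections(selections, remap)
  selections
    .map { |s| TagSelection.new(s.user_id, remap.fetch(s.tid, s.tid)) }
    .uniq { |s| [s.user_id, s.tid] } # one subscription per user and tag
end

selections = [TagSelection.new(1, 2), TagSelection.new(1, 1), TagSelection.new(2, 3)]
puts reassign_selections(selections, { 2 => 1 }).map(&:to_a).inspect
```

The `uniq` step covers the case where a user was subscribed to two copies of the same tag and would otherwise end up subscribed twice.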

jywarren commented 10 years ago

Also gotta consider capitalization.

btbonval commented 10 years ago

MySQL is usually case insensitive, but I guess if you're doing regex in ruby, then yeah case insensitive is a thing to do.

The easiest way is generally to apply a lower case function (built into the database or ruby) to the thing in question (tag in this case) before equality comparisons.
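As a small illustration of the lower-casing approach: Ruby's `==` on strings is case-sensitive, so names are compared by their downcased form. `case_variants` is a hypothetical helper for spotting names that differ only by capitalization.

```ruby
# Group tag names by their lower-cased form and keep only the groups
# that contain more than one distinct spelling.
def case_variants(names)
  names.group_by(&:downcase).select { |_, group| group.uniq.size > 1 }
end

p case_variants(["Kite", "kite", "ndvi", "NDVI", "gsoc"])
```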

jywarren commented 10 years ago

OK, i believe this is ready for testing once the test server is up.

jywarren commented 10 years ago

i added removal of tags with spaces in names, and decapitalization and a log of each type of record before and after so we can make sure it tallies up.
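A before/after tally can be checked mechanically. In this sketch the counts are the figures reported from the second test-server run later in this thread; `deltas` is a hypothetical helper, not part of the migration.

```ruby
# Record counts before and after the migration, as reported from the
# second test-server run later in this thread.
before = { "Tags" => 9589, "NodeTags" => 4, "CommunityNodeTags" => 11179 }
after  = { "Tags" => 4005, "NodeTags" => 0, "CommunityNodeTags" => 11181 }

# Difference per record type (negative means rows were removed).
def deltas(before, after)
  before.map { |name, count| [name, after.fetch(name) - count] }.to_h
end

deltas(before, after).each do |name, diff|
  puts format("%-18s %+d", "#{name}:", diff)
end
```

The differences come out to 5584 fewer Tags, 4 fewer NodeTags and 2 more CommunityNodeTags, matching the "Fewer"/"More" lines in that log.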

jywarren commented 10 years ago

This line may create more duplicates if there are tags with spaces that become redundant with existing ones: https://github.com/jywarren/plots2/blob/consolidate-tags/db/migrate/20140429190219_consolidate_tags.rb#L132

jywarren commented 10 years ago

I ran it a few times, debugged on the test server, and once it finally worked, got the below output. However, some partial runs may have messed with it so I'm now reimporting the orig db and trying one more time, then I'll compare some wiki pages with old tags.

======= BEGIN TAG CONSOLIDATION ========
Tags:              9589
NodeTags:          4
CommunityNodeTags: 11179
========================================
Duplicate tags:    0
========================================
=======  END TAG CONSOLIDATION  ========
Tags:              9589
NodeTags:          4
CommunityNodeTags: 11179
========================================
Fewer Tags:             5584
Fewer NodeTags:         4
More CommunityNodeTags: 2
========================================
Duplicate tags:    0
========================================

jywarren commented 10 years ago

Now i get:

======= BEGIN TAG CONSOLIDATION ========
Tags:              9589
NodeTags:          4
CommunityNodeTags: 11179
========================================
Duplicate tags:    0
========================================
=======  END TAG CONSOLIDATION  ========
Tags:              4005
NodeTags:          0
CommunityNodeTags: 11181
========================================
Fewer Tags:             5584
Fewer NodeTags:         4
More CommunityNodeTags: 2
========================================
Duplicate tags:    0
========================================

I think too many tags are getting deleted... they couldn't really all be orphans... have to cross-check. Sampling of deleted ones:

community-mapping,mapping,frac-sand,fracking,silica,farm,ifarm,ifarm2014,farmhack,uas,uav,aerial-imaging,remote-sensing,sensor-networks,internet-of-things,event,boston,new-hampshire,agriculture,crops,northeast,garden,organizers,community,farm,ifarm,ifarm2014,farmhack,event,boston,new-hampshire,agriculture,crops,northeast,garden,balloon-mapping,kite-mapping,kite,kites,ndvi,near-infrared-camera,infragram,infrared,community,gsoc,gsoc2014,spectralworkbench,community,organizers,silica,air-quality,particulate-sensing,particulates,dust,southeast,southeast,western-north-carolina,list:plots-southeast,tabbed:notes,tabbed:wikis,list:plots-southeast,tabbed:notes,tabbed:wikis,southeast,regional,regional,regional,regional,region,region,region,region,region,nn,gulf-coast,southeast,list:plots-southeast,organizers,chapter,autokap,arduino,kap,panorama,servo,control,autokap,arduino,panorama,aerial-photography,on,india,development,zoom,lense,digital,camera,farm,ifarm,ifarm2014,farmhack,event,boston,new-hampshire,agriculture,crops,northeast,garden,3dprint,frac-sand,sand,silica,water,water-quality,groundwater,wisconsin

Fastie commented 10 years ago

Oh, is this why tags are disappearing from my most recent notes?

jywarren commented 10 years ago

no -- this is all being run on the test server. Can you point to which note? If there is an unrelated tag deletion bug, that's alarming!

jywarren commented 10 years ago

that's the whole point of the weeks-long attempt to get a good test server running... so we give high-risk code like this a thorough shakedown and diagnosis before running it on the real database and server

jywarren commented 10 years ago

I believe the deleted tags are all due to consolidation of DrupalTag -- all DrupalNodeCommunityTags are being pointed at the same DrupalTag instead of each having their own. For example, the "groundwater" tag, marked as deleted, still has the same # of instances -- presumably they were all just pointed at different, uncombined copies of DrupalTags named "groundwater" -

http://test.publiclab.org/tag/groundwater vs http://publiclab.org/tag/groundwater (this comparison will only be valid until we either reset the testing server or actually run the migration on the live server)

A comparison of http://test.publiclab.org with http://publiclab.org as of right now should help us assess if the migration worked. I'll update here if I reimport or anything changes.

jywarren commented 10 years ago

The migration also reported:

node_tags: 4 deleted invalids:

node_tags for active pages: 4 failed: 0 tags:

dupes: 2 deleted after migrating: proven-in-the-field,warren failed: 0 tags:

dupes: 6417

Fastie commented 10 years ago

This note ended up with only three of its 12 or so tags, but it was probably my error. Maybe there was a space in the list when I first published it. Did you know a single space stops the parsing and subsequent tags in the list are ignored? I added back all the tags I could remember. I was suspicious that my previous note had lost some tags too, but I can't be sure. I just blamed you because accepting undeserved blame is your job.
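For illustration only, a space-tolerant parser for a comma-separated tag list might look like the sketch below. It is not the site's actual parser (that discussion belongs to the spaces-in-tags issue), and `parse_tags` is a hypothetical name.

```ruby
# Split on commas, trim whitespace around each tag, and drop empty
# entries, so a stray space does not cut the list short.
def parse_tags(input)
  input.split(",").map(&:strip).reject(&:empty?)
end

p parse_tags("multirotor, uas ,uav,, ifarm2014")
```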

jywarren commented 10 years ago

:-) happy to accept it, of course! thanks for a keen eye as always. I believe we fixed the spaces-in-tags bug, but maybe not for all possible ways to tag -- can you tell us where you'd input those tags originally? or maybe we broke the fix at some point.

jywarren commented 10 years ago

@fastie, check out https://github.com/publiclab/plots2/issues/18

jywarren commented 10 years ago

Chris - i also "downcased" all tags in this migration, which makes some things easier on the code end and feeds my obsessive compulsive desire for conformity. But now's the time to try to convince me otherwise -- see the tag capitalization differences between http://publiclab.org/notes/cfastie/04-28-2014/the-titan-2-rig and http://test.publiclab.org/notes/cfastie/04-28-2014/the-titan-2-rig

jywarren commented 10 years ago

Some de-duplication is not happening correctly: http://test.publiclab.org/wiki/balloon-mapping has "proven-in-the-field" twice. Checking why, but http://publiclab.org/wiki/balloon-mapping has that tag with spaces and capitalization, so i think the downcasing etc happened after the de-duplication?
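The ordering problem can be reproduced in a few lines. This sketch assumes normalization means trimming, downcasing, and turning runs of spaces into dashes, which is consistent with the tag names seen in this thread; `normalize` is a hypothetical helper.

```ruby
# Assumed normalization: trim, lower-case, spaces become dashes.
def normalize(name)
  name.strip.downcase.gsub(/\s+/, "-")
end

names = ["Proven in the field", "proven-in-the-field", "warren"]

# De-duplicating first misses the pair, because the raw strings differ.
dedupe_then_normalize = names.uniq.map { |n| normalize(n) }
# Normalizing first lets uniq treat them as the same tag.
normalize_then_dedupe = names.map { |n| normalize(n) }.uniq

p dedupe_then_normalize
p normalize_then_dedupe
```

The first order leaves "proven-in-the-field" in the list twice, which is the symptom seen on the balloon-mapping page; the second order leaves it once.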

jywarren commented 10 years ago

4 NodeTags from the audit log is correct; the live server shows:

SELECT COUNT(*) FROM `term_node`
=> 4

jywarren commented 10 years ago

chris, did you re-add the 12 tags in http://publiclab.org/notes/cfastie/05-25-2014/ifarm-quad-flight ?

jywarren commented 10 years ago

Gotcha! on https://github.com/publiclab/plots2/blob/consolidate-tags/db/migrate/20140429190219_consolidate_tags.rb#L114 - it marks it as a duplicate but does not delete it. Adding deletion if it already exists for that node.
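A sketch of the fix described here: when a legacy link duplicates one the node already has, drop it instead of copying it. `NodeTag` is a stand-in struct, `port_links` is a hypothetical helper, and the tids are made up; node 22 is the balloon-mapping page mentioned later in the thread.

```ruby
NodeTag = Struct.new(:nid, :tid)

# Copy legacy links into the community links. A legacy link whose
# (nid, tid) pair already exists is dropped rather than copied.
# Returns the dropped links so they can be logged.
def port_links(legacy, community)
  existing = community.map { |l| [l.nid, l.tid] }
  dropped  = []
  legacy.each do |link|
    key = [link.nid, link.tid]
    if existing.include?(key)
      dropped << link
    else
      community << NodeTag.new(link.nid, link.tid)
      existing << key
    end
  end
  dropped
end

# Four legacy links on node 22, two of them duplicates (tids are made up).
legacy    = [NodeTag.new(22, 1), NodeTag.new(22, 1), NodeTag.new(22, 2), NodeTag.new(22, 2)]
community = []
dropped   = port_links(legacy, community)
puts "ported: #{community.size}, dropped: #{dropped.size}"
```

With four legacy links of which two are duplicates, two are ported and two are dropped, in line with the "Fewer NodeTags: 4" and "More CommunityNodeTags: 2" figures in the log.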

jywarren commented 10 years ago

now re-running the import for a final check

Fastie commented 10 years ago

I manually added nine tags to that note. My recollection is that multirotor, uas, and uav were already there. There was a long list before I published the note, and I don't know when most of that list disappeared.

jywarren commented 10 years ago

Hi, Chris - can you reopen https://github.com/publiclab/plots2/issues/18 if you think that's still not resolved?

btbonval commented 10 years ago

Yay Jeff! It's so close I can taste it.

Did you already write the code that makes sole use of Tag (instead of DrupalTag+DrupalCommunityTag)? Does that code prevent future duplicates from being added? I sort of lost track of where things were at in this thread ;)

jywarren commented 10 years ago

We are only at removing the redundancy between DrupalNodeTag and DrupalNodeCommunityTag, which will then allow us to simplify some code. I'm also making sure there aren't duplicate DrupalTags, which will also help us optimize. We'll still have 2 tag tables, the one with the name and the m2m table connecting them to Node. I think that's useful as someday we may have UserTag joins... !!!

Fastie commented 10 years ago

I will experiment the next time I publish a note. What is the expected behavior? Should all of these produce the same result when you first publish a note:

jywarren commented 10 years ago

hey chris, can we move to the other issue for this? thanks!

jywarren commented 10 years ago

This looks good after a new attempt at migration. However, i'm seeing that the tag "warren" on the /wiki/balloon-mapping page was deleted, and i believe it was a DrupalNodeTag. That's one of only 2 deleted according to the audit log, but i want to know why.

jywarren commented 10 years ago

all 4 DrupalNodeTags ('warren', 'warren', 'Proven in the field' and 'Proven in the field') are on node 22, which is the balloon-mapping page. Odd, but makes our lives much easier. Two are duplicates, which accounts for the deletion. We are good here. Any other last checks we want to do?

jywarren commented 10 years ago

Ah, i think i solved a big problem. It was deleting too many tags, as fixed in the commit i just made; we should now see far fewer orphaned deletions and the missing "warren" tag should show up. Re-importing db now.