Closed jywarren closed 10 years ago
I have a draft of this in the consolidate-tags branch of my repo (linked to just above) but want to try this out on an up to date db copy first. Asking Dogi for a test server. This also echoes all deleted things to console so we can visually confirm things that go right/wrong. We keep old copies of the whole db, so at the end of the day we have an "oh $#!*" backup plan too.
The whole thing is also wrapped in a transaction, and I've added lines converting all active tables to InnoDB so that'll actually work.
Haven't added a step removing spaces from tags, but we can do that too. Also once this is done, we can go in and remove a lot of extra lines where we had to query both DrupalNodeCommunityTag and DrupalNodeTag. Then we can do more efficient joins and lots of other fun things.
For more github issue searching power: deduplicate deduplication duplicate duplication tags tag
http://publiclab.org/wiki/gsoc-2014 has two "gsoc-2014" tags. Jeff thinks this might be two distinct kinds of tag with the same name. I could go into the database to check or I could wait to see if Jeff's dedup fix takes care of the issue whenever it gets run. Probably the latter is more efficient.
This seems to have run correctly on http://dev.publiclab.org; a log of deleted orphaned DrupalTags is stored in consolidated-tags.log in /home/warren of dev.publiclab.org
One exception: it did delete the DrupalTag for "all notes EVAR!1!" -- which I believe deletes the universal subscription.
What's the best way to confirm that this worked without destroying data?
Yeah the "all notes EVAR!1!" tag is a special one. I believe it doesn't matter what it is called, so long as it is id=0. There shouldn't be a problem as long as some tag has id=0.
Did you change the model name or just the underlying table for the new Tag?
I imagine the best way to check it is to 1) click some known posts to see if their tags are still in order and displayed (compare dev to regular) 2) try to subscribe to a tag 3) create a new post with that tag
I suppose step 3 doesn't matter since email isn't gonna happen. Maybe rewrite the mailer to drop the email to a file?
I didn't change the model or the name; just:
- consolidated everything into the DrupalNode > DrupalNodeCommunityTag > DrupalTag association rather than DrupalNode > DrupalNodeTag > DrupalTag
- edited all DrupalNodeCommunityTags to point to the same DrupalTag if they have the same DrupalTag.name
- deleted all orphaned DrupalTags
- ensured that adding new tags adds to an existing DrupalTag instead of creating a new one
Overall, things check out. My worry is that some old DrupalNodeTags were not ported over, or that in the consolidation, we may have lost DrupalNodeCommunityTags. I just need to do this rigorously, checking a sampling of those, to see that they made it. Maybe also tally total # of each and be sure the #s still check out. But this seems to have gone well.
The next step would be to change the table/model names to just "Tag" and get rid of all redundant code, which should be fun as we'll get to be more optimal.
All of these steps are annotated in the migration, so you can check if you have a moment: https://github.com/jywarren/plots2/blob/consolidate-tags/db/migrate/20140429190219_consolidate_tags.rb#L28
Okay. Just making sure the subscription code won't break for ALL NOTES EVER?!! which hard codes DrupalTag check for id 0. Sad pandas were sad that day, b/c hard code.
When you say "orphaned" tag, do you mean tags that are not associated with any notes? Should a tag be considered orphaned if a user subscribed to the tag pre-emptively? Or in the case of ALL NOTES EVEAR!!1!, it was never associated with a note, but users definitely subscribe to it.
yeah i'm def. going to fix the ALL NOTES exception before merging this into master
Oh, i had forgotten that we can now have tags with no nodes but that do have subscriptions. You're right, i'll have to remove those from the definition of "orphaned". That should solve the ALL NOTES thing too. Good catch!
Yeah we're sorta talking past each other in parallel. You sent me the link to code while I asked the question which the code answered.
|| nt.tag.tag_selection.nil?
tag_selection belongs to drupal_tag, but I notice drupal_tag doesn't has_many tag_selection. so that might need to get fixed for the above to work? Not sure if ActiveRecord implicitly figures out forwards and backwards relationships the way all Ruby infers everything, or if one really does need to supply both belongs_to and has_many.
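The revised definition of "orphaned" discussed above (no nodes and no subscriptions) can be sketched in plain Ruby. This is only an illustration, not the migration's code: the `Tag`, `NodeTag`, and `TagSelection` structs and the `orphaned_tags` helper are hypothetical in-memory stand-ins for the real ActiveRecord models, and the sample ids are made up.

```ruby
# Hypothetical in-memory stand-ins for DrupalTag, DrupalNodeCommunityTag
# and TagSelection; the real classes are ActiveRecord models.
Tag = Struct.new(:tid, :name)
NodeTag = Struct.new(:nid, :tid)          # link between a node and a tag
TagSelection = Struct.new(:user_id, :tid) # a user's subscription to a tag

# A tag counts as orphaned only if no node uses it AND no user subscribes to it.
def orphaned_tags(tags, node_tags, selections)
  used = node_tags.map(&:tid).uniq
  subscribed = selections.map(&:tid).uniq
  tags.reject { |t| used.include?(t.tid) || subscribed.include?(t.tid) }
end

tags = [
  Tag.new(0, "all notes EVAR!1!"), # never attached to a node, but subscribed to
  Tag.new(1, "groundwater"),
  Tag.new(2, "unused-tag")
]
node_tags = [NodeTag.new(22, 1)]
selections = [TagSelection.new(7, 0)]

orphans = orphaned_tags(tags, node_tags, selections)
puts orphans.map(&:name).inspect
```

With this rule the id=0 tag survives because it has a subscriber, even though no node carries it.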
i actually just added a has_many :tag_selection but yep. I'll take a break from coding now so we don't expend any more redundant brain power :-)
thanks bryan! btw, it's great peace of mind to have this dev.publiclab.org server to test things on; although now I have to reimport a fresh copy of the database to do it properly.
Yup. I like virtual machine appliances running on something like VirtualBox. Once you get the thing up and running, you save the appliance (which can be less than 2 GiB even if the hard drive is 10 GiB large), destroy the machine when you mess up, reload the appliance (which is pretty quick), and then try again.
It's basically a video game save state for an entire system.
dev.publiclab.org might be a VM, but I don't see you having the nice ability to save and reload state the way I would normally do with VirtualBox. At least not without Dogi getting involved. -Bryan
Downside to virtualbox style VMs: they don't run on my netbook at all.
Remote VMs like Dogi sets up are really nice for netbook dev work ;)
Ah, we should ask him if that's possible. Even just saving a copy of the database to overwrite would get us most of the way there.
Destroying the VM and rebooting it from an image will almost certainly be as fast or faster than destroying and loading the database. The image is just a binary bit-for-bit thing, whereas the database tries to process data as it goes, usually with text and numbers and junk.
Definitely worth asking Dogi if there's a save state or a machine export/import option to speed up dev system state recovery.
well, the save/reinstate may have been responsible for the database error i got when first starting this one up. So we'd want to take the main site down gracefully before making a nice copy or something. Or we could nicely shut down a clean copy of dev.publiclab.org, and use the gracefully shut-down version as the template.
I thought that table corruption error was present on the main public lab as well? Isn't that why new notes couldn't be posted?
Saving state shouldn't cause a problem for the current image, it might corrupt the machine running from a restored image though.
sorry if you weren't following that issue - I saw a second instance of the error on the dev server. scary. But i thought it was probably related to the copy operation:
publiclab#59 https://github.com/publiclab/plots2/issues/59
I saw it on the dev site, but in my head, I was relating it more to being the same as the problem on the main site. I suspect whatever caused the main site to corrupt a table likely caused the same corruption on dev.
also, almost forgot to re-assign all TagSelections to the new tids. Oops.
Also gotta consider capitalization.
MySQL is usually case insensitive, but I guess if you're doing regex in ruby, then yeah case insensitive is a thing to do.
The easiest way is generally to apply a lower case function (built into the database or ruby) to the thing in question (tag in this case) before equality comparisons.
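A minimal sketch of the "lower-case before comparing" idea, in plain Ruby. The helper names `normalize_tag` and `same_tag?` are hypothetical, and turning runs of whitespace into hyphens is an assumption on my part: the thread only says that spaces and capitalization get removed from tag names.

```ruby
# Hypothetical helper: canonical form of a tag name for comparison.
# Assumption: whitespace runs become hyphens; the thread itself only
# confirms that names are downcased and spaces are removed.
def normalize_tag(name)
  name.strip.downcase.gsub(/\s+/, "-")
end

# Compare two tag names by their canonical forms rather than raw strings.
def same_tag?(a, b)
  normalize_tag(a) == normalize_tag(b)
end

puts normalize_tag("Proven in the field")
puts same_tag?("Groundwater", "groundwater ")
```

Normalizing both sides keeps the comparison independent of whatever collation the database happens to use.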
OK, i believe this is ready for testing once the test server is up.
i added removal of tags with spaces in names, and decapitalization and a log of each type of record before and after so we can make sure it tallies up.
This line may create more duplicates if there are tags with spaces that become redundant with existing ones: https://github.com/jywarren/plots2/blob/consolidate-tags/db/migrate/20140429190219_consolidate_tags.rb#L132
I ran it a few times, debugged on the test server, and once it finally worked, got the below output. However, some partial runs may have messed with it so I'm now reimporting the orig db and trying one more time, then I'll compare some wiki pages with old tags.
======= BEGIN TAG CONSOLIDATION ========
Tags: 9589
NodeTags: 4
CommunityNodeTags: 11179
========================================
Duplicate tags: 0
========================================
======= END TAG CONSOLIDATION ========
Tags: 9589
NodeTags: 4
CommunityNodeTags: 11179
========================================
Fewer Tags: 5584
Fewer NodeTags: 4
More CommunityNodeTags: 2
========================================
Duplicate tags: 0
========================================
Now i get:
======= BEGIN TAG CONSOLIDATION ========
Tags: 9589
NodeTags: 4
CommunityNodeTags: 11179
========================================
Duplicate tags: 0
========================================
======= END TAG CONSOLIDATION ========
Tags: 4005
NodeTags: 0
CommunityNodeTags: 11181
========================================
Fewer Tags: 5584
Fewer NodeTags: 4
More CommunityNodeTags: 2
========================================
Duplicate tags: 0
========================================
I think too many tags are getting deleted... they couldn't really all be orphans... have to cross-check. Sampling of deleted ones:
community-mapping,mapping,frac-sand,fracking,silica,farm,ifarm,ifarm2014,farmhack,uas,uav,aerial-imaging,remote-sensing,sensor-networks,internet-of-things,event,boston,new-hampshire,agriculture,crops,northeast,garden,organizers,community,farm,ifarm,ifarm2014,farmhack,event,boston,new-hampshire,agriculture,crops,northeast,garden,balloon-mapping,kite-mapping,kite,kites,ndvi,near-infrared-camera,infragram,infrared,community,gsoc,gsoc2014,spectralworkbench,community,organizers,silica,air-quality,particulate-sensing,particulates,dust,southeast,southeast,western-north-carolina,list:plots-southeast,tabbed:notes,tabbed:wikis,list:plots-southeast,tabbed:notes,tabbed:wikis,southeast,regional,regional,regional,regional,region,region,region,region,region,nn,gulf-coast,southeast,list:plots-southeast,organizers,chapter,autokap,arduino,kap,panorama,servo,control,autokap,arduino,panorama,aerial-photography,on,india,development,zoom,lense,digital,camera,farm,ifarm,ifarm2014,farmhack,event,boston,new-hampshire,agriculture,crops,northeast,garden,3dprint,frac-sand,sand,silica,water,water-quality,groundwater,wisconsin
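The before/after tallies in the second log above can be cross-checked mechanically. This is a rough sketch, not the migration's actual logging code; `tally_diff` is a hypothetical helper and the counts are copied from that log.

```ruby
# Hypothetical audit helper: per-table change in row count between two
# snapshots. The table labels mirror the migration log in this thread.
def tally_diff(before, after)
  before.keys.each_with_object({}) do |table, diff|
    diff[table] = after.fetch(table) - before.fetch(table)
  end
end

# Counts copied from the BEGIN and END sections of the log.
before = { "Tags" => 9589, "NodeTags" => 4, "CommunityNodeTags" => 11179 }
after  = { "Tags" => 4005, "NodeTags" => 0, "CommunityNodeTags" => 11181 }

tally_diff(before, after).each do |table, delta|
  puts format("%-18s %+d", table, delta)
end
```

The deltas agree with the "Fewer Tags: 5584", "Fewer NodeTags: 4", and "More CommunityNodeTags: 2" lines, so the log is at least internally consistent. Whether 5584 deleted tags is the right number is a separate question.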
Oh, is this why tags are disappearing from my most recent notes?
no -- this is all being run on the test server. Can you point to which note? If there is an unrelated tag deletion bug, that's alarming!
that's the whole point of the weeks-long attempt to get a good test server running... so we give high-risk code like this a thorough shakedown and diagnosis before running it on the real database and server
I believe the deleted tags are all due to consolidation of DrupalTag -- all DrupalNodeCommunityTags are being pointed at the same DrupalTag instead of each having their own. For example, the "groundwater" tag, marked as deleted, still has the same # of instances -- presumably they were all just pointed at different, uncombined copies of DrupalTags named "groundwater" -
http://test.publiclab.org/tag/groundwater vs http://publiclab.org/tag/groundwater (this comparison will only be valid until we either reset the testing server or actually run the migration on the live server)
A comparison of http://test.publiclab.org with http://publiclab.org as of right now should help us assess if the migration worked. I'll update here if I reimport or anything changes.
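To illustrate why a tag can be "deleted" while its page keeps the same number of instances, here is a small plain-Ruby sketch of the consolidation idea. It is not the migration itself: `TagRow`, `LinkRow`, and `consolidate` are hypothetical names, the ids are made up, and picking the lowest tid as the survivor is an assumption.

```ruby
TagRow = Struct.new(:tid, :name)  # stand-in for a DrupalTag row
LinkRow = Struct.new(:nid, :tid)  # stand-in for a DrupalNodeCommunityTag row

# Point every link at one canonical tag per name (assumed: the lowest tid)
# and keep only those canonical tags. Returns [kept_tags, repointed_links].
def consolidate(tags, links)
  canonical = {} # name => surviving tid
  tags.sort_by(&:tid).each { |t| canonical[t.name] ||= t.tid }
  name_of = tags.map { |t| [t.tid, t.name] }.to_h
  new_links = links.map do |l|
    LinkRow.new(l.nid, canonical.fetch(name_of.fetch(l.tid)))
  end
  kept = tags.select { |t| canonical[t.name] == t.tid }
  [kept, new_links]
end

tags = [TagRow.new(10, "groundwater"), TagRow.new(11, "groundwater"), TagRow.new(12, "silica")]
links = [LinkRow.new(1, 10), LinkRow.new(2, 11), LinkRow.new(2, 12)]

kept, new_links = consolidate(tags, links)
puts "tags: #{tags.size} -> #{kept.size}, links: #{links.size} -> #{new_links.size}"
```

One of the two "groundwater" rows disappears, but both nodes still carry the tag, which matches what the test and live tag pages show.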
The migration also reported:
node_tags: 4 deleted invalids:
node_tags for active pages: 4 failed: 0 tags:
dupes: 2 deleted after migrating: proven-in-the-field,warren failed: 0 tags:
dupes: 6417
This note ended up with only three of its 12 or so tags, but it was probably my error. Maybe there was a space in the list when I first published it. Did you know a single space stops the parsing and subsequent tags in the list are ignored? I added back all the tags I could remember. I was suspicious that my previous note had lost some tags too, but I can't be sure. I just blamed you because accepting undeserved blame is your job.
:-) happy to accept it, of course! thanks for a keen eye as always. I believe we fixed the spaces-in-tags bug, but maybe not for all possible ways to tag -- can you tell us where you'd input those tags originally? or maybe we broke the fix at some point.
@fastie, check out https://github.com/publiclab/plots2/issues/18
Chris - i also "downcased" all tags in this migration, which makes some things easier on the code end and feeds my obsessive compulsive desire for conformity. But now's the time to try to convince me otherwise -- see the tag capitalization differences between http://publiclab.org/notes/cfastie/04-28-2014/the-titan-2-rig and http://test.publiclab.org/notes/cfastie/04-28-2014/the-titan-2-rig
Some de-duplication is not happening correctly: http://test.publiclab.org/wiki/balloon-mapping has "proven-in-the-field" twice. Checking why, but http://publiclab.org/wiki/balloon-mapping has that tag with spaces and capitalization, so i think the downcasing etc happened after the de-duplication?
4 NodeTags from the audit log is correct; the live server shows:
SELECT COUNT(*) FROM `term_node` => 4
chris, did you re-add the 12 tags in http://publiclab.org/notes/cfastie/05-25-2014/ifarm-quad-flight ?
Gotcha! on https://github.com/publiclab/plots2/blob/consolidate-tags/db/migrate/20140429190219_consolidate_tags.rb#L114 - it marks it as a duplicate but does not delete it. Adding deletion if it already exists for that node.
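The fix described here (delete a link when the node already has that tag, instead of only marking it as a duplicate) might look roughly like this in plain Ruby. `dedupe_links` is a hypothetical helper and the [nid, tid] pairs are invented for illustration; only node 22 comes from the thread.

```ruby
# After repointing, the same [nid, tid] pair may occur more than once.
# Keep the first occurrence of each pair and collect the rest for deletion.
def dedupe_links(links)
  seen = {}
  links.partition do |link|
    first = !seen.key?(link)
    seen[link] = true
    first
  end
end

# Hypothetical [nid, tid] pairs: node 22 ends up with two tags twice each.
links = [[22, 5], [22, 5], [22, 9], [22, 9], [30, 5]]

kept, deleted = dedupe_links(links)
puts "kept #{kept.size}, deleted #{deleted.size}"
```

Returning the deleted pairs separately makes it easy to echo them to the log before they are removed.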
now re-running the import for a final check
I manually added nine tags to that note. My recollection is that multirotor, uas, and uav were already there. There was a long list before I published the note, and I don't know when most of that list disappeared.
Hi, Chris - can you reopen https://github.com/publiclab/plots2/issues/18 if you think that's still not resolved?
Yay Jeff! It's so close I can taste it.
Did you already write the code that makes sole use of Tag (instead of DrupalTag+DrupalCommunityTag)? Does that code prevent future duplicates from being added? I sort of lost track of where things were at in this thread ;)
We are only at removing the redundancy between DrupalNodeTag and DrupalNodeCommunityTag, which will then allow us to simplify some code. I'm also making sure there aren't duplicate DrupalTags, which will also help us optimize. We'll still have 2 tag tables, the one with the name and the m2m table connecting them to Node. I think that's useful as someday we may have UserTag joins... !!!
I will experiment the next time I publish a note. What is the expected behavior? Should all of these produce the same result when you first publish a note:
hey chris, can we move to the other issue for this? thanks!
This looks good after a new attempt at migration. However, i'm seeing that the tag "warren" on the /wiki/balloon-mapping page was deleted, and i believe it was a DrupalNodeTag. That's one of only 2 deleted according to the audit log, but i want to know why.
all 4 DrupalNodeTags ('warren', 'warren', 'Proven in the field' and 'Proven in thefield') are on node 22, which is the balloon-mapping page. Odd, but makes our lives much easier. Two are duplicates, which accounts for the deletion. We are good here. Any other last checks we want to do?
Ah, i think i solved a big problem. It was deleting too many tags, as fixed in the commit i just made; we should now see far fewer orphaned deletions and the missing "warren" tag should show up. Re-importing db now.
remove spaced tags too!