Closed GoogleCodeExporter closed 9 years ago
And I am not even sure that I have really "fixed" some of the bad links, as I
sometimes see colliding file names in different comments like this:
User comment 1: Something doesn't work, attaches config.txt
User comment 2: Workaround doesn't work either, attaches config.txt
Original comment by mrovi9...@gmail.com
on 18 Mar 2015 at 12:57
Thank you for the bug report. I will take a look.
Original comment by chrsm...@google.com
on 18 Mar 2015 at 3:21
Any update? We're looking to add an "Import from Google Code" feature to GitLab
and want to import attachments as well, but this issue is standing in our way.
Thank you.
Original comment by do...@gitlab.com
on 30 Mar 2015 at 9:16
We have added a wiki page detailing how our issue attachment mirroring works.
You can get issues without the additional token if you go straight to Google
Cloud Storage.
See: https://code.google.com/p/support-tools/wiki/IssueMirror
However our issue mirroring does have some known issues (we won't fix). We only
mirror _public_ issues. So attachments for private (Restrict-View-*) issues
will not be available. Similarly, we don't mirror attachments for deleted
issues. And, in the occasional situation where you upload two attachments in
the same comment, both having the same name, we only have one of the files on
GCS.
As for your bad-unfixable.txt file, you have uncovered something afoot in
Google Code. For example:
Bad link:
https://storage.googleapis.com/google-code-attachments/tint2/issue-471/comment-34/fedora-2015-03-15T10-43-31-220064000Z.webm
for issue 471, comment 35,
attachment {"mimetype": "application/octet-stream", "attachmentId":
"4710035000", "fileSize": 10148504, "fileName":
"fedora-2015-03-15T10-43-31-220064000Z.webm"}
The attachment file is actually found in our mirror, though as comment #31:
https://storage.googleapis.com/google-code-attachments/tint2/issue-471/comment-31/fedora-2015-03-15T10-43-31-220064000Z.webm
I have no explanation for the discrepancy, other than I probably wrote the
errant code.
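For anyone scripting against the mirror, the URL layout seen in the two links above can be sketched roughly as follows. The function name `mirror_url` and the percent-encoding of the file name are assumptions for illustration, not something the mirror documents here.

```python
from urllib.parse import quote

# Bucket prefix taken from the links quoted in this comment.
MIRROR_BASE = "https://storage.googleapis.com/google-code-attachments"


def mirror_url(project: str, issue_id: int, comment_number: int, file_name: str) -> str:
    """Build a mirror URL following the pattern of the links above.

    Percent-encoding the file name is an assumption; plain names such as the
    .webm file in this thread pass through unchanged.
    """
    return (
        f"{MIRROR_BASE}/{project}/issue-{issue_id}"
        f"/comment-{comment_number}/{quote(file_name)}"
    )


print(mirror_url("tint2", 471, 31, "fedora-2015-03-15T10-43-31-220064000Z.webm"))
```

Called with comment number 31 this reproduces the working link, and with 34 the bad one.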
Could you tell me more about how your "Import from Google Code" feature of
GitLab works? I take it GitLab supports arbitrary file attachments, so you need
to download the attachments at the time you do the import? If so, that's a
great feature. But note that there is a delay between when issue attachments
are uploaded to Google Code and when they are mirrored onto Google Cloud
Storage.
Original comment by chrsm...@google.com
on 31 Mar 2015 at 12:11
The known restrictions to the IssueMirror are not a problem.
As for migrating public attachments, we have two options, since GitLab does
indeed support arbitrary file attachments:
1. Download all attachments from the IssueMirror and reupload to GitLab when a
user requests a project import.
2. Link directly to the attachment on the IssueMirror from the new GitLab issue.
Both options depend on the IssueMirror URLs actually working, which
mrovi9...@gmail.com reports they aren't currently. The first option also
requires that we aren't rate limited or otherwise blocked by Google Cloud
Storage for downloading large numbers of files. The second option is only
viable if there is a guarantee that the IssueMirror will stay up indefinitely.
Downsides to the first option would be the storage and bandwidth requirements
on our side, and the mirror delay you mention (how large is that delay?) The
main downside to the second option would be the ongoing dependency on Google
Storage :)
We are ultimately fine with either option, but we need to be sure the
IssueMirror URLs work.
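One way to gain that confidence before an import would be to probe each generated URL. This is only a sketch: `http_head_status` and `find_bad_links` are hypothetical helper names, and it assumes the bucket answers plain HEAD requests for public objects.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_head_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a HEAD request, including 4xx/5xx codes."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the status code is still available.
        return error.code


def find_bad_links(
    urls: Iterable[str],
    head: Callable[[str], int] = http_head_status,
) -> List[str]:
    """Return the URLs whose HEAD request did not answer 200."""
    return [url for url in urls if head(url) != 200]
```

Network-level failures (DNS, timeouts) are deliberately left to propagate, so they are not mistaken for missing attachments. The `head` parameter makes it possible to swap in a throttled or fake fetcher.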
Original comment by do...@gitlab.com
on 31 Mar 2015 at 12:56
I think you should go with option two, if for no other reason than its
simplicity. As for your concerns about how long the issue mirror will stay
around, it simply is a Google Cloud Storage bucket. So the overhead is
negligible. Read: if I get hit by a bus, the data will still be there. And
unlike other parts of Google Code, issue attachments aren't as problematic of
an abuse vector.
As for the delay in mirroring, at worst it will be a few days. Due to some
technical limitations in how to bridge security from our internal data centers
and the external Google Cloud Storage, I need to run the migration manually.
I will be looking at those `bad-unfixable.txt` attachments today, as it
certainly is a bug somewhere. I'll update this issue when I've hunted it down...
Original comment by chrsm...@google.com
on 1 Apr 2015 at 7:45
All right, thank you Chris. I wasn't sure if this bucket was meant to be
permanent or if the plan was to let it go after some amount of time had passed,
but I guess there's really no point in that since the amount of data is negligible
in the grander scheme of things.
Good luck with `bad-unfixable.txt`, I hope you figure it out.
Original comment by do...@gitlab.com
on 1 Apr 2015 at 9:43
In case you were curious, the underlying issue has to do with deleted comments.
We were mirroring the issue as comment #31, but displaying the attachment on
the site as comment #35. That's because before we render the web page we go
through _all_ comments. (Including those that have been deleted[1].) Whereas
the issue mirror just goes through "live" comments.
This mismatch is why the attachment was put in the wrong place. There were four
deleted comments before the attachment was put up. So while Google Code says
you are looking at comment #35, in actuality it is only the 31st LIVE comment.
For each of the following comments, notice that the comment after it isn't
shown. For example, there is a comment #10 and #12, but no #11.
https://code.google.com/p/tint2/issues/detail?id=471#c10
https://code.google.com/p/tint2/issues/detail?id=471#c19
https://code.google.com/p/tint2/issues/detail?id=471#c27
https://code.google.com/p/tint2/issues/detail?id=471#c30
Fixing it might be a pain, but at least we know what the problem is.
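To make the mismatch concrete, here is a small sketch of the two counting schemes. The helper `live_ordinal` is hypothetical, and the deleted numbers 11, 20, 28 and 31 are inferred from the four links above (only #11 is stated outright).

```python
from typing import List, Optional


def live_ordinal(is_live: List[bool], displayed_number: int) -> Optional[int]:
    """Map a displayed comment number to its position among LIVE comments.

    `is_live[i]` describes displayed comment i + 1. Displayed numbers count
    deleted comments; the returned ordinal does not. Returns None when the
    comment itself is deleted.
    """
    if not is_live[displayed_number - 1]:
        return None
    return sum(is_live[:displayed_number])


# Assumed state of tint2 issue 471: 35 comments, four of them deleted.
deleted = {11, 20, 28, 31}
is_live = [number not in deleted for number in range(1, 36)]

print(live_ordinal(is_live, 35))  # 31
```

The site shows the first number and the mirror filed the attachment under the second.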
[1] Deleting data in large-scale replicated datastores is actually difficult.
So many times things get deleted by simply clearing a "LIVE" field, and
possibly zeroing out the data; but still leaving the placeholder object in
place. This way you don't also have to move lots of data around since you now
have an XX byte hole inside of a YY Gigabyte file.
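As a toy illustration of that footnote (not how any Google datastore is actually implemented), a soft delete might look like the sketch below. `Record` and `soft_delete` are made-up names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    payload: bytes
    live: bool = True


def soft_delete(records: List[Record], index: int) -> None:
    """Clear the LIVE flag and zero the data, but keep the slot in place.

    Later records keep their positions, so nothing has to be moved.
    """
    record = records[index]
    record.live = False
    record.payload = b"\x00" * len(record.payload)


records = [Record(b"first"), Record(b"second"), Record(b"third")]
soft_delete(records, 1)
print(len(records), [r.live for r in records])  # 3 [True, False, True]
```

Anything that counts slots still sees three records, while anything that counts live records sees two, which is the same split described above.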
Original comment by chrsm...@google.com
on 1 Apr 2015 at 10:01
So I'm assuming you're planning to actually fix the data in the bucket? It
would be easiest to simply change the docs to say the ID is based on how many
live comments come before it, but I guess that will mess up the numbering when
a comment is deleted after an attachment is mirrored.
Original comment by do...@gitlab.com
on 1 Apr 2015 at 10:13
Thanks for the explanation. I can confirm that with the attached script I do
not see any bad URLs in project tint2 :D
IIUC things may still go out of sync if someone deletes a comment after you run
the mirroring script but before the takeout is generated; and they will become
in sync again the next time you run the mirroring script?
Original comment by mrovi9...@gmail.com
on 1 Apr 2015 at 10:49
Attachments:
Looking at how we surface the data (HTML frontend, Google Takeout JSON dump,
and the GitHub exporter) it seems like the thing we need to fix is the issue
mirror's counting schema.
That, unfortunately, might require some major changes because of how that system
works.
#Summary#
Google Code comments can be deleted. In the HTML frontend, we number comments
including these deleted comments. Similarly, in the Google Code takeout we
include deleted comments in the data dump.
The problem is that in the Issue Mirror we ignore deleted comments, so when we
render a link to issue X, comment Y, "Y" refers to the Yth LIVE comment, not
the Yth comment overall.
As a workaround, you will need to filter out non-LIVE comments from your
Google Takeout dump, as @mrovi9000 did in their script.
Similarly, douwe@gitlab.com, you will need to take note of the number of
comments you see when scraping Google Code output; and not look at the comment
numbers we display in the HTML.
Re: "things may still go out of sync if someone deletes a comment after you run
the mirroring script but before the takeout is generated"
Correct. When the mirroring process runs, it will upload an attachment to
something like .../tint2/issue-X/comment-10/... Now if a comment BEFORE the
attachment comment gets deleted, AND you then run Google Takeout to export your
project issues, THEN the comment numbers won't line up.
Re: "will they become in sync again the next time you run the mirroring script?"
No. Currently, the code that does the mirroring exists in a world that doesn't
know anything about deleted comments. So as far as it knows, the comment
numbers it generates are correct.
So it seems like getting the Issue Mirror to be aware of deleted comments will
fix all known problems, so that you can just link to comment X (where X
includes both LIVE and deleted comments).
I'll start looking into how we can fix this. Sorry for the inconvenience! Once
I get the Issue Mirror attachments to include counts from deleted comments, you
should just be able to generate links to Google Cloud Storage as you would
expect. (Either using the comment numbers from our HTML, or directly from the
Google Takeout JSON dump.)
Original comment by chrsm...@google.com
on 2 Apr 2015 at 12:01
Thanks for looking into this! We're planning to include Google Code import in
GitLab 7.10, due to be released April 22nd with code freeze around the 14th.
I'm hopeful this IssueMirror issue will be fixed before then.
As an aside, is there a reason why the attachment URLs aren't included in the
Takeout dump directly? That way the specific way of counting wouldn't have
mattered anyway, as long as the right JSON attachment object had the right URL.
Original comment by do...@gitlab.com
on 2 Apr 2015 at 8:03
As of right now all attachments on Google Code should exist on Google Cloud
Storage with the right comment ID. (This impacted < 5% of all issues mirrored.)
If you see any problems please let me know.
I'll update the IssueMirror FAQ to clarify this, but the issue mirror now
exports issues to match the exact comment number that you see on the site. So
the attachment at comment #X should be on Google Cloud Storage with comment-X.
If any files are attached to an issue when it is initially reported, that is
considered "comment #0".
This also applies to Google Takeout JSON dumps. The "id" property of each
comment should correspond to the Google Cloud Storage bucket number.
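With that in place, links could be generated straight from a Takeout dump along these lines. This is a sketch under assumptions: the "id" property and the "fileName" attachment key appear in this thread, but the "attachments" list key and the overall shape of the comment objects are guesses, and `attachment_urls` is a hypothetical helper.

```python
from typing import Dict, Iterator, List

MIRROR_BASE = "https://storage.googleapis.com/google-code-attachments"


def attachment_urls(project: str, issue_id: int, comments: List[Dict]) -> Iterator[str]:
    """Yield one mirror URL per attachment, using each comment's "id" as comment-X."""
    for comment in comments:
        # "attachments" is an assumed key; comments without it are skipped.
        for attachment in comment.get("attachments", []):
            yield (
                f"{MIRROR_BASE}/{project}/issue-{issue_id}"
                f"/comment-{comment['id']}/{attachment['fileName']}"
            )


# Made-up sample data: id 0 stands for files attached to the initial report.
sample_comments = [
    {"id": 0, "attachments": [{"fileName": "config.txt"}]},
    {"id": 1},
    {"id": 35, "attachments": [{"fileName": "fedora-2015-03-15T10-43-31-220064000Z.webm"}]},
]

for url in attachment_urls("tint2", 471, sample_comments):
    print(url)
```

Because the comment number comes directly from "id", no filtering of deleted comments is needed in this version.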
Re: "is there a reason why the attachment URLs aren't included in the Takeout
dump directly?"
The issue mirror didn't exist at the time we wrote the Google Takeout support.
As for why not to add direct download links now, it's a known issue. I just
haven't been able to get to it yet.
Re: "We're planning to include Google Code import in GitLab 7.10, due to be
released April 22nd with code freeze around the 14th."
Sounds great. Please contact me (chrsmith@google.com) if there is anything I
can do for you. If you send me a link or doc with how the system works, I'd be
happy to mention it in the project's wiki.
Original comment by chrsm...@google.com
on 2 Apr 2015 at 9:50
Thanks Chris, I'll let you know if we need anything else.
Original comment by do...@gitlab.com
on 3 Apr 2015 at 9:21
Original issue reported on code.google.com by
mrovi9...@gmail.com
on 18 Mar 2015 at 12:52
Attachments: