Described in linear:
missing media files in pages leads to a 5xx when the resource name itself cannot be processed
as an example, consider a link (some_link)[/%2fa/b] -> this resolves to //a/b, which is not processable by our backend as it results in an error like so: Error when getting Git blob hash: Error: HEAD:images/technews//tvws-part2.JPG and in turn causes a 5xx.
this 5xx is caused because GitFileService (iirc) will try to determine the blob hash (see if can find the file on disk) and in doing so, it will get an error (as the file doesn't exist), leading to the 500.
see here for an example where this occurs
in order to reproduce this:
[ ] go to any site with an image
[ ] create or click on an existing markdown page
[ ] in the page, add an image
[ ] open network console
[ ] update the link to be /images/%2f
[ ] there should be a 5xx in the network console.
Additionally, we also see a fatal: path '.git' does not exist in 'HEAD' in the logs every time the workspace is refreshed.
This is because we rely on fileStats to get a list of directories, but forget to remove special directories such as .git.
Solution
Reason for this is that git under the hood does a file path normalision for checking if the file exists in local disk, but does not when checking when it exists in head.
consider these three examples:
We use fileStats to see if a file exists before passing it to GetGitBlobHash. issue with this is that while the file could exist in disk and still return a correct folder stats, this should still return a file not found error as the path given by the user is still incorrect. (am opting for this rather than normalising the path in the function GetGitBlobHash to keep the implementation of GetGitBlobHash lean and not think about potential path traversal attacks.)
for the special files, we simply do a filter to ignore files like .git. We know that the class of errors with the strings exists on disk and Error when getting Git blob hash mostly comes from the .git folder by looking at dd logs here.
ie all instances of exists on disk and Error when getting Git blob hash errors in prod for the past two weeks came from trying to access the .git folders.
Breaking Changes
[ ] Yes - this PR contains breaking changes
Details ...
[X] No - this PR is backwards compatible with ALL of the following feature flags in this doc
Before & After Screenshots
BEFORE:
AFTER:
Tests
in order to reproduce this:
[ ] go to any site with an image
[ ] create or click on an existing markdown page
[ ] in the page, add an image
[ ] open network console
[ ] update the link to be /images/%2f
[ ] there should be a 4xx in the network console.
[ ] go into datadog, there should not be any instance of fatal: path '.git' exists on disk, but not in 'HEAD'
Problem
Described in linear: missing media files in pages leads to a 5xx when the resource name itself cannot be processed
as an example, consider a link (some_link)[/%2fa/b] -> this resolves to //a/b, which is not processable by our backend as it results in an error like so: Error when getting Git blob hash: Error: HEAD:images/technews//tvws-part2.JPG and in turn causes a 5xx.
this 5xx is caused because GitFileService (iirc) will try to determine the blob hash (see if can find the file on disk) and in doing so, it will get an error (as the file doesn't exist), leading to the 500.
see here for an example where this occurs
in order to reproduce this:
Additionally, we also see a
fatal: path '.git' does not exist in 'HEAD'
in the logs every time the workspace is refreshed. This is because we rely onfileStats
to get a list of directories, but forget to remove special directories such as.git
.Solution
Reason for this is that git under the hood does a file path normalision for checking if the file exists in local disk, but does not when checking when it exists in head.
consider these three examples:
We use fileStats to see if a file exists before passing it to
GetGitBlobHash
. issue with this is that while the file could exist in disk and still return a correct folder stats, this should still return a file not found error as the path given by the user is still incorrect. (am opting for this rather than normalising the path in the functionGetGitBlobHash
to keep the implementation ofGetGitBlobHash
lean and not think about potential path traversal attacks.)for the special files, we simply do a filter to ignore files like
.git
. We know that the class of errors with the stringsexists on disk
andError when getting Git blob hash
mostly comes from the .git folder by looking at dd logs here.ie all instances of
exists on disk
andError when getting Git blob hash
errors in prod for the past two weeks came from trying to access the .git folders.Breaking Changes
Before & After Screenshots
BEFORE:
AFTER:
Tests
in order to reproduce this:
fatal: path '.git' exists on disk, but not in 'HEAD'