Open gurupras opened 1 year ago
It'd be interesting to see if some spec dictates a "correct" behavior for this case, because I think Node.js behavior is "technically more correct" here. /cc @nodejs/url
Here is the part referencing the pathname state in URL specification: https://url.spec.whatwg.org/#path-state
Pathname can consist of unlimited slashes. Specification states that, for path state if entry is /
and: Otherwise, if buffer is not a single-dot URL path segment, then: Append buffer to url’s path.`
This means that URLs can have unlimited /
in their pathname. This is why the following succeeds:
> new URL('file://///////////some-server-cifs/vmgr/17/testdir/foo.txt')
URL {
href: 'file://///////////some-server-cifs/vmgr/17/testdir/foo.txt',
origin: 'null',
protocol: 'file:',
username: '',
password: '',
host: '',
hostname: '',
port: '',
pathname: '///////////some-server-cifs/vmgr/17/testdir/foo.txt',
search: '',
searchParams: URLSearchParams {},
hash: ''
}
I would expect either v8 to return valid URLs or NodeJS' url.fileURLToPath() to sanitize the "slightly malformed" URL and parse it correctly just like Chrome and Windows Explorer do.
According to the URL specification, this input is not malformed, and 100% valid.
FWIW, python3 is also perfectly happy to handle this URL
Python's default URL parser is not compliant to WHATWG. Node.js built-in URL parser (Ada), has a Python library which demonstrates couple of major differences: https://github.com/ada-url/ada-python
@anonrig I'd be more interested in a spec that tells us how to get a file path from a URL, which is the question here. From the OP, here's the current state of things (✅ means "the URL is interpreted as \\some-server-cifs\vmgr\17\testdir\foo.txt
", ❌ means it throws an error):
URL | Python | Node.js |
---|---|---|
file://some-server-cifs/vmgr/17/testdir/foo.txt |
✅ | ✅ |
file:////some-server-cifs/vmgr/17/testdir/foo.txt |
✅ | ❌ |
Is Node.js throwing in the second case something we actually want, or should we align with the behavior of Python?
Python uses a combination of RFC 3986 and WHATWG, but mainly to RFC 3986.
According to the urllib.parse
documentation: https://docs.python.org/3/library/urllib.parse.html
When you parse new URL('file://some-server-cifs/vmgr/17/testdir/foo.txt')
, it gives you a hostname of some-server-cifs
which results into a UNC path.
But when you parse new URL('file:////some-server-cifs/vmgr/17/testdir/foo.txt')
it returns a hostname of ''
because the starting character of pathname is /
(except for certain edge cases). Therefore, hostname returning empty string, therefore expects the second character of the pathname to be :
in order to accept it as a absolute URL.
hostname returning empty string, therefore expects the second character of the pathname to be
:
in order to accept it as a absolute URL.
You say "therefore", but I don’t see what’s the logical link between the two. Who said that UNC paths could not be represented with a file: URL that has an empty string for hostname? Is there a spec somewhere or is it something Node.js came up with?
According to WHATWG, both URLs are valid, but fileURLToPath
expects either a URL with a hostname (such as UNC), or a URL that consists of a Windows drive letter (which is something like file:///c:/
) for Windows (not for posix).
This is the flow chart for parsing file:////hello
in WHATWG spec:
flowchart LR
file_state --> file_slash_state
file_slash_state --> file_host_state
file_host_state --> path_start_state
path_start_state --> path_state
According to my understanding, file:////Hello
is not a valid path because it does not start with a Windows drive letter, since //Hello
is not a valid path in Windows.
What I'm having trouble understanding is, can Windows have a file at ////some-server-cifs/
path?
What I'm having trouble understanding is, can Windows have a file at ////some-server-cifs/ path?
I don't think the path is valid. I don't know how to get C/C++ running on Windows, but Golang reports the path is invalid.
Not sure if this helps, but Python has the following code to deal with paths like these:
if url[:4] == '////':
# path is something like ////host/path/on/remote/host
# convert this to \\host\path\on\remote\host
# (notice halving of slashes at the start of the path)
url = url[2:]
This code is invoked from url2pathname
called from open_local_file
I think that we have established that the URL is valid as per WHATWG.
It would be interesting how you end up with this URL in the first place?
Like I mentioned in the original post, these are URLs generated and returned by v8's Profiler.takePreciseCoverage
when the files in question are hosted on a networked filesystem.
We ran into this while using Jest to run tests and capture code-coverage for code that exists on a networked filesystem. For code-coverage, our Jest configuration uses v8
. Internally, jest outsources the v8 code-coverage logic to a lightweight library, collect-v8-coverage, that wraps inspector
calls. When stepping through Jest source-code, I noticed that the URLs were the result of collect-v8-coverage's stopInstrumenting()
API, which internally calls Profiler.takePreciseCoverage
.
According to my understanding,
file:////Hello
is not a valid path because it does not start with a Windows drive letter, since//Hello
is not a valid path in Windows.
I don't think this is true, because while //Hello
is not a valid path in Windows, \\Hello
is.
Some background here. There's essentially no standard that's written down anywhere regarding Windows file
URL parsing. The position that WHATWG's URL spec has taken is to represent Windows paths as follows:
C:\abc → file:///C:/abc
\\host\path → file://host/path
Since Node.js uses the WHATWG interpretation mostly, that's what getPathFromURLWin32
does also.[^1]
However, it seems like V8's Profiler.takePreciseCoverage
returns a file URL that doesn't conform to the WHATWG style, though Chrome and Windows Explorer (which I'm guessing uses some variant of PathCreateFromUrlW
) do support it.
As such, there are essentially two options:
getPathFromURLWin32
to be more lenient in parsing file URLs (like Python), orBoth would be fine with me. However, considering:
file
URLs, so V8 might require vastly different workarounds compared to another application that uses file
URLs slightly differently,I believe the second option is probably better.
[^1]: I believe the main reason behind WHATWG's decision is the desire to avoid platform-dependent behavior in the parser. Chrome uses an entirely separate file
URL parser when it's running on Windows vs. POSIX systems.
Thanks @TimothyGu.
If Node.js goes with option 2, then I suggest it should be documented (in the Node.js public documentation).
cc @anonrig
Version
16.15.1
Platform
Microsoft Windows NT 10.0.19045.0 x64
Subsystem
url
What steps will reproduce the bug?
This might be a bug in v8 rather than node's
url
module, but I'm reporting here first since I'm not sufficiently knowledgeable in C++ or with the URL spec to determine if the bug is in v8 or where it is.On networked filesystems on Windows, v8's
Profiler.takePreciseCoverage
ends up returning URLs that are of the formfile:////some-server-cifs/<path-to-js-file>
. While the URL appears to have 'too many' slashes, Chrome and Windows Explorer are perfectly happy to resolve this. However, node throws some errors.On Windows, create a file,
test.mjs
, that contains:Now run
node test.mjs
.How often does it reproduce? Is there a required condition?
The code provided will reproduce this issue every time on Windows
What is the expected behavior? Why is that the expected behavior?
I would expect either v8 to return valid URLs or NodeJS'
url.fileURLToPath()
to sanitize the "slightly malformed" URL and parse it correctly just like Chrome and Windows Explorer do.What do you see instead?
Additional information
If it is deemed that there is some kind of bug in NodeJS regarding this, I'm hoping
url.fileURLToPath
can handle this gracefully.FWIW, python3 is also perfectly happy to handle this URL