nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
106.62k stars 29.07k forks source link

Possible bug in node:internal/url getpathfromurlwin32 #48530

Open gurupras opened 1 year ago

gurupras commented 1 year ago

Version

16.15.1

Platform

Microsoft Windows NT 10.0.19045.0 x64

Subsystem

url

What steps will reproduce the bug?

This might be a bug in v8 rather than node's url module, but I'm reporting here first since I'm not sufficiently knowledgeable in C++ or with the URL spec to determine if the bug is in v8 or where it is.

On networked filesystems on Windows, v8's Profiler.takePreciseCoverage ends up returning URLs that are of the form file:////some-server-cifs/<path-to-js-file>. While the URL appears to have 'too many' slashes, Chrome and Windows Explorer are perfectly happy to resolve this. However, node throws some errors.

On Windows, create a file, test.mjs, that contains:

import { fileURLToPath } from 'url'

const url = 'file:////some-server-cifs/vmgr/17/testdir/foo.txt'
const p = fileURLToPath(url)

Now run node test.mjs.

How often does it reproduce? Is there a required condition?

The code provided will reproduce this issue every time on Windows

What is the expected behavior? Why is that the expected behavior?

I would expect either v8 to return valid URLs or NodeJS' url.fileURLToPath() to sanitize the "slightly malformed" URL and parse it correctly just like Chrome and Windows Explorer do.

What do you see instead?

node:internal/errors:465
    ErrorCaptureStackTrace(err);
    ^

TypeError [ERR_INVALID_FILE_URL_PATH]: File URL path must be absolute
    at new NodeError (node:internal/errors:372:5)
    at getPathFromURLWin32 (node:internal/url:1393:11)
    at fileURLToPath (node:internal/url:1423:22)
    at file:///S:/test.mjs:4:11
    at ModuleJob.run (node:internal/modules/esm/module_job:198:25)
    at async Promise.all (index 0)
    at async ESMLoader.import (node:internal/modules/esm/loader:385:24)
    at async loadESM (node:internal/process/esm_loader:88:5)
    at async handleMainPromise (node:internal/modules/run_main:61:12) {
  code: 'ERR_INVALID_FILE_URL_PATH'
}

Additional information

If it is deemed that there is some kind of bug in NodeJS regarding this, I'm hoping url.fileURLToPath can handle this gracefully.

FWIW, python3 is also perfectly happy to handle this URL

from urllib import request
data = request.urlopen('file:////some-server-cifs/vmgr/17/testdir/foo.txt')
for line in data:
    print(line)
aduh95 commented 1 year ago

It'd be interesting to see if some spec dictates a "correct" behavior for this case, because I think Node.js behavior is "technically more correct" here. /cc @nodejs/url

anonrig commented 1 year ago

Here is the part referencing the pathname state in URL specification: https://url.spec.whatwg.org/#path-state

Pathname can consist of unlimited slashes. Specification states that, for path state if entry is / and: Otherwise, if buffer is not a single-dot URL path segment, then: Append buffer to url’s path.`

This means that URLs can have unlimited / in their pathname. This is why the following succeeds:

> new URL('file://///////////some-server-cifs/vmgr/17/testdir/foo.txt')
URL {
  href: 'file://///////////some-server-cifs/vmgr/17/testdir/foo.txt',
  origin: 'null',
  protocol: 'file:',
  username: '',
  password: '',
  host: '',
  hostname: '',
  port: '',
  pathname: '///////////some-server-cifs/vmgr/17/testdir/foo.txt',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}

I would expect either v8 to return valid URLs or NodeJS' url.fileURLToPath() to sanitize the "slightly malformed" URL and parse it correctly just like Chrome and Windows Explorer do.

According to the URL specification, this input is not malformed, and 100% valid.

FWIW, python3 is also perfectly happy to handle this URL

Python's default URL parser is not compliant to WHATWG. Node.js built-in URL parser (Ada), has a Python library which demonstrates couple of major differences: https://github.com/ada-url/ada-python

aduh95 commented 1 year ago

@anonrig I'd be more interested in a spec that tells us how to get a file path from a URL, which is the question here. From the OP, here's the current state of things (✅ means "the URL is interpreted as \\some-server-cifs\vmgr\17\testdir\foo.txt", ❌ means it throws an error):

URL Python Node.js
file://some-server-cifs/vmgr/17/testdir/foo.txt
file:////some-server-cifs/vmgr/17/testdir/foo.txt

Is Node.js throwing in the second case something we actually want, or should we align with the behavior of Python?

anonrig commented 1 year ago

Python uses a combination of RFC 3986 and WHATWG, but mainly to RFC 3986.

According to the urllib.parse documentation: https://docs.python.org/3/library/urllib.parse.html

When you parse new URL('file://some-server-cifs/vmgr/17/testdir/foo.txt'), it gives you a hostname of some-server-cifs which results into a UNC path.

But when you parse new URL('file:////some-server-cifs/vmgr/17/testdir/foo.txt') it returns a hostname of '' because the starting character of pathname is / (except for certain edge cases). Therefore, hostname returning empty string, therefore expects the second character of the pathname to be : in order to accept it as a absolute URL.

aduh95 commented 1 year ago

hostname returning empty string, therefore expects the second character of the pathname to be : in order to accept it as a absolute URL.

You say "therefore", but I don’t see what’s the logical link between the two. Who said that UNC paths could not be represented with a file: URL that has an empty string for hostname? Is there a spec somewhere or is it something Node.js came up with?

anonrig commented 1 year ago

According to WHATWG, both URLs are valid, but fileURLToPath expects either a URL with a hostname (such as UNC), or a URL that consists of a Windows drive letter (which is something like file:///c:/) for Windows (not for posix).

This is the flow chart for parsing file:////hello in WHATWG spec:

flowchart LR
    file_state --> file_slash_state
    file_slash_state --> file_host_state
    file_host_state --> path_start_state
    path_start_state --> path_state

According to my understanding, file:////Hello is not a valid path because it does not start with a Windows drive letter, since //Hello is not a valid path in Windows.

What I'm having trouble understanding is, can Windows have a file at ////some-server-cifs/ path?

gurupras commented 1 year ago

What I'm having trouble understanding is, can Windows have a file at ////some-server-cifs/ path?

I don't think the path is valid. I don't know how to get C/C++ running on Windows, but Golang reports the path is invalid.

Not sure if this helps, but Python has the following code to deal with paths like these:

if url[:4] == '////':
    # path is something like ////host/path/on/remote/host
    # convert this to \\host\path\on\remote\host
    # (notice halving of slashes at the start of the path)
    url = url[2:]

This code is invoked from url2pathname called from open_local_file

lemire commented 1 year ago

I think that we have established that the URL is valid as per WHATWG.

It would be interesting how you end up with this URL in the first place?

gurupras commented 1 year ago

Like I mentioned in the original post, these are URLs generated and returned by v8's Profiler.takePreciseCoverage when the files in question are hosted on a networked filesystem.

We ran into this while using Jest to run tests and capture code-coverage for code that exists on a networked filesystem. For code-coverage, our Jest configuration uses v8. Internally, jest outsources the v8 code-coverage logic to a lightweight library, collect-v8-coverage, that wraps inspector calls. When stepping through Jest source-code, I noticed that the URLs were the result of collect-v8-coverage's stopInstrumenting() API, which internally calls Profiler.takePreciseCoverage.

ddurschlag commented 1 year ago

According to my understanding, file:////Hello is not a valid path because it does not start with a Windows drive letter, since //Hello is not a valid path in Windows.

I don't think this is true, because while //Hello is not a valid path in Windows, \\Hello is.

TimothyGu commented 1 year ago

Some background here. There's essentially no standard that's written down anywhere regarding Windows file URL parsing. The position that WHATWG's URL spec has taken is to represent Windows paths as follows:

C:\abc      → file:///C:/abc
\\host\path → file://host/path

Since Node.js uses the WHATWG interpretation mostly, that's what getPathFromURLWin32 does also.[^1]

However, it seems like V8's Profiler.takePreciseCoverage returns a file URL that doesn't conform to the WHATWG style, though Chrome and Windows Explorer (which I'm guessing uses some variant of PathCreateFromUrlW) do support it.

As such, there are essentially two options:

  1. Either Node.js can loosen up getPathFromURLWin32 to be more lenient in parsing file URLs (like Python), or
  2. Consumers of V8's coverage data can rewrite URLs into the WHATWG form, to work with Node.js.

Both would be fine with me. However, considering:

I believe the second option is probably better.

[^1]: I believe the main reason behind WHATWG's decision is the desire to avoid platform-dependent behavior in the parser. Chrome uses an entirely separate file URL parser when it's running on Windows vs. POSIX systems.

lemire commented 1 year ago

Thanks @TimothyGu.

If Node.js goes with option 2, then I suggest it should be documented (in the Node.js public documentation).

cc @anonrig