publicsuffix / list

The Public Suffix List
https://publicsuffix.org/
Mozilla Public License 2.0

Add versioning information to "public_suffix_list.dat" file #1808

Open TurtleWilly opened 1 year ago

TurtleWilly commented 1 year ago

It would be nice to have some sort of (automatic) versioning information directly inside the "public_suffix_list.dat" file. Currently it is practically impossible to determine which file is the most current from a set of multiple "public_suffix_list.dat" files on disk. This could probably also be useful for libpsl to determine which copy is the "latest".

With CVS or SVN we could add // $Id$ as the first line of the file and the problem would solve itself (svn may need a propset depending on the configuration). The source control system would then automatically insert the current version and/or date during the checkout. (I'm not too familiar with git and whether it has a similar feature.)

eli-schwartz commented 1 year ago

You can do the same thing in git with $Format:%cs$, where %cs is the format code that embeds a YYYY-MM-DD style timestamp of the commit date (not the checkout date).

There are no tags so git describe can't be used with any degree of accuracy.

dnsguru commented 1 year ago

@smarnach is this possible?

weppos commented 1 year ago

Git doesn't ship with an $id$ equivalent feature. Instead, you are encouraged to leverage SHAs generated by Git itself.

In order to embed external information, like the SHA or any other ID, we would need to pre-process the file before it is committed. This is generally the responsibility of a CI/pipeline that we don't have.

I am not inclined to add such complexity in the file itself when this is within the repo, as it would be redundant since we can leverage git.

Ideally, the tagging should happen in the pipeline that processes the list for distribution at https://publicsuffix.org/list/public_suffix_list.dat

Although these days I even question whether we still need such a distribution mechanism, and whether we shouldn't instead just rely on Git hosting.

For consumers that need/want version tagging, the current solution would be to switch to pulling the list directly from the repo. I've actually been doing it for years in the library I maintain; here's an example:

https://github.com/weppos/publicsuffix-go/commit/a20f9abcc222b049ef9b7a28845bac88e0155ae3

https://github.com/weppos/publicsuffix-go/blob/a20f9abcc222b049ef9b7a28845bac88e0155ae3/publicsuffix/generator/gen.go#L24-L49

dnsguru commented 1 year ago

I believe that the .dat file instructs that it only be pulled from the publicsuffix.org URL in order to utilize CDN/cloud services.

Taking note here of the value of this suggestion, I wonder if we couldn't add automation that adds a date to the file itself in plaintext within the initial header comment section when merging.

I believe this would be valuable towards Universal Acceptance.

For example, a date would make it abundantly clear to someone how stale their list is if they incorporate it statically in their use of the list.

Take https://github.com/publicsuffix/list/issues/1807 as an example: WhatsApp would know more clearly that the copy of the PSL they have in use is 8 years old, from 2015.
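
To illustrate the staleness point, here is a minimal sketch of how a consumer could act on a plaintext date once it has been read from the file. The function names and the one-year threshold are hypothetical and chosen only for illustration; nothing here is part of the PSL or libpsl.

```python
from datetime import date


def age_in_days(stamp: str, today: date) -> int:
    """Return how many days old a YYYY-MM-DD date stamp is on `today`."""
    return (today - date.fromisoformat(stamp)).days


def is_stale(stamp: str, today: date, max_age_days: int = 365) -> bool:
    """Report whether the stamp is older than max_age_days.

    The default threshold is an arbitrary assumption for this sketch.
    """
    return age_in_days(stamp, today) > max_age_days
```

A caller would pass the date found in its bundled copy of the list and could log a warning or refuse to start when `is_stale` returns True.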

smarnach commented 1 year ago

Cloud Storage returns the date the list was last modified in the Last-Modified header, so anyone is free to post-process the file when downloading it via the CDN. It would also be easy to modify the deployment workflow to include the date in the file when uploading the data. From an operational point of view, I don't have any concerns about doing this, so it's up to you to make the call here, @weppos and @dnsguru. I'm happy to make the required changes if you want me to.
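
As a rough sketch of the post-processing described above, the following turns a Last-Modified header value into a date stamp and prepends it to the downloaded text. It assumes the header value has already been fetched by whatever HTTP client is in use; the function names and the "// Date updated:" comment format are assumptions for illustration, not anything the deployment workflow currently emits.

```python
from email.utils import parsedate_to_datetime


def date_stamp_from_last_modified(header_value: str) -> str:
    """Convert an HTTP date such as 'Tue, 01 Aug 2023 08:44:00 GMT' to YYYY-MM-DD."""
    return parsedate_to_datetime(header_value).date().isoformat()


def prepend_date_comment(list_text: str, header_value: str) -> str:
    """Return the list text with a '// Date updated:' comment as its first line.

    The comment wording is a hypothetical format, not an agreed convention.
    """
    stamp = date_stamp_from_last_modified(header_value)
    return f"// Date updated: {stamp}\n{list_text}"
```

Because the added line starts with //, it reads as an ordinary comment to parsers of the list format.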

eli-schwartz commented 1 year ago

> Git doesn't ship with an $id$ equivalent feature. Instead, you are encouraged to leverage SHAs generated by Git itself.

I specifically pointed out that it does indeed do precisely this. It's part of the git-archive(1) machinery, which is, for example, what GitHub uses to generate https://github.com/publicsuffix/list/archive/refs/heads/master.tar.gz

It doesn't affect git clones, although you could invoke that machinery pretty easily:

git archive HEAD <filename> | bsdtar -x -C path/to/output/directory -f -

dnsguru commented 1 year ago

Because the gTLD list from ICANN's JSON has a timestamp in it, and that's the most often updated element, I'd assert that "Solution Exists" if one were to track that as the last date. It does not account for deltas that occur between auto-pulls from ICANN, but due to the frequency of those, and their priority of processing ahead of subdomain projects, this works itself out relatively well.

dnsguru commented 1 year ago

> Cloud Storage returns the date the list was last modified in the Last-Modified header, so anyone is free to post-process the file when downloading it via the CDN. It would also be easy to modify the deployment workflow to include the date in the file when uploading the data. From an operational point of view, I don't have any concerns about doing this, so it's up to you to make the call here, @weppos and @dnsguru. I'm happy to make the required changes if you want me to.

In reviewing #1855 / #1856 - in order to avoid confusion about versions in security reports, which would further drain disposable volunteer resources in hunting, we may want to tie doing these things together:

I have seen salient arguments for doing both and also for doing neither, but it seems a datestamp would be a prerequisite if we were to implement a security policy.

eli-schwartz commented 1 year ago

Would you be interested in an implementation of the git-archive side of this on the theory that it causes no harm to have this literal text in the file:

// this is not guaranteed to be updated, but will contain either "$Format" or else a YYYY-MM-DD timestamp
// Date updated: $Format:%cs$

and under some conditions, at least, it would be a benefit since it would actually contain:

// this is not guaranteed to be updated, but will contain either "$Format" or else a YYYY-MM-DD timestamp
// Date updated: 2023-10-02
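
A consumer of such a header would have to handle both cases: the expanded date and the literal placeholder. The following is a minimal sketch of that, assuming the "// Date updated:" wording proposed above; the function name is hypothetical, and the scan stops at the first rule line on the assumption that the stamp lives in the leading comment block.

```python
import re
from datetime import date
from typing import Optional

# Matches the proposed comment line; the value is either a date or "$Format:%cs$".
_DATE_LINE = re.compile(r"^//\s*Date updated:\s*(\S+)\s*$")


def read_date_stamp(list_text: str) -> Optional[date]:
    """Return the embedded date, or None if it is absent or still unexpanded."""
    for line in list_text.splitlines():
        if not line.startswith("//"):
            if line.strip():
                break  # first rule line reached: the header comments are over
            continue  # blank line inside the header
        match = _DATE_LINE.match(line)
        if match is None:
            continue
        value = match.group(1)
        if value.startswith("$Format"):
            return None  # placeholder was never substituted (e.g. a plain clone)
        try:
            return date.fromisoformat(value)
        except ValueError:
            return None  # something else entirely; treat as unknown
    return None
```

Given several copies on disk, the newest could then be picked by comparing the returned dates and skipping any copy for which the result is None.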