kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io
4.03k stars 198 forks source link

CI: consistent naming of artifacts #1049

Closed GreyCat closed 1 year ago

GreyCat commented 1 year ago

Soliciting opinions on revisiting branch naming convention when we store artifacts in ci_artifacts. Right now, it's fairly inconsistent:

Some common patterns are seen:

Proposal: let's standardize everything consistently using the approach similar to the Docker images:

$TARGET$SUBTARGET/$IMPLEMENTATION-$OS-$ARCH

If necessary, we can even keep track of generated artifacts history by tagging these branches prior to commit.

Cc @generalmimon?

generalmimon commented 1 year ago

@GreyCat I didn't realize until now that this could be a problem, but I agree with you that it would be good to standardize this. It's true that the current naming pattern just uses the least effort to distinguish the individual subtargets of one target language based on what environments/implementations we currently test on. But there's a problem with this inconsistency when you've for example tested a particular target language only on Linux so far and now you want to test on other plaftorms, too - then you have to rename the existing branches (and Git doesn't even handle remote branch renaming very well, AFAIK you have to basically delete the existing branch and create a new one), and that's not very nice - of course it's better to have a consistent future-proof naming scheme from the start.

And obviously we may also want to test on other implementations of a target language, too - for example so far we only run Python on the CPython implementation, but we could also test on PyPy, etc.

Proposal: let's standardize everything consistently using the approach similar to the Docker images:

$TARGET$SUBTARGET/$IMPLEMENTATION-$OS-$ARCH

Looks good. I don't think we used dashes - before, but I see the benefit of using a different separator between the $IMPLEMENTATION/$OS/$ARCH parts because then you can keep the underscore _ reserved for when you want to separate multiple words within one part (which is kind of already being done for TARGET = cpp_stl, but it's not hard to imagine e.g. construct/python3.11_cpython-linux-x86_64, where IMPLEMENTATION = python3.11_cpython).

All in all, feel free to go ahead with it :)

Mingun commented 1 year ago

From my point of view I don't see any obvious unconsistences. For example, what the difference would be between artifacts of Java, Python or JavaScript targets for different OSes? All those languages already OS-independent so it seems just confusing to have different artifacts for them even if they are actually would be the same

generalmimon commented 1 year ago

@Mingun:

All those languages already OS-independent

If only this were true...

so it seems just confusing to have different artifacts for them even if they are actually would be the same

Confusing to whom? At https://ci.kaitai.io/ you normally don't even notice how many platforms/versions have run, unless there are differences in test results of individual subtargets (status "mixed" or "failed" with different stack traces).

Mingun commented 1 year ago

If only this were true...

And this is not (in context of Kaitai Struct, where the difference only could be in a file reading code)?

Confusing to whom?

To the maintainers who would try to understand what the difference of (figuratively) java-openjdk7-linux vs. java-openjdk7-windows. Remembering that during build there is only assemble jar, without difference, for windows or linux...

GreyCat commented 1 year ago

@GreyCat I didn't realize until now that this could be a problem, but I agree with you that it would be good to standardize this. It's true that the current naming pattern just uses the least effort to distinguish the individual subtargets of one target language based on what environments/implementations we currently test on. But there's a problem with this inconsistency when you've for example tested a particular target language only on Linux so far and now you want to test on other plaftorms, too - then you have to rename the existing branches (and Git doesn't even handle remote branch renaming very well, AFAIK you have to basically delete the existing branch and create a new one), and that's not very nice - of course it's better to have a consistent future-proof naming scheme from the start.

Well, my point is very simple: right now our publication to ci_artifacts from Docker images uses:

"${TARGET}${SUBTARGET}"/"$IMPL"

(see main.yml#L194), as this is very consistent with running using these three:

./docker-ci -t "$TARGET" -u "$SUBTARGET" -i "$IMPL"

(see main.yml#L179). This results in branches like cpp_stl_98/gcc11, which is very bare and ignores OS and arch, which is quite important.

Another point is that I want to add some automation that will pull current CI master results and compare fresh local results vs reference master, and, again, I feel bad about pulling it without reference to OS and arch.

Looks good. I don't think we used dashes - before, but I see the benefit of using a different separator between the $IMPLEMENTATION/$OS/$ARCH parts because then you can keep the underscore _ reserved for when you want to separate multiple words within one part (which is kind of already being done for TARGET = cpp_stl, but it's not hard to imagine e.g. construct/python3.11_cpython-linux-x86_64, where IMPLEMENTATION = python3.11_cpython).

Good point. I also want to keep it consistent with overall Docker image naming standard as much as possible.

Thanks for the feedback!

GreyCat commented 1 year ago

If only this were true...

And this is not (in context of Kaitai Struct, where the difference only could be in a file reading code)?

I don't have anything immediate to point a finger to (and that's the reason why we normally just run most of these OS-independent languages on Linux Docker only and that's it). However:

  1. We already have some very OS and implementation dependent targets, such as C++ and C#.
  2. Even relatively OS-independent languages have native-specific features in them that sometimes we might want to test. E.g. everything that maps structures directly in memory layout, or interfacing with native APIs.
  3. Even relatively OS-independent languages, in the end of the day, will most likely call a libc/WinAPI function, which will implement things in a way unique to the platform. External stuff (like encodings, compression, encryption) will likely have some degree of difference on how this works on different platforms.
generalmimon commented 1 year ago

@GreyCat:

2. Even relatively OS-independent languages have native-specific features in them that sometimes we might want to test. E.g. everything that maps structures directly in memory layout, or interfacing with native APIs.

Yes. In Kaitai Struct, we need to work with files and sometimes memory-mapped files. And files do have slight OS-specific differences in behavior. For example, file paths are case-sensitive on Unix and case-insensitive on Windows. On Unix, there isn't a difference between text and binary files (so even if you open a file in the default text mode, you can treat it as binary without problems), but if you open a file in text mode on Windows, some line ending conversions will happen (which is something you don't want for binary files), so you have to make sure to open files in binary mode if you intend to support Windows. On Windows, you cannot remove a file while it's open (on Linux you can, although AFAIR it'll be physically removed only after it's been closed). And in Python, pipes/sockets behave as non-seekable on Unix (as they should), but "seekable" on Windows (see https://github.com/python/cpython/issues/78961, https://github.com/python/cpython/issues/86768). And so on...

The point is that even these relatively OS-independent languages don't (and sometimes even can't) fully abstract you from the underlying differences of OS behavior.

3. Even relatively OS-independent languages, in the end of the day, will most likely call a libc/WinAPI function, which will implement things in a way unique to the platform. External stuff (like encodings, compression, encryption) will likely have some degree of difference on how this works on different platforms.

Exactly. On Windows it seems to be quite common to use the OS-specific means for character encoding conversions (i.e. with the infamous codepages) even in these relatively OS-independent languages, but this may mean a different set of encodings supported, slightly different behavior, etc.


When it comes to testing, I'm generally a fan of not making too many assumptions that something will work somewhere even if you don't test it (because this has been disproved many times) - I prefer to test as much as possible. Ideally, we should test each environment (i.e. language version, implementation, platform, architecture) we intend to support. It's not always feasible, but we can get close.

Of course, it's even more important to test languages that are specifically not OS-independent like C++ on different platforms (we have literally OS-specific code in our C++ runtime library, for example) than e.g. Java, but that doesn't mean I don't see any point in testing Java (and other languages) on different platforms too.

GreyCat commented 1 year ago

Went ahead and implemented the change:

  1. Renamed ci_artifacts branches which were produced by modern GH Actions pipeline (either dockerized or trad).
  2. Adjusted GH Actions pipeline to produce artifacts with new names: https://github.com/kaitai-io/ci_targets/commit/9c855e93b3b5b3c57f89847357ddbf2c03d1daa1
  3. Here's the GH Actions run: https://github.com/kaitai-io/ci_targets/actions/runs/5598729909/jobs/10238764894
  4. Adjusted CI page to show this all properly: https://github.com/kaitai-io/kaitai_ci_ui/commit/c2f65cce9c3a03ce4947264fd07171702ba25519
GreyCat commented 1 year ago

With almost full transition from Travis to GHA, this actually covers 100% of our targets/subtargets, they should be all in the same format.

The only one left is mono, which I would just remove for now.