kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.

http://kaldi-asr.org

Versioning #1179

Closed danpovey closed 4 years ago

danpovey commented 7 years ago

People often ask about Kaldi versioning, but the continuous, many-threaded nature of Kaldi development makes it hard to meaningfully use a conventional version number. People have also commented that it's a hassle that you cannot get the version number from compiled Kaldi binaries.

This 'issue' is to sketch out a possible solution for this, and to get comments. This won't necessarily be done right away, or at all.

Firstly, I don't propose to do conventional versioning where we release meaningfully numbered versions of Kaldi-- it's too much work to maintain. But a reasonable compromise would be to just mark a new version number every week at a certain time. The next commit after that time would [via a git-hook] be marked as that version, say version 1001, and commits after that [or commits derived from repositories whose git hash did not match the git hash of the clean version number] would be indicated as differing by a certain number of commits from the base version number, e.g. version 1001+15. [We'd have to figure out precisely what git command to derive this information from.] The idea is that you'd type, for instance,

copy-feats --version

and it would print out:

copy-feats (Kaldi) version 1001, git hash ac48cfb

or for a non-clean version:

copy-feats (Kaldi) version 1001+2, git hash bcd8f92

or if the src/ directory had non-checked-in changes to checked-in files [i.e. dirty index], it would print

copy-feats (Kaldi) version 1001+2, git hash bcd8f92 [15 file changes not tracked]

We could also consider incorporating the date of compilation somehow.
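The three output shapes above can be produced by one small formatting routine. The following is only a sketch of that formatting in Python; the function name `format_version_string` and its parameters are hypothetical and not part of Kaldi.

```python
def format_version_string(program, base_version, commits_ahead, git_hash,
                          changed_files=0):
    """Build the line a hypothetical --version option might print.

    All names here are illustrative assumptions, not Kaldi code.
    """
    version = str(base_version)
    if commits_ahead > 0:
        # e.g. "1001+2": two commits past the weekly base version
        version += f"+{commits_ahead}"
    text = f"{program} (Kaldi) version {version}, git hash {git_hash}"
    if changed_files > 0:
        # tracked files with uncommitted modifications
        text += f" [{changed_files} file changes not tracked]"
    return text


print(format_version_string("copy-feats", 1001, 0, "ac48cfb"))
print(format_version_string("copy-feats", 1001, 2, "bcd8f92"))
print(format_version_string("copy-feats", 1001, 2, "bcd8f92", changed_files=15))
```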

To avoid triggering recompilation of all the object files, this information would have to be not in a globally accessible header, but one only visible to a .cc file. For example, it could be put somewhere like base/version.h, which would only be included by a .cc file or files in base/. This file would be recreated every time 'make' was run, by a script. The script would read, say, 'src/.version', which would contain the version number and the git hash (e.g. "1001 ac48cfb"), and would then attempt to run git commands to figure out how far the git version was from the hash in src/.version, and similar information, and put the version string in base/version.h, as something like

static const char *version_string = "version 1001, git hash ac48cfb";

It would write to a temporary file and copy it to base/version.h only if the contents had changed (to prevent triggering unnecessary recompilation). In kaldi-error.h there would be a function declared

const char *GetVersionString();

that would return this string, for use elsewhere. The code in util/parse-options.{h,cc} would be changed to support an extra standard option --version, and would modify the message printed in --help to include the version number.
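The "only update the header if its contents changed" step is what keeps make from rebuilding things needlessly, so it may be worth spelling out. Below is a minimal Python sketch of that step, assuming the generated header contains just the one `version_string` line shown above. The helper names `render_version_header` and `write_if_changed` are hypothetical. It is a slight variation on the description: it compares the new text with the existing file first, and only then writes a temporary file and renames it into place.

```python
import os
import tempfile


def render_version_header(version_string):
    """Return the text of the generated header (layout is illustrative)."""
    return ('// Generated at build time; do not edit.\n'
            f'static const char *version_string = "{version_string}";\n')


def write_if_changed(path, content):
    """Write content to path only if it differs from what is already there.

    Returns True if the file was (re)written and False if it was left alone,
    in which case its modification time is untouched and make has nothing
    new to react to.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            if f.read() == content:
                return False
    except FileNotFoundError:
        pass
    # Write to a temporary file in the same directory, then rename it over
    # the target so a half-written header is never visible to the compiler.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(content)
    os.replace(tmp_path, path)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        header = os.path.join(d, "version.h")
        text = render_version_header("version 1001, git hash ac48cfb")
        print(write_if_changed(header, text))  # first call creates the file
        print(write_if_changed(header, text))  # second call leaves it alone
```

In a real build the script would be invoked by make before compiling base/, and the path would be base/version.h rather than a temporary directory.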

An outstanding issue is how to have the version number printed out so that it appears in log files the right amount, but not too much. One possibility is to modify the code that automatically prints out the command line used, so that instead of printing, say,

copy-matrix ark:- ark:-

it would print

copy-matrix ark:- ark:-  # version 1001, git hash ac48cfb

(The hash mark means that if you copy and run that command line, it will still work).
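To illustrate why the trailing comment is harmless, here is a small Python sketch that builds such an annotated line and then tokenizes it the way a POSIX shell would, using the standard `shlex` module. The function `annotate_command_line` is hypothetical; it is not how Kaldi's option-parsing code echoes the command line.

```python
import shlex


def annotate_command_line(argv, version_string):
    """Return the echoed command line with the version as a shell comment."""
    return " ".join(shlex.quote(a) for a in argv) + "  # " + version_string


line = annotate_command_line(["copy-matrix", "ark:-", "ark:-"],
                             "version 1001, git hash ac48cfb")
print(line)

# With comments=True, shlex drops everything from the unquoted '#' onward,
# mirroring how a shell treats the line when it is pasted back in.
print(shlex.split(line, comments=True))
```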

Another possibility is to modify the MessageLogger::HandleMessage function so that if this is the first ever message printed by the program, you print a special LOG message with the version number info. But I think the above method might be cleaner.

An issue with this is that every time the version string changes (which might be often), if the person had configured without the --shared option, re-linking would have to be done in every single bin/ directory. That is inconvenient, and is the main drawback of this whole approach. To remedy this we could switch to recommending use of the --shared option for default Kaldi installation (by changing /INSTALL and the order of suggested command lines in /src/configure).

sikoried commented 7 years ago

It would be great to have the git commit hashes built into the (debug) log output. I'm not sure if (automated) versioning makes a lot of sense, but maybe irregular manual tags for the (i++)th version would make sense, if chosen somewhat thoughtfully (e.g.: code & recipes for certain feature(s) added & tested, recipe for a language added & tested).

Korbinian.

danpovey commented 7 years ago

The issue with including just the git hashes is that if someone has made any local change whatsoever, the git hash becomes meaningless. So the version number gives us at least some indication of which Kaldi version it was modified from.

yifan commented 7 years ago

It is a good idea. Although you oppose a meaningful version, I do think it is valuable for users. One way to get it is to use tags, so that developers don't have to care, while end users can use release tags to decide whether they should update their setup to the latest version or continue using what they have installed. At least the version should be able to tell the difference between bugfixes and other changes (although that can be difficult in many cases). Printing version information when printing out the command line looks the cleanest in my opinion. What about separate versioning for each component? Kaldi is very modular, and work is done in different components all the time. Having a component-based version may work better than a global version in my opinion.

francisr commented 7 years ago

If you include the git hash, you can also check if the file has been modified and write this information next to the hash.

ngoel17 commented 7 years ago

These days, with multi-threaded compilation, Kaldi can compile within five minutes. However, five minutes can be a lot for someone in debugging mode. So how about we add a configuration setting --no-versions=true, so that if someone is in debugging mode and doesn't want to recompile Kaldi, but wants to make a number of git commits, they can disable the versions feature in configure; the .version file will then not be updated by make, so only the modified portions will recompile.

KarelVesely84 commented 7 years ago

Hi, this is interesting. IMHO the versioning should be as simple as possible (i.e. automatically driven), while being sufficiently good for the most typical use cases. The most likely scenario can be as follows: a user has 2 binaries ('old' one and 'current' one), he sees that these produce different outputs and wants to compare the code...

To support this, there could be a server hook for the master branch in the main repo that fires whenever its 'HEAD' changes. This would update base/version.h, so that the SHA of the last common ancestor will always be marked in any branch/fork for the purpose of back-tracking the changes in the main repo. IMHO the version number '1001' would be hard to interpret; instead there could be a date in YYYY-MM-DD format, giving a rough idea of when the last synchronization with the main branch took place.

This intentionally does not cover the local changes, as this is out of control. The beauty is in the simplicity... Perhaps there are more similar scenarios?

The code can then be compared with: git diff SHA1 SHA2 /path

jtrmal commented 7 years ago

What would be the rationale of using static libs nowadays? An easy solution might include creating a separate lib, say kaldi-version, which would be then linked separately to binaries/executables only. That might speed up the compilation. Also, depending on the reasons for using the static libraries, the kaldi-version could be always linked dynamically (i.e. even for static setup).

== Just a couple of ideas ==

Another option might be creating just a separate executable "kaldi-version" that could be called in some scripts to address the issue with informative yet non-obtrusive versioning info in the scripts.
Another option could be detecting if STDIN is a pipe (that should be manageable on Linux using fstat and/or isatty) and in that case avoid printing the versioning info -- that would lead to only the first command in a pipeline printing the version. Notice that I say STDIN, not the input file the program reads from or whatever file it treats as its input. Handling the latter would need non-trivial modification of the code of the executables.
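The pipe check mentioned in the last option is a single fstat call. Here is a minimal Python sketch of it (Kaldi itself would do this in C++ with the same system call). The function names and the "print only when stdin is not a pipe" policy are assumptions for illustration.

```python
import os
import stat


def fd_is_pipe(fd):
    """Return True if the file descriptor refers to a pipe (FIFO)."""
    return stat.S_ISFIFO(os.fstat(fd).st_mode)


def should_print_version(fd=0):
    """Hypothetical policy: print version info only if stdin is not a pipe.

    fd defaults to 0, the standard input descriptor.
    """
    return not fd_is_pipe(fd)


# Small demonstration on a pipe created in-process.
read_end, write_end = os.pipe()
print(fd_is_pipe(read_end))
os.close(read_end)
os.close(write_end)
```

Note that this only looks at standard input: a program that reads its data from a named file but sits in the middle of a larger script would still print the version.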

== Versions ==

I second the idea of having just an automatic checkpoint created every week or so.
To me, the unclear thing is how to make the versioning info meaningful.

danpovey commented 7 years ago

A specific proposal regarding versioning:

What I am leaning towards is having a conventional version number (like 10.1.2), but the last element of the version number would be automatically updated via git-hooks. For example, there would be a file like base/.version that would contain something like 10.1, which we'd manually increment when needed, and also base/.full_version that would be updated by a post-commit hook on github to, for example, 10.1.3 when we committed something to master [and would get reset to zero when we incremented the major version.]

We could keep a documentation page that would list, in a convenient form, information about the versions and the major changes, and the dates they correspond to [perhaps with expandable elements for listing the minor commits.] We'd have to figure out a combination of automatic and manual updates for this page, so that we'd only need to manually update it when we changed base/.version.
The version-number info would be printed by each program as part of the first log, error or warning message printed by that program (but not printed if no logs were printed), and we'd arrange things such that only one dynamic library had to be recompiled for this (e.g. kaldi-base.so, which is pretty small anyway)... note: I think updating a dynamic library does not force recompilation of binary files unless the headers change.

E.g.,

LOG (gmm-acc-stats-ali:main():gmm-acc-stats-ali.cc:105) Processed 50 utterances; for utterance sw03279-B_024889-025633 avg. like is -54.8654 over 742 frames.
LOG (gmm-acc-stats-ali:main():gmm-acc-stats-ali.cc:105) Processed 100 utterances; for utterance sw03280-B_023150-024242 avg. like is -54.2167 over 1090 frames.

would become:

LOG (gmm-acc-stats-ali[10.2.21+4~5]:main():gmm-acc-stats-ali.cc:105) Processed 50 utterances; for utterance sw03279-B_024889-025633 avg. like is -54.8654 over 742 frames.
LOG (gmm-acc-stats-ali:main():gmm-acc-stats-ali.cc:105) Processed 100 utterances; for utterance sw03280-B_023150-024242 avg. like is -54.2167 over 1090 frames.

The meaning of the +4~5 (and I haven't thought about this very carefully) would be as follows:

the +4 means that this repository is 4 commits ahead of the latest official version from the kaldi-asr project that it was merged with (i.e. 4 commits ahead of 10.2.21 in this case; these would be local commits).
the ~5 means that there are 5 tracked files with uncommitted changes within the src/ subdirectory.
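If log lines carried a tag like 10.2.21+4~5, scripts that analyse logs would need to take it apart again. The sketch below parses and re-creates that format in Python. It assumes both suffixes are optional and always appear in the order "+N" then "~N", which the proposal does not spell out. The names `KaldiVersion`, `parse_version` and `format_version` are hypothetical.

```python
import re
from typing import NamedTuple


class KaldiVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    commits_ahead: int  # the "+N" part, 0 if absent
    dirty_files: int    # the "~N" part, 0 if absent


# Assumed grammar: MAJOR.MINOR.PATCH, then optional +N, then optional ~N.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\+(\d+))?(?:~(\d+))?$")


def parse_version(text):
    """Parse a string such as '10.2.21+4~5' into a KaldiVersion."""
    m = _VERSION_RE.match(text)
    if m is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch, ahead, dirty = m.groups()
    return KaldiVersion(int(major), int(minor), int(patch),
                        int(ahead or 0), int(dirty or 0))


def format_version(v):
    """Inverse of parse_version; zero-valued suffixes are omitted."""
    text = f"{v.major}.{v.minor}.{v.patch}"
    if v.commits_ahead:
        text += f"+{v.commits_ahead}"
    if v.dirty_files:
        text += f"~{v.dirty_files}"
    return text


print(parse_version("10.2.21+4~5"))
```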

danpovey commented 7 years ago

Hm, I'm wondering whether my proposal is even possible. Are git post-commit hooks allowed to change the contents of commits, or make their own commits? I doubt this now. One possibility is to only update the major/minor version number and record its git hash, and to have the '+4' in my example record how far ahead of that we are; but that fails to distinguish between local and upstream changes. Also, if git post-commit hooks aren't allowed to change the contents of the repo, that would cause problems for the alternate proposals too, e.g. the ones based on dates. Someone else more knowledgeable about github may be able to enlighten us. @kkm000, we haven't heard from you for a while...

vdp commented 7 years ago

Leaving the technicalities aside for a moment, as far as I can tell from the comments so far, the potential uses for the versioning fall into three main categories:

diagnostics: provides a means to determine the version of the sources, that was used to compile a given binary, for debugging purposes
features-info: enables users to look up if a given feature is supported in the version they are running, and then say, decide if an upgrade is worth the effort
stability: maintenance of a non-cutting-edge, but stable Kaldi version

The diagnostics aspect is probably of limited utility for the end users, aside from use cases such as what Karel mentions where you wonder why the heck is this running OK here but is failing on this other server. It is also probably the most important aspect of the three from the maintainers' point of view(e.g. when someone complains on the mailing list).

The feature-info facet is probably of least practical importance from a day-to-day operations view, but if it can be implemented efficiently it would be a nice-to-have feature.

Arguably the stability aspect would probably be the most useful for both companies and researchers, because it would provide something that they can use with more confidence than just the latest version from the repo. What I have in mind as a possible implementation here is, e.g., making a stable version branch whose feature set is frozen and which only incorporates bugfixes when a problem is discovered during the lifetime of that version, which could be of the order of 3 or 4 months at the current pace of Kaldi development. Of course this would be a significant time and effort commitment, so IMO you (Dan) should absolutely NOT do this unless someone is willing to step up and take care of it.

The diagnostics stuff looks like something that could be automated, as already discussed. I haven't thought much about this (and the versioning stuff as a whole, admittedly), so it's quite possible I'm missing something, but is it really necessary to implement hooks etc? I mean, wouldn't it be possible to just figure out this stuff (i.e. date, hash, is-repo-dirty etc) and generate a header file at build time without any tighter integration with Git? Do we really need a "proper" version number? Wouldn't it be enough to have just the hash, the date, the number of (possible) user commits with respect to "master" and whether there are uncommitted changes? BTW it seems to me that the best place for this info is the options parsing code. It seems like the most robust option to me, even though it would potentially trigger recompilation of all the binaries. I understand Nagendra's concern about compilation time when you are developing or debugging, but I don't think we should be overly worried about this, because when you are developing you are usually only interested in a specific piece of code, so you can (I believe) just run e.g. "make nnet3bin" or even "make nnet3bin/nnet3-latgen-faster", which should not take that much time.

As for documenting the features, I wonder if a low-cost way, effort-wise, to achieve something like this is to adopt some sort of convention for the commit messages? For example "FEATURE: nnet architecture X is now supported", "ENHANCEMENT: feature X extended to handle situation Y" or "BUGFIX: small change in X to handle border-case Y". Those can then be parsed by a script and used to produce a documentation page as proposed.
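A script that parses such commit messages could be quite small. The Python sketch below groups commit subject lines by the three example prefixes. The prefix set, the "OTHER" bucket and the function name are assumptions, and the subjects are passed in as a plain list (they could come from, say, `git log --pretty=%s`).

```python
import re
from collections import defaultdict

# Prefixes taken from the examples in the comment; the exact set and
# spelling of any real convention would have to be agreed on.
_PREFIX_RE = re.compile(r"^(FEATURE|ENHANCEMENT|BUGFIX):\s*(.+)$")


def group_commit_subjects(subjects):
    """Group commit subject lines by their convention prefix.

    Subjects without a recognized prefix end up under 'OTHER'.
    """
    groups = defaultdict(list)
    for subject in subjects:
        subject = subject.strip()
        m = _PREFIX_RE.match(subject)
        if m:
            groups[m.group(1)].append(m.group(2))
        else:
            groups["OTHER"].append(subject)
    return dict(groups)


example = [
    "FEATURE: nnet architecture X is now supported",
    "BUGFIX: small change in X to handle border-case Y",
    "fix typo in README",
]
for kind, items in group_commit_subjects(example).items():
    print(kind, items)
```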

vince62s commented 7 years ago

I very much agree with @vdp. Working with another open-source project on github, I find it quite interesting to use the "release" concept given by github. Here is an example: https://github.com/ModernMT/MMT/releases where you can easily see all the commits in master since the last "release".

danpovey commented 7 years ago

My concern about using the 'releases' on github is, it seems to be geared towards having people download things as zipfiles. If they have a source tree that's not based on git, it would make it harder for people to receive updates in cases where they report a problem and it gets fixed. But I suppose it could be done as a form of documentation, and we can tell people that downloading the zipfile is highly discouraged. It seems that 'releases' are basically git tags (which are named pointers to specific commits), and github does some stuff on top of that to create download buttons, online documentation and the like.

I am open to the idea of maintaining a 'stable' branch to which only bug-fixes are given, but we need to see whether it makes more sense to have a 'stable' branch, or multiple branches for specific versions (only some of which would get actively updated with bug-fixes).

Vassil, when you say "wouldn't it be possible to just figure out this stuff(i.e. date, hash, is-repo-dirty etc) and generate a header file at build time without any tighter integration with Git"... the issue is that if you go back in the git history there is nothing that really identifies whether a particular commit was done "upstream" on github, or was done locally by the user, although it might be possible to add a particular string to the commit message that could be identified by a script, like [official] or something.

Here's a possible workflow, which would have version-specific branches:

On github we maintain branches for each major/minor version number that has been released except for the latest one, which would reside in master. We also occasionally create tags (possibly as releases) for a subset of the major/minor/patch version numbers. The patch version number gets incremented (we'd have to figure out how) every time we commit. I don't necessarily assume, here, that the version number exists as a file in the git repo, there may be some way to 'look it up', e.g. by counting the number of commits with a certain tag since the major/minor version number last changed. [So in that case there would be a file in the repo containing the major/minor version number, like 5.2, but the bugfix version number would be obtained via some kind of script and put into a file that's not tracked by git.]

For example, at some point in the future we might have branches for 5.0, 5.1, 5.2 and 6.0 (and assume the master's version number is 6.1). And we have tags for a subset of specific major/minor/patch version numbers, such as 5.0.0, 5.0.5, 5.0.11, 5.0.12, 5.1.0, 5.1.15, .... and so on.

For the most part I would only deal with master, but we create a system whereby the commit messages are edited by me or other committers in a way that can be (a) possibly parsed by a script to generate documentation, and (b) understood by the people who maintain the older branches. We'll have to figure out the details, but the idea is that I add some kind of string or parenthetical comment in the commit message that indicates whether it should be applied to older branches (or which parts of it, or to which older branches). The older branches could be maintained by students or collaborators- it would consist of merging selected upstream changes and maybe doing some basic testing [we might have to change the travis configuration].

I would prefer to discourage the use of older branches (i.e. not-master). The fact that most Kaldi users use master is actually helpful for the developers, because it means that bugs are quickly found, pull requests by users can be made without hassle, and we don't spend our time talking about errors that have already been fixed.

vince62s commented 7 years ago

If my understanding is correct, this is more or less what they do for tensorflow: keep a "branch" for each stable release, and track changes / patches / release candidates through "releases". This is also a good place to keep the change log documentation.

vdp commented 7 years ago

I personally use Git in a quite pedestrian way most of the time, so I'm not familiar with much of the deeper wizardry, but given it's a DVCS, where each individual repo stands on its own, I would be surprised if it doesn't provide a way to get the information we need. A quick search suggests that the command we are looking for might be "rev-list". For example, to obtain the list of changes since the last pulled version of "master" we can do something like:

git rev-list --pretty=oneline remotes/origin/master..HEAD

(of course for some advanced users the remote may not be named "origin", but we can easily obtain the name of the official Kaldi repo w/ "git remote -v")
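
A minimal sketch of the same idea that counts the commits instead of listing them. The function name count_local_commits and the default ref origin/master are illustrative assumptions, not anything that exists in Kaldi; --count is a standard git rev-list option.

```shell
# Print how many commits are reachable from HEAD but not from the upstream ref.
# The default "origin/master" is an assumption; pass another ref if the official
# Kaldi remote has a different name (see "git remote -v").
count_local_commits() {
  upstream=${1:-origin/master}
  git rev-list --count "$upstream..HEAD"
}
```

In a fresh clone with no local commits, count_local_commits would be expected to print 0.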

I still think that the implementation of "proper" stable branches may require too much "manual" support. I mean, the number of people contributing to Kaldi with any regularity is still relatively low. Wouldn't the support of stable branches stretch the limited developer bandwidth too much? Also, maybe we are overestimating the usefulness of this feature? For example, as discussed not long ago on the list, a bunch of companies are using Kaldi commercially. If they have not shown a desire to support something like this, then perhaps it's not that much of a pain point for them, and you'd be better off going for more "lightweight", maintainer-friendly options. Of course you have a much better view of the project, so it's something for you to decide. On the plus side, the serious bugs in Kaldi that make it to "master" don't seem to be that many (reflecting the high quality of the project as a whole), so perhaps I'm overestimating the effort required.

I think if you go for the "stable" branches option, it may be good, or indeed necessary given the "stability" goal, to have some sort of regression test suite (or whatever people call it) that goes beyond the unit testing. What I imagine would be good to have is a specially tailored "recipe" to be run before pushing the next patch to the stable branch. Of course I'm not talking about multi-thousand hour speed-perturbed Fisher English here, but rather something that would take no more than 2 hours or so to run, but would have good coverage of the features people care about the most.

I think it would probably be enough to support just one "stable" branch, in addition to "master", for about 4 months or so (could be longer if no significant features were added recently). This will be just enough to provide a period for "stabilization" of the new features, while at the same time not disincentivizing people too much from switching to a new version, which as you say is beneficial for the project moving forward.

gfkubala commented 7 years ago

You may want to take a look at this facility for versioning.

https://github.com/smessmer/gitversion

It appears to cover all the important bases.

danpovey commented 7 years ago

I'm thinking of adding strings containing version numbers into the commit messages (these could be added when squashing-and-merging), e.g. "[10.5.13] Fix minor bug in kaldi-table-inl.h" Or as a shorthand for when we can't be bothered to figure out what the current version number is, "[+1] Fix minor bug in kaldi-table-inl.h"
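
A rough sketch of how a script might pick such tags out of commit subject lines. The function names are hypothetical, and reading "[+1]" as "add 1 to the patch field of the previous version" is an assumption about how the shorthand would be interpreted.

```shell
# Print the version tag at the start of a commit subject line:
# "10.5.13" for "[10.5.13] ...", "+1" for "[+1] ...", nothing if untagged.
commit_version_tag() {
  printf '%s\n' "$1" | sed -n \
    -e 's/^\[\([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\)].*/\1/p' \
    -e 's/^\[\(+[0-9][0-9]*\)].*/\1/p'
}

# Resolve a tag against the previous version. Treating "+N" as "add N to the
# patch field" is an assumption, not something the proposal spells out.
resolve_version_tag() {
  prev=$1
  tag=$2
  case $tag in
    +*) printf '%s.%s\n' "${prev%.*}" "$(( ${prev##*.} + ${tag#+} ))" ;;
    *)  printf '%s\n' "$tag" ;;
  esac
}
```

For example, resolve_version_tag 10.5.12 "$(commit_version_tag '[+1] Fix minor bug in kaldi-table-inl.h')" would give 10.5.13 under that reading.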

and we could also have a file containing a fairly recent version number [not guaranteed to be fully recent], like 10.5.10. This could be used as a fallback option if the version number could not easily be worked out from git.

I agree that we shouldn't aim to provide bug-fixes for more than one or two stable versions.

danpovey commented 7 years ago

OK, some more specifics. @dogancan, I know you have been doing a lot lately, but do you have time to help with this? It would be nice to get this done before we check in your C++11 changes. This is a fairly limited and specific change to add version numbers, and doesn't include things like branches, tags and documentation.

To make the upgrade to using version numbers manageable, we need to add things gradually. First let's get a mechanism to keep track of a version number, and we can worry about the specifics of branches, tags, releases and documentation later on.

Let's start from version 5.0.0. I propose to create a file called src/.version that contains a recent version number, and a comment, e.g.

cat .version
5.0.0
# This file contains a recent version number, but to make life easier
# for the developers this file does not necessarily contain the most
# up-to-date patch number (the last of the 3 fields).  The actual
# version number is worked out by base/get_version.sh from the git 
# history.

I propose to have a script base/get_version.sh create a file base/version.h, which will only be included by kaldi-error.cc (to limit recompilation), containing the following type of contents:

/* This file is automatically created by ./get_version.sh.  It's only included by kaldi-error.cc */
#define VERSION_NUMBER "5.0.2"

get_version.sh will try to work out the version number from the recent git history (and it needs to be fast, as it will be invoked frequently); if this fails it will print a warning and back off to the .version file. If the version resulting from the git history is less than the version in the .version file, it will print a warning and maybe put a "?" at the end of the version-number string. We can later add various other strings here, including the git hash, to be printed out when we use the --version option. Also, for non-clean Kaldi versions we can consider having things like "5.0.2+5", meaning the latest 5 commits do not start with []. But for now that's not needed.
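
A minimal sketch of the fallback path only, assuming the .version layout shown above (version on the first non-comment line, "#" comments after it). The function names read_version_file and render_version_header are hypothetical; the git-history logic is deliberately left out.

```shell
# Print the first field of the first non-blank, non-comment line of a
# .version file, i.e. the recorded version number.
read_version_file() {
  awk '!/^#/ && NF { print $1; exit }' "$1"
}

# Print the contents of a version.h-style header for the given version string.
render_version_header() {
  printf '/* This file is automatically created by ./get_version.sh. */\n'
  printf '#define VERSION_NUMBER "%s"\n' "$1"
}
```

A script could then call render_version_header "$(read_version_file .version)" whenever the git-based lookup fails.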

The version number will be printed in the first of any log, warn or error messages printed from kaldi-error.cc (be careful not to introduce any new space-separated fields, that might affect downstream parsing of the logs).

The idea is that base/get_version.sh is invoked whenever "make" is invoked in src/ or base/ (also it should be run when people type "make depend", or that will produce errors). The script should initially write its output to a temporary file, then only move it to the final location if it differs from the current contents, to ensure that the date does not change if its contents do not change.
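
The write-to-temporary-then-move step could look roughly like this. It is only a sketch: the function name install_if_changed is hypothetical, and it relies on the standard cmp and mv utilities.

```shell
# Move the freshly generated file $1 over $2 only when the contents differ.
# When they are identical, $2 is left untouched so its modification time does
# not change and make does not see a reason to recompile.
install_if_changed() {
  new=$1
  dest=$2
  if [ -f "$dest" ] && cmp -s "$new" "$dest"; then
    rm -f "$new"
  else
    mv "$new" "$dest"
  fi
}
```

A typical call would write the new header to something like version.h.tmp and then run install_if_changed version.h.tmp version.h.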

This will make compilation with static libraries slow any time Kaldi is updated (since it would change base/kaldi-base.o). Let's just recommend dynamic libraries for now, and we can worry about the speed later if people turn out to have use-cases where they really can't use dynamic libraries and compilation speed is a problem.

vdp commented 7 years ago

This is hardly the most pressing issue, but IMO a version with major/minor numbers based on date, e.g. 2017.04 or 17.04 (as in Ubuntu), is more expressive than just using an index whose meaning is not immediately clear. We could even "compress" it a little more, considering the first public Kaldi release was in 2011. If we imagine version 1.0 was released in that year, the major version for 2017 could be just 7. Personally I think Ubuntu's numbering scheme (i.e. using year 2K as a sort of 'epoch') would be better though.

Also, maybe it would be good to put the Git hash (or tag) corresponding to the version in this ".version" file. I'm not sure about the specific commands, but it would probably be possible to figure out if there are user-specific commits using Git's graph traversal facilities. If my understanding is correct, you propose to figure this out based on tags in the commit messages, but this seems somewhat brittle and error-prone.

danpovey commented 7 years ago

Regarding version numbers based on date-- I don't like this idea as it removes our freedom to increment them when we want to.

I agree that putting the git hash of the commit corresponding to the version in the .version file is a good idea; it can have that as a second field after the version number. This will make the script's job easier in figuring out how clean/dirty the repo is (e.g. whether there has been rebasing). Of course this only works for fast-forward commits, otherwise the hash would change; but we won't need to write the .version file very often, so it's OK for me to do the merge manually in my repo and then either push it or make a PR that I know will be fast-forward. We can even update the travis script to fail quickly if the git hash is not correct.

Regarding the fact that figuring out the version based on tags is error-prone-- yes, it would be, but we have a good prior of what the tag will be, based on the fact that it should be a minor version number not too much larger than the version in .version. If things appear to be seriously messed up I plan to have the version number be printed as the number in .version with a '?' after it, e.g. "5.12.3?". In general these things can get messed up in arbitrarily complex ways and there is no way to truly handle all the cases, so I only plan to handle the common cases and back off gracefully. The worst thing that will happen is that a slightly inaccurate or out-of-date version number will be printed-- most people will never even notice.
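
A sketch of that back-off, covering only one "messed up" case: the version derived from git is older than the one recorded in .version. The function names are hypothetical and the comparison assumes plain three-field numeric versions.

```shell
# Succeed when dotted version $1 sorts strictly before $2, comparing the
# three fields numerically.
version_lt() {
  [ "$1" != "$2" ] || return 1
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)
  [ "$lowest" = "$1" ]
}

# $1: version worked out from the git history; $2: version recorded in .version.
# Prints $1 normally, or the recorded version followed by "?" when $1 looks stale.
sanity_checked_version() {
  if version_lt "$1" "$2"; then
    echo "warning: git-derived version $1 is older than .version ($2)" >&2
    printf '%s?\n' "$2"
  else
    printf '%s\n' "$1"
  fi
}
```

Other failure modes, such as a derived version that is implausibly far ahead of .version, would need their own checks.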

danpovey commented 7 years ago

I just realized that you can't get the hash of a commit before committing it, so adding the hash of the current commit is not workable, but adding the hash of its parent is workable. We could just use the hash of either parent in case of merges, the script would be able to handle that. [The only point is to verify whether the commit history at that point was changed via rebasing or other magic since the time it was originally committed to the kaldi master.]

dogancan commented 7 years ago

I have a few questions/comments.

1. Why do we need a base/.version file? Can't we just use git tags to mark versions and use git-describe to retrieve the latest tag reachable from HEAD? It seems git-describe already does something similar to what we are trying to achieve with suffixes although there is no distinction between remote (clean?) commits and local (non-clean?) commits in the case of git-describe.
2. How do we count how many commits there are between the checkpoint commit (corresponding to the version in base/.version) and the HEAD if the history is not linear?
3. I don't like depending on commit messages to figure out version information. It seems like a fragile convention and I have a feeling it will put unnecessary strain on maintainers for little gain.

danpovey commented 7 years ago

1.

Why do we need a base/.version file? Can't we just use git tags to mark versions and use git-describe https://git-scm.com/docs/git-describe to retrieve the latest tag reachable from HEAD? It seems git-describe already does something similar to what we are trying to achieve with suffixes although there is no distinction between remote (clean?) commits and local (non-clean?) commits in the case of git-describe.

I'm open to using tags like that. But I don't want to have a script that's part of the build process require the repo to have been checked out using git-- people might have downloaded a zip file, or be using another type of repo, and I want it to be possible in principle to still build. So it should do something reasonable in that case; a .version file might be necessary anyway as a backup. Also, do people by default download all the tags when they clone the repo from github? I would have thought they just clone the master.

The issue with relying too strongly on tags, and this also speaks to your comment about relying on tags in commit messages, is-- I don't want to have to add a tag for every single commit to the kaldi repo. That would lead to an unacceptable number of tags accumulating. So we can't rely purely on tags to figure out whether the recent history was part of the "official" kaldi repo. I'm open to adding tags for all the major/minor version numbers though.

2.

How do we count how many commits there are between the checkpoint commit (corresponding to the version in base/.version) and the HEAD if the history is not linear?

This isn't important- we can just decide, and put some kind of markers in the string to indicate that something was unusual.

3.

I don't like depending on commit messages to figure out version information. It seems like a fragile convention and I have a feeling it will put unnecessary strain on maintainers for little gain.

The problem that I was trying to solve with that is-- when people ask questions on the list I want to get a sense for what version of Kaldi they are using and (approximately) how much they have modified it from the upstream version-- and I don't want to tag every single Kaldi commit; and I want it to do something reasonable in various merge scenarios. For example, if I were to see a version number like 5.1.12+3u6 indicating that the most recent "official" version they merged with was 5.1.12, but they have 3 user commits, and 6 untracked files, it would give a lot of information.

The solution doesn't have to do exactly the right thing in all scenarios, it just has to do enough to be useful, and degrade reasonably gracefully (e.g. I don't want situations where it goes trawling through the entire git history looking for something that might not be there).

Think about it and see if you can come up with a plan.

dogancan commented 7 years ago

On Dec 27, 2016, at 2:53 PM, Daniel Povey notifications@github.com wrote:

1.

Why do we need a base/.version file? Can't we just use git tags to mark versions and use git-describe https://git-scm.com/docs/git-describe to retrieve the latest tag reachable from HEAD? It seems git-describe already does something similar to what we are trying to achieve with suffixes although there is no distinction between remote (clean?) commits and local (non-clean?) commits in the case of git-describe.

I'm open to using tags like that. But I don't want to have a script that's part of the build process require the repo to have been checked out using git-- people might have downloaded a zip file, or be using another type of repo, and I want it to be possible in principle to still build. So it should do something reasonable in that case; a .version file might be necessary anyway as a backup. Also, do people by default download all the tags when they clone the repo from github? I would have thought they just clone the master.

Hmm. I haven’t thought of the non-git based setups. Having a base/.version as a backup makes sense. I suppose in those setups we simply print whatever is inside base/.version.

Yes, cloning by default includes all tags.

The issue with relying too strongly on tags, and this also speaks to your comment about relying on tags in commit messages, is-- I don't want to have to add a tag for every single commit to the kaldi repo. That would lead to an unacceptable number of tags accumulating. So we can't rely purely on tags to figure out whether the recent history was part of the "official" kaldi repo. I'm open to adding tags for all the major/minor version numbers though.

I wasn’t suggesting adding a tag on each commit. I agree with your proposal of only marking major.minor number and figuring out patch number locally.

2.

How do we count how many commits there are between the checkpoint commit (corresponding to the version in base/.version) and the HEAD if the history is not linear?

This isn't important- we can just decide, and put some kind of markers in the string to indicate that something was unusual.

3.

I don't like depending on commit messages to figure out version information. It seems like a fragile convention and I have a feeling it will put unnecessary strain on maintainers for little gain.

The problem that I was trying to solve with that is-- when people ask questions on the list I want to get a sense for what version of Kaldi they are using and (approximately) how much they have modified it from the upstream version-- and I don't want to tag every single Kaldi commit; and I want it to do something reasonable in various merge scenarios. For example, if I were to see a version number like 5.1.12+3u6 indicating that the most recent "official" version they merged with was 5.1.12, but they have 3 user commits, and 6 untracked files, it would give a lot of information.

The solution doesn't have to do exactly the right thing in all scenarios, it just has to do enough to be useful, and degrade reasonably gracefully (e.g. I don't want situations where it goes trawling through the entire git history looking for something that might not be there).

Think about it and see if you can come up with a plan.

I’ll get started with what git-describe provides by default and we can go from there.
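
As a rough illustration of what git-describe output could be turned into: the sketch below assumes major.minor tags named like "5.0" and treats the commit count that git describe appends as the patch number. Both the tag naming and the function name version_from_describe are assumptions.

```shell
# Convert git-describe output into a three-field version string.
# "5.0-12-gac48cfb" (12 commits after tag 5.0) becomes "5.0.12";
# a bare tag such as "5.0" becomes "5.0.0"; a trailing "-dirty" is ignored.
version_from_describe() {
  desc=${1%-dirty}
  case $desc in
    *-[0-9]*-g[0-9a-f]*)
      rest=${desc%-g*}
      printf '%s.%s\n' "${rest%-*}" "${rest##*-}" ;;
    *)
      printf '%s.0\n' "$desc" ;;
  esac
}
```

It would be fed something like the output of git describe --tags --dirty; when that command fails (for example in a source tree that is not a git checkout), the caller would fall back to the .version file instead.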

I have a feeling we have a case of feature creep here :) Maybe we should address this problem of identifying local modifications separately from versioning, e.g. with a debugging script that will print all relevant info.

danpovey commented 7 years ago

Yes, you're probably right about feature creep regarding local modifications. Maybe we should be satisfied with working out the Kaldi version number.

Dan

vdp commented 7 years ago

I think that the debugging script proposed by Dogan (maybe something like misc/bugreport.sh?) would be nice to have in any case, for determining whether the user has made changes to the scripts after the binaries were compiled. In many cases this will result in exceptions in the binaries, due to incorrect input, but not always.

IMO having a non-Git distribution option would be a mistake, because it would make the maintainers' work harder for no clear benefit. The GitHub releases look nice, because they would provide a place for, say, posting the changelog, but I think if we maintain such a log, it can be linked from the project's README.md on GitHub and/or formatted as a Doxygen document etc.

BTW guys, perhaps it would be easier to pin down the requirements and discover flaws in our reasoning if we write down the full story of how we imagine the versioning will work. Dan, if you agree with this, perhaps you can start writing a markdown document (e.g. in src/doc), and open a pull request. That way, if someone has an idea or notices a problem, (s)he will be able to write inline comments (i.e. the same process already used to review code). As an example:

The principal stakeholders for the project are the project M(aintainer), stable B(ranch) maintainer, D(eveloper) and U(ser).

Initial state of the source tree:

--a--b-->(HEAD of master)
 \
   \
     - v5.0.0 - v5.0.1-->(HEAD of Kaldi 5.0)

(the patch tags don't really exist; in the repo these are just ordinary commits)

D notices a bug, and opens a pull request against master
M reviews the pull request, and merges it in master, with a comment like

BUGFIX[5.0]:

to notify B that (s)he needs to apply this bugfix to the stable branch

And so on and so forth. Just an idea that occurred to me, feel free to ignore..

jtrmal commented 7 years ago

I'd be in favor of by-default automatic versioning, i.e. each month a new tag would be created -- 2016.12.0, 2017.01.0 and so on. In cases when the version has to be bumped up manually, we can bump the sub-minor version, i.e. create something 2017.01.1,, 2017.01.2 and so on. The final version might be something like 2017.01.01~5 (#shorthash) where 5 is number of commits from 2017.01.01..HEAD and #shorthash is the HEAD checksum Or we can do as @vdp was mentioning and count the versions starting from the year of conception. I don't particularly care On overall I think the smarter versioning the more hassle to keep it consistent or even maintain it at all.

To get things rolling, I can investigate that -- I have already some infrastructure on openslr, including hooks.

What I'm not sure about is what happens if a user makes a git clone and changes master in his/her local clone. SVN was quite straightforward, but I'm not sure about Git's decentralized layout.

y.

danpovey commented 7 years ago

@jtrmal, sorry, by now I'm set on having a semantic-versioning style of versioning. Versioning by date like that will make things weird in case we want to apply bug-fixes to older branches. Imagine that at some point in the future we decide we want to rebuild core parts of Kaldi from the ground up but maintain fixes to the older version of Kaldi for people that have built stuff around it. Having semantic versioning will make such things much more natural. The way I plan to do the versioning, it won't be as painful as you think.

danpovey commented 7 years ago

OK guys. I think I've come up with a form of versioning where it will be trivial to figure out the version number. The major/minor version number is stored in src/.version; for example, 5.1. The patch number is simply the number of commits since the last time src/.version was revised, which we can figure out with git commands git log -1 src/.version to get the commit, and git rev-list [that commit]..HEAD | wc -l. We don't attempt to distinguish between user-level and official commits in working out the patch number, we make that a separate issue. But we might add ~7 to the version number to indicate that there are 7 untracked changes to versioned files in src/.
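The scheme described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script that was eventually written: the function names `format_version` and `kaldi_version` are hypothetical, it assumes it is run from the top of a git checkout that contains src/.version, and it uses `git rev-list --count` in place of piping `git rev-list` into `wc -l`.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of Kaldi.

# Assemble "major.minor.patch[~N]" from its parts.
#   $1 = major.minor (the content of src/.version, e.g. 5.1)
#   $2 = number of commits since src/.version last changed
#   $3 = number of tracked files in src/ with uncommitted changes
format_version() {
  local base="$1" patch="$2" dirty="$3"
  if [ "$dirty" -gt 0 ]; then
    echo "${base}.${patch}~${dirty}"
  else
    echo "${base}.${patch}"
  fi
}

# Git-dependent part; must be run from the top of the checkout.
kaldi_version() {
  local base commit patch dirty
  base=$(head -n 1 src/.version) || return 1
  # Last commit that touched src/.version.
  commit=$(git log -1 --pretty=format:%H -- src/.version) || return 1
  # Number of commits made after that one, up to HEAD.
  patch=$(git rev-list --count "${commit}..HEAD") || return 1
  # Tracked files under src/ that differ from HEAD in the working tree.
  dirty=$(git diff-index --name-only HEAD -- src/ | wc -l | tr -d ' ')
  format_version "$base" "$patch" "$dirty"
}
```

Calling `kaldi_version` inside a checkout prints a single string such as the 5.1.13~7 form used in the examples in this thread; `format_version` is split out so the string assembly can be checked without a repository.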

For downloads as zip files, if we ever choose to enable that, we'd just manually add the minor version number.

We'll eventually keep track of versions using branches and tags, but we can worry about the specifics of that later.

Everyone happy?

jtrmal commented 7 years ago

Ok for me. Y.

vdp commented 7 years ago

OK with me too, FWIW. Using Git tags instead of ".version" would have been more idiomatic probably, but the proposed implementation is simple and on the face of it looks like it could work, so..

The distinction b/w "official" and local commits would be probably important for the maintainers, because obviously it matters whether the user has version 5.0.7 or version 5.0.2 with 5 custom commits on top of it. One clumsy, but seemingly workable way, to determine the local-only commits could be as follows:

use "git remote -v" to determine the remote name for the official repo
iterate over the commits returned by the "git rev-list" command you propose, and issue "git branch -r --contains [hash]" for each of them. For the "official" commits it will return a list of branches that includes ${official_remote}/[some-branch].

vdp commented 7 years ago

Actually, the above method seems to be slowish and could be problematic when you have to check a lot of commits (e.g. when you are working with the 'master' for which .version hasn't been updated for a few months). Instead, you can try:

    git cherry ${official_remote}
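As a rough sketch of how the `git cherry` output could be turned into a count of local-only commits: `git cherry <upstream> <head>` prints one line per commit on `<head>` that is not on `<upstream>`, prefixed with `+` when no equivalent change exists upstream and `-` when one does. The function names and the default upstream ref `origin/master` below are assumptions made for illustration; the ref would have to match whatever the official remote is called locally.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the default ref are hypothetical.

# Count the lines of `git cherry` output (read from stdin) that start
# with '+', i.e. commits with no equivalent on the upstream branch.
count_local_only() {
  # grep -c prints 0 but exits non-zero when nothing matches.
  grep -c '^+' || true
}

# Git-dependent part: $1 is the upstream ref to compare HEAD against.
local_only_commits() {
  local upstream="${1:-origin/master}"
  git cherry "$upstream" HEAD | count_local_only
}
```

A result of 0 from `local_only_commits` would suggest the checkout carries no custom commits relative to that upstream branch.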

danpovey commented 7 years ago

OK, for now I want to treat the issue as decided: we'll have a src/.version file, and we'll use commands like git log -1 src/.version to get the commit, and git rev-list [that commit]..HEAD | wc -l to figure out the patch number. We will later create tags too, but tags are a little more problematic in cases where, for instance, there are multiple remotes, or people cloned just a specific branch from the remote, not the whole repo. So tags won't be the mechanism that the script that gets the version number uses to find the patch number.

The version number (and also the git hash of the most recent commit) will be included in base/version.h in the way we previously discussed, e.g.

// version number of 5.0.0 would mean a clean src/ directory;
// 5.0.0~9 would mean the src/ directory has 9 untracked changes
// to versioned files in src/
#define KALDI_VERSION_NUMBER "5.1.1~7"
// shortened git hash of current commit
#define KALDI_GIT_HASH_SHORT "a9bd1"
// long git hash of current commit
#define KALDI_GIT_HASH "a9bdebb314913d90d23bb89d45ad3bfc34dd0a9d6cfg21"

We can later add more information, e.g. about which files have untracked changes, to be printed when someone uses the --version option. (and we can later add the version option).
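A script could generate a header like the one shown above along these lines. This is a minimal sketch under assumptions: the function names are hypothetical, the output path is passed in by the caller, and the 5-character short hash simply mirrors the example. It rewrites the file only when the content changes, so an unchanged version does not touch the file's timestamp.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of Kaldi.

# Print the header text for a version string ($1) and a full git hash ($2).
render_version_header() {
  local version="$1" hash="$2" short
  short=$(printf '%.5s' "$hash")   # first 5 characters of the hash
  printf '// This file is generated automatically; do not edit.\n'
  printf '#define KALDI_VERSION_NUMBER "%s"\n' "$version"
  printf '#define KALDI_GIT_HASH_SHORT "%s"\n' "$short"
  printf '#define KALDI_GIT_HASH "%s"\n' "$hash"
}

# Write the header to $1, replacing the existing file only if it differs.
update_version_header() {
  local out="$1" version="$2" hash="$3"
  local tmp="${out}.tmp"
  render_version_header "$version" "$hash" > "$tmp"
  if [ -f "$out" ] && cmp -s "$tmp" "$out"; then
    rm -f "$tmp"        # identical content: keep the old file untouched
  else
    mv "$tmp" "$out"    # new or changed content: install it
  fi
}
```

For example, `update_version_header base/version.h "$(kaldi_version)" "$(git rev-parse HEAD)"` would be one way to call it, where `kaldi_version` stands for whatever command produces the version string.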

For now, let's just make the following change to the logging output. The FIRST TIME a program prints a log, warning, error message or assert failure, instead of, say:

    LOG (nnet3-merge-egs:main():nnet3-merge-egs.cc:126) Merged 4000 egs to 125.

it will look like this:

    LOG (nnet3-merge-egs[5.1.13~7,a9bd1]:main():nnet3-merge-egs.cc:126) Merged 4000 egs to 125.

@dogancan, do you have time to work on this? We can start with a version of 5.0.0. Don't implement the --version option, that's easy and I can easily find someone to do that.

dogancan commented 7 years ago

Yes, I will start soon and when I do I will also open a WIP pull request for comments.

danpovey commented 7 years ago

@dogancan, do you have any progress on this? I know it's New Year's Day, but there are a few major changes that I don't want to check in before we've got the versioning squared away (your C++11 PR, and some fairly large changes to the nnet3 setup that I'm working on).

dogancan commented 7 years ago

I started working on it. I don’t think it will take long before I open a pull request with a bare bones implementation.

danpovey commented 7 years ago

@dogancan, I'm getting this on my mac:

    ./get_version.sh: line 50: syntax error in conditional expression: unexpected token '('

Code is:

    elif [[ $version != +([0-9]).+([0-9]) ]]; then

    bash --version
    GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin15)
    Copyright (C) 2007 Free Software Foundation, Inc.

dogancan commented 7 years ago

Ah, my bad. Just opened #1327 to address this.
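For context, the `+( ... )` pattern in the failing line is a bash extended glob, and older bash releases such as the 3.2 reported above appear to reject it inside `[[ ]]` unless `shopt -s extglob` has been enabled earlier in the script. One way to avoid the dependency altogether is to do the check with `case` and `grep -E`, as in the sketch below. This is an illustration only and not necessarily what #1327 does; the function name `is_major_minor` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: a major.minor check that does not rely on extglob.

# Succeeds if $1 looks like "<digits>.<digits>", e.g. 5.1.
is_major_minor() {
  case "$1" in
    '' | *[!0-9.]*) return 1 ;;   # empty, or contains something other than digits and dots
  esac
  # Exactly one dot, with at least one digit on each side.
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+$'
}
```

With this helper, the failing conditional could be written as `elif ! is_major_minor "$version"; then`.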

kkm000 commented 7 years ago

Sorry, I am late to the party, as usual. FWIW, the standard way to ID releases under Git is git describe. The "long" format is <tag>-<num_commits>-g<sha1>; for example, x104-0-g736578a means 0 commits from (= exactly at) tag x104, commit hash 736578a. In the "short" format, only the <tag> is printed if num_commits is 0. For the above, the short descriptive format would be just x104.

This is close to what @danpovey suggested, but more precise: "n>0 commits from a given tag" can be ambiguous on its own, because multiple graph paths may descend from the tagged commit, and the hash removes that ambiguity. The format does not track working directory modifications, though.

A tag somewhere up the history is essential. git describe can be forced into using a branch name, but it makes no sense, since branch head is by definition volatile. A tag can also be forcibly moved, this is why the long format is considered more stable.

Git parses back any format of the output of git describe as its committish, e. g. git checkout x104-0-g736578a works.
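As an illustration of the long format described above, the sketch below splits a `<tag>-<num_commits>-g<sha1>` string into its three parts. It parses from the right, because tag names may themselves contain hyphens. The function names are hypothetical, and `--tags` is added on the assumption that lightweight tags should be considered as well as annotated ones.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Split "<tag>-<num_commits>-g<sha1>" and print "tag count sha".
parse_describe() {
  local desc="$1" sha rest count tag
  sha="${desc##*-g}"     # text after the last "-g"
  rest="${desc%-g*}"     # everything before the last "-g"
  count="${rest##*-}"    # text after the last remaining hyphen
  tag="${rest%-*}"       # everything before that hyphen
  echo "$tag $count $sha"
}

# Git-dependent part: describe HEAD in the long format.
describe_head() {
  git describe --long --tags
}
```

In a repository with a suitable tag in its history, `parse_describe "$(describe_head)"` would print the tag, the commit count and the abbreviated hash separated by spaces.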

danpovey commented 7 years ago

Hm, it might be possible to move to that convention, at least I wouldn't object (although it's not the highest on my priority list right now).

It would require that the major/minor versions like 5.0, 5.1 and so on be identified with tags, which I suppose wouldn't be a problem-- although we currently have branches corresponding to each version, named "5.0", "5.1" and so on, so we'd have to choose a different name for the tag, I guess, to avoid a conflict. [the branches, of course, point to the most recent commit for each major/minor version.]

kkm000 commented 7 years ago

Technically you can have both a tag and a branch with the same name, but I would not.

How about a v prefix on the tag, like v5.0?

dogancan commented 7 years ago

I think we already discussed using git tags earlier in this thread and eventually settled on using the src/.version file so that versioning would do something reasonable even when kaldi installation is not a git repo.

kkm000 commented 7 years ago

@dogancan: Right, I missed that. I do not understand one thing, though. The patch number obtained by counting commits is unavailable from the .version file alone, without a Git repository, and base/version.h will have to be pre-built anyway for such a distro. So why is the .version file needed at all, other than to name a specific commit with its content? It looks like a homebrew Git tag. Am I missing any other function of this file?

Setting the tag at the commit that changes the .version file makes git describe actually report the patch number per Dan's counting scheme.

danpovey commented 7 years ago

oh I forgot that.

dogancan commented 7 years ago

@kkm000 If the kaldi installation is a git repo, src/.version is not needed. The scheme we have now is indeed a homebrew git tag implementation. I think the reasoning was DRY: since we need src/.version for non-git installations, let's not introduce another base version tagging scheme via git tag, which may lead to inconsistent version numbers if the maintainer forgets to increment both sources of base version numbers simultaneously.

kkm000 commented 7 years ago

@dogancan I see. I should go through and absorb this discussion more carefully. You already mentioned git-describe, and I mainly chimed in because I was under the impression it had not been mentioned.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.