Closed flippedcoder closed 8 months ago
This is by design and is documented: https://dvc.org/doc/command-reference/metrics#supported-file-formats (yaml 1.2 is also specified for plots and params)
See also https://github.com/iterative/dvc/issues/4281 and the associated PR
Regarding the checkpoints tutorial, we should really be using ruamel and not PyYAML in any of the iterative example repos for this specific reason.
@skshetry I know you suggested this issue. Was there something you had in mind?
Edit: I missed your response in Slack, which I'm pasting below.
@Dave Berenbaum, it's more of a compatibility issue. The data science community and the python ecosystem in general is stuck in YAML1.1. We could try to be compatible by generating exponents with the dot at least, that way we could be compatible on both sides. I am not sure how straightforward the fix this will be though. (and this should only affect --set-param, not the dvc.lock and yaml files).
Also related to #5777. Both issues seem to stem from dvc parsing the provided values instead of replacing them as is.
The data science community and the Python ecosystem in general is stuck in YAML1.1.
Agreed, and it's not obvious when you get the error that it's a YAML 1.1/1.2 issue. We can change the tutorials, but it doesn't help users who run into the same error.
We could try to be compatible by generating exponents with the dot at least, that way we could be compatible on both sides.
That seems like a viable solution.
We can change the tutorials, but it doesn't help users who run into the same error.
Yes. The tutorials can help mitigate the issue but it isa not a solution. A good portion of users will get the error even the ones who do not use Python at all.
DVC is not compatible with yaml 1.1
This is by design and is documented
Agree. Maybe the problem (or part of) begins with the crash and tracelog, which seems to indicate that the error happened in the user code. I also wonder what it would look like if the user code wasn't Python...
Could we start by catching it somehow and presenting a meaningful error message like ERROR: Please update your 'param.yam' file to YAML 1.2
? To my mind that would remove the "bug" part from this issue.
The error here is happening in the (example) user code - the params.yaml is already YAML 1.2, the problem is that the user tries to open it with PyYAML (which only supports 1.1).
I see. I thought this only happened when the user code is run via exp run
(not manually, for example).
Yeah nvmd I got it now: exp run -P
makes the params file YAML 1.2, which causes the cmd
to fail.
Since 0.000001
is valid in both YAML 1.1 and 1.2, how is a user supposed to know that DVC will convert this to 1e-06
? I don't think it's clear how the values passed to -S
get replaced. What about non-YAML params files where it's not obvious that the values in -S
are first parsed as YAML? Are the specs for all of the supported formats so similar that this will never be an issue?
Other than the CLI value for -S
being parsed as YAML, this issue doesn't affect non-YAML params files. They are written back into the params file using libraries specific to each format, since each format has its own specification for how a floating point number should be serialized.
The point is that however ruamel decides to serialize 0.000001
(in decimal or exponential form) it should not matter since they are both equivalent and valid YAML 1.2. If the user wants to use YAML params files, they should be processing them as YAML 1.2 in their own code.
edit: regarding parsing scalar values passed to -S
according to YAML 1.2 rules using ruamel, this should not cause any issues with users writing values "intended" to be JSON scalars. YAML 1.2 specifically exists to bring it in line with JSON (JSON is a formal subset of YAML 1.2). Parsing them as YAML 1.1 would cause issues, as 1e-06
is a float in JSON (and YAML 1.2), but is a string in YAML 1.1
for TOML, they only implement a partial set of the JSON rules for parsing scientific notation. For any users that are actually using TOML params, they will likely be entering floats as decimals in the first place (because of TOML's very limited support for scientific notation), in which case we will still handle them properly.
Also, regarding:
We could try to be compatible by generating exponents with the dot at least, that way we could be compatible on both sides. I am not sure how straightforward the fix this will be though. (and this should only affect --set-param, not the dvc.lock and yaml files).
I don't think having a modified YAML serialization library specific to --set-param
is the answer. It does not make sense for DVC to be mixing different YAML versions between params files modified with --set-param
, dvc.lock
and dvc.yaml
If the user wants to use YAML params files, they should be processing them as YAML 1.2 in their own code
But I think that is the problem. We are expecting/forcing users to do that.
I don't see why that is a problem, or how it is any different from how we also force users to format the data within their JSON/Python/TOML params and metrics files in a specific way.
e: If the user's existing params or metrics data is structured with a list as the top-most node, we force them to change it, as DVC params/metrics only work if the top-most node is a key:value dictionary. IMO requiring a specific data format (YAML 1.2) is just as reasonable as requiring that the data be structured in a specific way. Either way, we are forcing the user to make changes to their existing code for handling their params/metrics.
Also, regarding:
We could try to be compatible by generating exponents with the dot at least, that way we could be compatible on both sides. I am not sure how straightforward the fix this will be though. (and this should only affect --set-param, not the dvc.lock and yaml files).
I don't think having a modified YAML serialization library specific to
--set-param
is the answer. It does not make sense for DVC to be mixing different YAML versions between params files modified with--set-param
,dvc.lock
anddvc.yaml
I don't think the suggestion requires mixing YAML versions since the proposed format is valid YAML 1.2, nor does it have to only apply to --set-param
. Is there some spec or standard about the preferred format for floats in YAML 1.2? See https://yaml.org/spec/1.2/spec.html#id2804092, which details the regex match for floats but doesn't recommend any format. If not, it seems like this is just a choice made by ruamel.yaml
(which as far as I can tell just uses the Python repr/str format), which isn't apparent to users.
I know it's messy to have DVC define the format (maybe we can raise the idea of a backwards-compatible style with the ruamel.yaml
maintainers if you feel that strongly), but I don't think we can ignore the usability concerns strictly out of principle. I can't even find a YAML 1.2 parser for R, for example. Open to other solutions, but I don't think doing nothing is an option.
So are we going to document that params and metrics files should be YAML 1.1, but dvc.lock
and dvc.yaml
should be YAML 1.2? This seems like a weird inconsistency for DVC to have.
Also, what do we do if a user does dvc run --set-param 1e-06
? If we are going off of YAML 1.1 rules, this is explicitly a string. If we then go and serialize this as a string in a JSON file, it will break the user's code.
I don't see why that is a problem
@pmrowla this was deemed problematic by leadership AFAIR. I would agree that it's not great UX, especially if YAML 1.1 is still dominant (e.g. yaml
Python module). And it's very hard to understand what went wrong in these situations.
we also force users to format the data within their JSON/Python/TOML params and metrics files
Not sure that's formatting if you're referring to data structure / schema.
requiring a specific data format is just as reasonable as requiring that the data be structured in a specific way
Schema reqs are separate from the format version, I think. The problem here is that exp run -S
updates a dependency (in a way), and the project may not be able to support it.
But good Q: can the same issue happen with other formats? e.g. you use a py2 params file and exp run -P
turns it into py3. If so maybe this issue it a bit broader than just YAML...
document that params and metrics files
shouldcan be YAML 1.1, but dvc.yaml should be YAML 1.2? This seems like a weird inconsistency
🤺 good point, that does seem a little inconsistent. However I don't think dvc.yaml (a DVC-specific schema) and params/metrics files (glorified INI files) or plots files (data maps) are in the same category. We could simply not specify a version for all the latter (we currently don't for plots BTW).
what do we do if a user does dvc run --set-param 1e-06
🤺🤺🤺
And it's very hard to understand what went wrong in these situations.
Is there at least something we can do about this ☝️ ?
🤺 good point, that does seem a little inconsistent. However I don't think dvc.yaml (a DVC-specific schema) and params/metrics files (glorified INI files) or plots files (data maps) are in the same category. We could simply not specify a version for all the latter (we currently don't for plots BTW).
It does matter for params/metrics/plots though, since the YAML version affects whether or not some values are treated as floats or strings. We show (numeric) deltas in diffs, and we plot numeric values. Whether the parsed type of 1e-06
is a supposed to be string or a float affects how it can (or cannot) be diffed and plotted.
The type also affects DVC pipeline behavior. If a DVC-tracked YAML 1.2 param dependency is written to dvc.lock
as the floating point value 0.000001
, changing it to 1e-06
in params.yaml
would not trigger the stage to be run during dvc repro
(since the param values are numerically equivalent). If we are handling params as YAML 1.1, then changing it to 1e-06
in params.yaml
should trigger the stage to be rerun, since the floating point number 0.000001
is not equal to the string "1e-06"
Is there at least something we can do about this ☝️ ?
There is no way for us to tell what YAML version a user intended unless they explicitly use the optional
%YAML 1.1
or %YAML 1.2
identifier at the top of their files (which no one ever uses, and almost no YAML serialization library writes by default).
Just to be clear, I'm not a data scientist, and at the end of the day it doesn't really matter to me whether we use YAML 1.1 or 1.2. The main issue here is that we need to pick one and explicitly tell people that's what they need to use. Trying to halfway support both 1.1 and 1.2 should not be an option for the reasons I've already outlined.
I do think it makes more sense for us to use 1.2 since 1.2's parsing rules are consistent with JSON's, and removes any ambiguity about how exp run --set-param
values should be parsed. (And we already had a similar 1.1 vs 1.2 discussion 6 months ago and decided to go with 1.2, and documented that decision)
We could try to be compatible by generating exponents with the dot at least, that way we could be compatible on both sides. I am not sure how straightforward the fix this will be though. (and this should only affect --set-param, not the dvc.lock and yaml files).
@pmrowla This suggestion still uses YAML 1.2. I don't think anyone has suggested that DVC use YAML 1.1, just that users may be doing so.
The only issues I can find with this suggestion are:
ruamel.yaml
project since it doesn't look like there's any particular justification for their current format, and I don't see much value in being intentionally backwards-incompatible when equally viable backwards-compatible solutions exist.--set-param
and other files like dvc.lock
. Again, I'm open to discussing whether it should apply everywhere.Let me know if you see other issues or have other suggestions.
@dberenbaum it still doesn't address the issue of how DVC should parse exponents that a user enters without the dot on the command line via --set-param
or in YAML params/metrics files (when diffing/plotting)?
If a user that is using YAML 1.1 enters --set-param foo=1e-06
we will parse it as a YAML 1.2 float. On serialization (with the proposed change), we will we change it to foo: 1.0e-06
to keep backwards compatibility with YAML 1.1's float syntax. Except that what we really just did was break the user's YAML 1.1 file by changing an explicit string to an explicit floating point number
I don't think anyone has suggested that DVC use YAML 1.1, just that users may be doing so.
This is the actual problem (that the expectation is it's ok for DVC to use 1.2 while users are on 1.1). Setting aside the floating point edge case discussion, if DVC is not using the same YAML version as the user, and tries to do "backwards compatible YAML 1.2", it introduces ambiguity and potentially unexpected behavior with regard to pipeline reproduction.
If a user is mixing YAML 1.1 boolean strings in their params files it will break DVC behavior. In YAML 1.1, yes/true
and no/false
are supposed to be equivalent when DVC evaluates whether or not a param dependency has changed (and a stage needs to be rerun). If DVC just parses everything as YAML 1.2, yes
and true
are no longer equivalent, since yes
is now an explicit string and not an explicit boolean.
Even if the user is not mixing yes/true
usage, if they forget to use consistent capitalization it will still be an issue, since by YAML 1.1 rules, Yes
and yes
are equivalent (both evaluate to boolean True), but by YAML 1.2 rules they are not (both are explicit non-equal strings)
Basically, yes, modifying DVC's YAML serializer to include the dot when writing floats in scientific notation will close this particular ticket. But this is not an actual solution to the underlying problem, which is that YAML 1.1 and 1.2 are two different things (that differ in more ways than just "serializing scientific notation") and that if we try to support both at the same time, there will be more tickets just like this one that get raised in the future.
I can't even find a YAML 1.2 parser for R, for example.
But R does have more than one JSON parser, which is conveniently another supported params/metrics format in DVC, and does not have any of the version/inconsistency issues you get with YAML.
On the Q of scientific notation, do we actually expect users to need that? If not and if not allowing that would reduce the UX problems, I think we should consider it.
FYI, I updated dvc-get-started
to use ruamel.yaml
. The older example-get-started
still uses PyYAML
.
If a user that is using YAML 1.1 enters
--set-param foo=1e-06
we will parse it as a YAML 1.2 float. On serialization (with the proposed change), we will we change it tofoo: 1.0e-06
to keep backwards compatibility with YAML 1.1's float syntax. Except that what we really just did was break the user's YAML 1.1 file by changing an explicit string to an explicit floating point number
I think at some point DVC has to accept that it cannot always avoid being opinionated. It's probably fair to assume that 1e-06
/1.0e-06
/0.000001
should all always be interpreted as floats irrespective of YAML version (I do realise that may be hard to reliably implement).
We could have --set-param-str foo=1e-06
to turn off auto typecasting?
I agree that it is urgent yet. It is not hard to debug, even for new machine learning Engineers, the exception is not raised from dvc but from users script.
To summarize our meeting, there are two approaches:
1) Create a hack that will make dvc write 1.1 compatible numbers for that edge case. We could invest some time into it to research further, but so far:
2) Stay consistent with 1.2, but add more warnings in docs (e.g. get started)
Considering that dvc behavior is consistent and documented, 2) looks wiser right now, but we could wait and collect more info from users.
not hard to debug ... the exception is not raised from dvc but from users script
I'd argue that it's difficult to degug in the sense that understanding why your code which previously worked now has some bug after using dvc exp
. It looks as it DVC broke it. E.g. in this particular case the traceback TypeError: '<=' not supported between instances of 'float' and 'str'
isn't a very clear indication of the root cause. And what if the code isn't even yours, but some library you have no intention or ability to modify?
That said have users reported this problem? Otherwise maybe it's not very common.
- Create a hack that will make dvc write 1.1 compatible numbers for that edge case.
What about
Assuming it's feasible to implement, it keeps DVC agnostic, is good for the user, is not a hack, and doesn't encourage 1.1.
As far as I can tell, here’s everything a user needs to know to debug this error:
This breaks down into two problems:
I'd argue that it's difficult to degug in the sense that understanding why your code which previously worked now has some bug after using
dvc exp
. It looks as it DVC broke it.
Again I would like to stress here that the user's code "worked" before using dvc exp
only in the sense that they would successfully parse their params file using a YAML 1.1 parser. However, their "working" code was already broken/incompatible with DVC in the sense that parsing things as YAML 1.1 will cause the unexpected pipeline behavior explained in https://github.com/iterative/dvc/issues/5971#issuecomment-836041272
there are already a couple of other issues where parsing causes other issues (#5777 and #5868).
I wouldn't consider either of those related to this issue
--set-param
, putting an ISO formatted date directly into params.yaml
also breaks DVC
- Detect the version of the YAML file
This is essentially impossible
I think one part of the problem is the difference between YAML 1.1 and YAML 1.2 looks minuscule from the user POV.
It should have been called YAML 2 if it's backward incompatible. I feel bad when technically half-baked formats become so popular. Markdown is another such case.
Is there a way to support both YAML 1.1 and 1.2, in all .yaml
files, including dvc.yaml
and dvc.lock
? What are the edge cases?
Supposing, hypothetically, we create another markup language compatible with both versions, what are the ambiguous cases to decide? Can we discuss and document them one-by-one? Is it feasible?
5777 isn't a parsing issue, it's a "DVC support for Python params files is bad & incomplete" issue
For #5868, Pawel already noted that the bug is not specific to parsing with
--set-param
, putting an ISO formatted date directly intoparams.yaml
also breaks DVC
They are not specific to --set-param
, but I thought they were about parsing. I thought that all of these issues occur because DVC parses the provided values, potentially changing their representation in the process, rather than taking the values as string literals. From what I read:
1e-8
to 1e-08
, which causes it not to replace 1e-8
in that file.2021-01-01
as a non-primitive (date
) in instead of reading in the literal 2021-01-01
.Am I misunderstanding?
There's nothing inherently wrong about the current parsing implementation, and it has value (like sorting experiments and deduplicating params values with different formatting). I just wanted to clarify how I see these issues as related, and that it is nonobvious to users that DVC doesn't replace the text as is, which can lead to unexpected results.
I feel bad when technically half-baked formats become so popular.
being half-baked is a feature not a bug :) Same reason why Esperanto probably won't manage to replace English as lingua franca
being half-baked is a feature not a bug :) Same reason why Esperanto probably won't manage to replace English as lingua franca
Then there is no reason to perpetuate this feature in DVC maybe :) Natural languages are inherently half-baked, because our understanding of the world we live in is half-baked, but let's not dive into this discussion here :)
If it's possible to cover differences between 1.1 and 1.2 in a sensible manner within a feasible update, it may be better not to leave some users in buggy cold hands of YAML 1.1. Otherwise, let them learn to upgrade their code. I just would like to know how much work would be needed to cover the edge cases before making my mind.
To summarize the problem is the perceived UX. It can feel as if DVC is breaking your code and it's unclear what actually happened from the traceback. I also agree not to support YAML 1.1
But if there's no way to prevent or catch these situations in order to present a clear warning or error message, then the only option left is to be very emphatic in all the related docs *and* in the command help outputs that we hate DVC is not compatible with YAML 1.1
Create a hack that will make dvc write 1.1 compatible numbers for that edge case. We could invest some time into it to research further, but so far:
@efiop how hard it would be to implement this hack?
PS: I'm wondering about this issue since in some projects the transition to yaml 2.0 might be not as straightforward as in our simple code examples. Also, some of the arguments above assume the user's code is written in Python which might be also not correct in some cases (OpenCV library uses yaml 1.0, not even 1.1.).
Another occurrence of this issue and a nice summary of the problem:
OK, I think I understand now. The conclusion is that the
params.yaml
you produce requires a YAML 1.2 compatible reader. And PyYAML is not, which is what I use in my code currently. Switching to ruaml.yaml will fix it. Got it :-)
Originally posted by @aschuh-hf in https://github.com/iterative/dvc/issues/8466#issuecomment-1290757564
BugReportdvc exp run --set-param lr=0.000001
Here's the error:
Description
The problem is that the
yaml
library only supports yaml 1.1 when DVC uses yaml1.2. This issue seems to happen when exponents are generated.The data science community and the Python ecosystem in general is stuck in YAML1.1. We could try to be compatible by generating exponents with the dot at least, that way we could be compatible on both sides. I am not sure how straightforward the fix this will be though. (and this should only affect
--set-param
, not the dvc.lock and yaml files).Reproduce
from ruamel.yaml import YAML
withimport yaml
yaml=YAML(typ='safe') params = yaml.load(f)
withparams = yaml.safe_load(f)
lr
in params.yaml should be updated to an exponent and you get the error aboveExpected
Running
dvc exp run --set-param lr=0.000001
should start running the experiment with the newlr
Environment information
Output of
dvc doctor
: