time-rs / time

The most used Rust library for date and time handling.
https://time-rs.github.io
Apache License 2.0
1.09k stars 273 forks source link

parse with FormatItem::First fails early on UnexpectedTrailingCharacters #566

Closed tristan closed 1 year ago

tristan commented 1 year ago

With the following FormatItem::First setup, an UnexpectedTrailingCharacters is returned due to the 2nd item parsing successfully but having remaining characters, even though the 3rd item would parse successfully.

use time::{format_description::FormatItem, macros::format_description, PrimitiveDateTime};

const TIMESTAMP_FORMAT_1: &[FormatItem] =
    format_description!("[year]-[month]-[day] [hour padding:none]:[minute]:[second]");

const TIMESTAMP_FORMAT_2: &[FormatItem] =
    format_description!("[day]/[month]/[year] [hour padding:none]:[minute]:[second]");

const TIMESTAMP_FORMAT_3: &[FormatItem] = format_description!(
    "[month padding:none]/[day padding:none]/[year] [hour padding:none repr:12]:[minute]:[second] [period]"
);

const TIMESTAMP_PARSER: FormatItem = FormatItem::First(&[
    FormatItem::Compound(TIMESTAMP_FORMAT_1),
    FormatItem::Compound(TIMESTAMP_FORMAT_2),
    FormatItem::Compound(TIMESTAMP_FORMAT_3),
]);

fn main() {
    let input = "12/12/2022 10:20:29 AM";
    dbg!(PrimitiveDateTime::parse(input, &TIMESTAMP_PARSER));
}

Link to playground

My expectation using this would use the 3rd item and ignore the UnexpectedTrailingCharacters error unless the 3rd item also failed to parse.

The simple solution for my specific case is to flip the 2nd and 3rd items, since they're unique enough for my use case. But I thought it worth reporting in case it is a bug and not the expected behaviour, or if i'm doing something wrong/unexpected with the api.

jhpratt commented 1 year ago

Thanks for the report. I believe your analysis of why the error returned is correct. This may be tricky to handle, as it's two separate phases of parsing. The only thing that immediately comes to mind is if characters are remaining, further options could be tried to see if any succeed without characters remaining.

jhpratt commented 1 year ago

If I add a modifier try_complete:true/try_complete:false, with the default being false, would you view that as a reasonable solution?

The current behavior is false — the first match is returned no matter what. If it's true instead, then the parser would try to find the first match that has no input remaining, falling back to the first overall match otherwise.

tristan commented 1 year ago

It's not clear from your message where the modifier would sit. I assume in one of the format_description! entries, but it's not clear which one, or where it would sit in the description. It could make sense to change TIMESTAMP_FORMAT_2 to know it needs to consume all input, otherwise fail (kinda like adding $ at the end of a regex), so I expect you mean to make my example problem work by changing that line to

const TIMESTAMP_FORMAT_2: &[FormatItem] =
    format_description!("[day]/[month]/[year] [hour padding:none]:[minute]:[second][try_complete:true]");

or

const TIMESTAMP_FORMAT_2: &[FormatItem] =
    format_description!("[day]/[month]/[year] [hour padding:none]:[minute]:[second try_complete:true]");

?

In my opinion this feels a bit over complicated, and still wouldn't stop people from making the same mistake without having clear documentation about this limitation of FormatItem::First and that the try_complete:true modifier should be used to circumvent it. It would also be tricky to use in cases where you want to use existing format descriptions in a FormatItem::First list without recreating them. e.g. the well_known items? though my hypothetical use cases are likely a bit of a stretch, but that also makes me feel like it's a bit much to introduce a new, and likely complex to implement, modifier for such an edge case.

Rather, I think it's important to think about the difference of using FormatItem::First with <Parsable>::parse vs Parsed::parse_item. With Parsed::parse_item it makes sense that it returns the first viable result, even if another option could consume more input, but for <Parsable>::parse, having remaining input is an error, so it would make more sense to return the first non-error result in that case. This would mean that making <Parsable>::parse work as (at least I would assume) intended, it would overload the meaning of FormatItem::First depending on the context in which you use it. This is also already true for the other FormatItem variants, though it's more subtle for them. So I'd say it's a question of if this overloading is undesirable enough to resist having special handling in the <Parsable>::parse implementation to handle FormatItem::First differently?

I had a go at implementing a fix that uses special behaviour inside <Parsable>::parse for the FormatItem::First (and OwnedFormatItem::First) variants here: https://github.com/tristan/time/commit/00e6d8bdbcd22a507e92efcd15e552639fc27eeb. If I trust the test suite it hasn't broken anything. If you deem it a workable solution I'm happy to make it a PR and address any structure/naming/testing points.

Otherwise it might be enough to simply add some documentation about the limitations of using FormatItem::First with <Parsable>::parse so that it's obvious for anyone stumbling across it (like i did) that ordering the items to remove the potential for premature UnexpectedTrailingCharacters is important.

jhpratt commented 1 year ago

e.g. the well_known items?

That's actually impossible, as they are their own types.

This would mean that making <Parsable>::parse work as (at least I would assume) intended

The current behavior is what was intended. Parsing is commonly performed on input that has text both before and after the value, and that is what I intended to facilitate.

This is also already true for the other FormatItem variants, though it's more subtle for them.

Only to the extent that the behavior is specified for formatting/parsing. There is no change of behavior depending on which method is used to invoke it.

Otherwise it might be enough to simply add some documentation about the limitations of using FormatItem::First with <Parsable>::parse so that it's obvious for anyone stumbling across it (like i did) that ordering the items to remove the potential for premature UnexpectedTrailingCharacters is important.

This can be done regardless of any changes elsewhere.


My original thinking was to have it be a modifier on [first]. Thinking about it again, I see an alternate path forward that's more flexible.

All that has to be done is introduce a component that indicates it's the end of the input. That, by itself, would permit the various use cases. The primary question I have is what it should be named. It will be quite easy to implement — it needs a new variant on error::ParseFromDescription and a declaration of the component itself.

tristan commented 1 year ago

I'm nowhere invested or opinionated enough to really push for one direction or the other :sweat_smile:. I wouldn't have had anything to say at all had I not been the one who posted the issue. And I'm certainly not familiar enough with all the possible use cases, so you wont get any flak from me for doing what you think is best. but I'll try give a view from a casual user of the API (i.e. someone who generally just doing datetime parsing of plain strings).

The current behavior is what was intended. Parsing is commonly performed on input that has text both before and after the value, and that is what I intended to facilitate.

I struggle a bit with this in the context of <Parsable>::parse (or maybe I should be more specific and say <parsable::sealed::Sealed>::parse, since there is not actually a direct <Parsable>::parse, or even PrimitiveDateTime::parse or OffsetDateTime::parse since those are my usual entry points to the API, but i'll continue to use <Parsable>::parse for consistency and brevity) . For example, if I take something like:

PrimitiveDateTime::parse(
    "2021-01-02 00:00",
    &FormatItem::Compound(&fd::parse("[year]-[month]-[day] [hour]:[minute]")?),
);

and try to add anything to either the front or the back of the input string, I'll get a Result::Err, with no way to receive a potential result contained within. And as far as I can tell (based on some not so thorough grepping), the current implementation of <Parsable>::parse will always fail if there is extra text on either side of the input. I'm probably missing something, as these DateTime types use <Parsable>::parse_date_time which might have different expected behaviour with respect to extra input? though in that case i might expect the UnexpectedTrailingCharacters to come from that function, or the try_into() for the specific type, rather than the <Parsable>::parse function (but that also difficult since the information about how many characters have been consumed is not exposed in the <Parsable>::parse api.

But from a user perspective, things still work as expected in most cases PrimitiveDateTime::parse (and the other types that have a parse method) are failing if there is unexpected characters in the input, which fits with the experience of other APIs from other languages. It's only the case with FormatItem::First, which is already quite a unique API to have and maybe wasn't originally designed to be used in contexts like i've used (it was quite by chance I found it and decided to try use it vs manually looping over my possible cases to find the first one that matched), where it's not quite behaving as I expected. If an extra modifier to specify that a format description must consume the whole input existed when I came across the issue, I probably would have used it and moved on. But that is also assuming I found reference to the modifier before I found that rearranging the options also fixed it. I wouldn't immediately think that there would exist something to mark that a format description should consume the whole input, since with my simplistic use case, and my experience with other datetime parsing libraries in other languages I always expect the whole string to match.

jhpratt commented 1 year ago

An [end] component has been introduced, signaling the end of input. It will fail if there is any input remaining. For the example in your original issue, if you append [end] to each of the format_description! invocations, it will output what you're looking for.