tafia / quick-xml

Rust high performance xml reader and writer
MIT License
1.22k stars 237 forks source link

Add normalize_line_end for unescape and test #807

Open yorkz1994 opened 1 month ago

yorkz1994 commented 1 month ago

Regarding #806 I added a normalize_line_end function in escape module and related tests. If unescape function is called, then line end will be normalized.

yorkz1994 commented 1 month ago

I think the failed test is because we did the line end normalization after the unescape, but unescape can contain \r, so these new \rs should not be normalized. We have to do the normalization only for data in the not unescape parts.

codecov-commenter commented 1 month ago

:warning: Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 60.36%. Comparing base (3ebe221) to head (d3ee1ad). Report is 3 commits behind head on master.

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #807 +/- ## ========================================== + Coverage 60.17% 60.36% +0.18% ========================================== Files 41 41 Lines 15958 15980 +22 ========================================== + Hits 9603 9646 +43 + Misses 6355 6334 -21 ``` | [Flag](https://app.codecov.io/gh/tafia/quick-xml/pull/807/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/tafia/quick-xml/pull/807/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None) | `60.36% <100.00%> (+0.18%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

yorkz1994 commented 1 month ago

Good. But I would also like to see the tests when use API of Attribute and BytesText, which are listed here.

Sorry, don't know what do you want me to do. I am not familiar with XML specification on attributes. What I did is just do normalization of the line end before your further unescape work. So it removes \r before unescape means it is just like no such character in your input at all, so your previous logic should continue to work. Now all your previous cases passed and the added case for testing normalization also works. I don't see anything I can do.

Mingun commented 1 month ago

I like just see tests that will call unescape family of methods on Attribute and BytesText which is get from the reader and check that the result does not contain \r. Something like:

// XML with \n \r\n and \r style newlines in various places
const XML: &str = "...";

let mut reader = Reader::from_str(XML);

match reader.read_event().unwrap() {
  Event::Start(event) => {
    let iter = event.attributes();
    let a = iter.next().unwrap();
    #[cfg(not(feature = "encoding"))]
    assert_eq!(a.unescape_value(), "...");
    assert_eq!(a.decode_and_unescape_value(), "...");
  }
  event => panic!("Expected Start, found {:?}", event),
}

match reader.read_event().unwrap() {
  Event::Text(event) => assert_eq!(event.unescape(), "..."),
  event => panic!("Expected Text, found {:?}", event),
}
yorkz1994 commented 1 month ago

Like I said, I am not familiar with XML specification, so I don't know where is the proper places to put \r in it and what should be expected after unescape if one appears. For example is it valid to put a \r in a attribute key or value, in tag name? If you know better than me how about you try it yourself.

Mingun commented 1 month ago

Just add it everywhere where spaces can occur, we are not not talking about correctness for now (this is another question, we are definitely do not process everything according to the specification). I only want to have a starting point and be sure that this feature worked as assumed when you would use actual API.

yorkz1994 commented 1 month ago

Could you provide such input data. I don't want to create such invalid XML. Normally I can think is that the \r will only appear in BytesText due to line end diffs in OS.

Mingun commented 1 month ago
<root attribute="\r\r\n\nvalue1\r\r\n\nvalue2\r\r\n\n">\r\r\n\nvalue3\r\r\n\nvalue4\r\r\n\n</root>
yorkz1994 commented 1 month ago

f970370 This commit is due to forgot to normalize if input has nothing to unescape.

I add a case from your input, don't know the result is your expected:

#[test]
fn line_ends() {
    const XML: &str = "<root attribute=\"\r\r\n\nvalue1\r\r\n\nvalue2\r\r\n\n\">\r\r\n\nvalue3\r\r\n\nvalue4\r\r\n\n</root>";
    let mut reader = Reader::from_str(XML);
    match reader.read_event().unwrap() {
        Event::Start(event) => {
            let mut iter = event.attributes();
            let a = iter.next().unwrap().unwrap();
            #[cfg(not(feature = "encoding"))]
            assert_eq!(
                a.unescape_value().unwrap(),
                "\n\n\nvalue1\n\n\nvalue2\n\n\n"
            );
            assert_eq!(
                a.decode_and_unescape_value(reader.decoder()).unwrap(),
                "\n\n\nvalue1\n\n\nvalue2\n\n\n"
            );
        }
        event => panic!("Expected Start, found {:?}", event),
    }
    match reader.read_event().unwrap() {
        Event::Text(event) => {
            assert_eq!(event.unescape().unwrap(), "\n\n\nvalue3\n\n\nvalue4\n\n\n")
        }
        event => panic!("Expected Text, found {:?}", event),
    }
}

There is a case in serde-se.rs always fail. Don't know if we should modify serde implementation?

serialize_as!(tuple:
        // Use to_string() to get owned type that is required for deserialization
        ("<\"&'>".to_string(), "with\t\r\n spaces", 3usize)
        => "<root>&lt;\"&amp;'&gt;</root>\
            <root>with\t\r\n spaces</root>\
            <root>3</root>");
---- with_root::tuple stdout ----
thread 'with_root::tuple' panicked at tests/serde-se.rs:1956:5:
deserialization roundtrip: Custom("invalid type: string \"with\\t\\n spaces\", expected a borrowed string")
yorkz1994 commented 1 month ago

The failed test is because there is \r in the roundtrip test. It is impossible to have equal serialize and deserialize due to line end normalization. Therefore I have to remove the \r in roundtrip test to get the case pass.

yorkz1994 commented 1 month ago

I did more line ends normalization. Also changed the normalization implementation to use iterator to avoid extra allocation during normalization.