tafia / quick-xml

Rust high performance xml reader and writer
MIT License

Help deserialize mixed tags and string in body $value (html text formatting) #257

Open Rudo2204 opened 3 years ago

Rudo2204 commented 3 years ago

I'm trying to deserialize some dictionary definitions and came across this one, which mixes several tags with plain text (HTML text formatting).

<div style="margin-left:2em"><b>1</b> 〔学業・技術などの能力判定〕 an examination; a test; 《口》 an exam; 《米》 a quiz 《<i>pl</i>. quizzes》.</div>

I looked around in the serde-xml-rs tests and tried this solution, which seems close but doesn't quite work:

#[derive(Debug, Deserialize, PartialEq)]
struct DivDefinition {
    style: String,
    #[serde(rename = "$value")]
    definition: Vec<MyEnum>,
}

#[derive(Debug, Deserialize, PartialEq)]
enum MyEnum {
    b(String),
    #[serde(rename = "$value")]
    String,
    i(String),
}

The error I'm getting is:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom("unknown variant `〔学業・技術などの能力判定〕 an examination; a test; 《口》 an exam; 《米》 a quiz 《`, expected one of `b`, `$value`, `i`")'

I can make it work for now by dropping MyEnum and using definition: Vec<String>, but then I wouldn't know which text is bold and which is italic. How can I properly deserialize this?

dralley commented 1 year ago

Whoever picks this up, consider starting from https://github.com/tafia/quick-xml/pull/511

lkirkwood commented 1 year ago

Has anybody found a workaround for this? I am having the same issue.

lkirkwood commented 9 months ago

You can close this. Don't know when it was fixed but the original example works now with minor edits:

#[derive(Debug, Deserialize, PartialEq)]
struct DivDefinition {
    #[serde(rename = "@style")]
    style: String,
    #[serde(rename = "$value")]
    definition: Vec<MyEnum>,
}

#[derive(Debug, Deserialize, PartialEq)]
enum MyEnum {
    b(String),
    #[serde(rename = "$text")]
    String,
    i(String),
}
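
For readers who want to see the shape the original question is after, here is a minimal std-only sketch. It is not how quick-xml deserializes anything: `Inline` and `split_inline` are hypothetical names, and the splitter only handles flat, un-nested `<b>` and `<i>` runs. A unit variant such as `String` in the snippet above has no payload, so the sketch uses a newtype variant to keep the plain text as well.

```rust
/// One run of the mixed content. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
enum Inline {
    B(String),
    I(String),
    Text(String),
}

/// Splits flat mixed content into bold, italic and plain-text runs.
/// Assumes `<b>` and `<i>` are never nested and carry no attributes.
fn split_inline(mut rest: &str) -> Vec<Inline> {
    let mut out = Vec::new();
    while !rest.is_empty() {
        // Find whichever of the two opening tags comes first, if any.
        let next = [("<b>", "</b>"), ("<i>", "</i>")]
            .iter()
            .filter_map(|&(open, close)| rest.find(open).map(|at| (at, open, close)))
            .min_by_key(|&(at, _, _)| at);
        let Some((at, open, close)) = next else {
            // No more tags: everything left is plain text.
            out.push(Inline::Text(rest.to_string()));
            break;
        };
        if at > 0 {
            out.push(Inline::Text(rest[..at].to_string()));
        }
        let body = &rest[at + open.len()..];
        // An unclosed tag swallows the remainder of the input.
        let end = body.find(close).unwrap_or(body.len());
        let text = body[..end].to_string();
        out.push(if open == "<b>" { Inline::B(text) } else { Inline::I(text) });
        rest = body.get(end + close.len()..).unwrap_or("");
    }
    out
}

fn main() {
    let body = "<b>1</b> an examination; a quiz 《<i>pl</i>. quizzes》.";
    for part in split_inline(body) {
        println!("{part:?}");
    }
}
```

With quick-xml itself, the equivalent would presumably be a newtype variant renamed to `$text`, but that is an assumption this thread does not confirm.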

enricozb commented 8 months ago

Thoughts on this idea? https://github.com/enricozb/quick-xml/commit/7b4b3f851a50ae9dbb45d54edfdc7c2374ec59d0

Specifically, I'm adding a new special field name $raw that can only deserialize into a String, and just writes all events, until the expected end event, into a string.

It lets you do stuff like this:

const xml: &str = r#"
  <who-cares>
    <foo property="value">
      test
      <bar><bii/><int>1</int></bar>
      test
      <baz/>
    </foo>
  </who-cares>
"#;

#[derive(Deserialize, Debug)]
struct Root {
  #[serde(rename = "$raw")]
  value: String,
}

let root = quick_xml::de::from_str::<Root>(&xml).unwrap();

println!("parsed: {root:?}");

This prints

parsed: Root { value: "<foo property=\"value\">test<bar><bii></bii><int>1</int></bar>test<baz></baz></foo>" }

One problem with this approach is that it doesn't save exactly what was in the XML file. Saving the exact input would be ideal, because we could likely avoid any allocations, as serde_json::value::RawValue does, and we would preserve formatting and not trim spaces.

Another issue is that empty tags <bii/> get converted to <bii></bii> as that is how the events come in.

It's possible my initial idea could be fixed up to temporarily disable the reader's trimming while raw_string is in use.
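
To illustrate why capturing a slice of the input avoids both problems, here is a std-only sketch. `raw_inner` is a hypothetical helper, not part of quick-xml; it assumes well-formed input, a tag that is not self-closing, no other tag whose name starts with `tag`, and no `>` inside attribute values.

```rust
/// Returns everything between the first `<tag ...>` and the last `</tag>`
/// as a slice borrowed from `xml`, so nothing is allocated for the result
/// and nothing is re-serialized. Hypothetical helper, see assumptions above.
fn raw_inner<'a>(xml: &'a str, tag: &str) -> Option<&'a str> {
    let open = format!("<{tag}");
    let close = format!("</{tag}>");
    let start = xml.find(&open)?;
    // Skip to the end of the opening tag, past any attributes.
    let body_start = start + xml[start..].find('>')? + 1;
    let body_end = xml.rfind(&close)?;
    // `get` returns None if the close tag precedes the open tag.
    xml.get(body_start..body_end)
}

fn main() {
    let xml = r#"<who-cares>
    <foo property="value">
      test
      <bar><bii/><int>1</int></bar>
    </foo>
  </who-cares>"#;
    // The slice is byte-for-byte what was in the input, so `<bii/>` and
    // the original whitespace are kept as written.
    let raw = raw_inner(xml, "who-cares").unwrap();
    println!("{raw}");
}
```

Because the result borrows from the input, this only works when the whole document is already in memory as a string.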

Mingun commented 8 months ago

Deserialization of RawValue in serde_json is implemented as deserialization of a newtype with a special name: https://github.com/serde-rs/json/blob/0131ac68212e8094bd14ee618587d731b4f9a68b/src/de.rs#L1711-L1724

The deserializer then returns data from its own buffer or directly from the input string, depending on which type is being deserialized (Box<RawValue> or &RawValue). We can do the same because we have read_text, but right now only for the borrowing reader. We need to implement #483 in order to implement read_text_into, which is needed for the owned reader.

enricozb commented 8 months ago

Got it. I saw that private newtype name, but wasn't sure why it mattered. I see now that the json deserializer looks for this tag. I'll take a stab at this.

enricozb commented 8 months ago

Additionally, I'm not sure if we should capture the surrounding tags or not. What should this print:

struct AnyName {
  root: RawValue,
}

const xml: &str = "
  <root>
    <some/><inner/><tags/>
  </root>
";

let x: AnyName = from_str(xml)?;

println!("{}", x.root);

Should this print

<root>
  <some/><inner/><tags/>
</root>

or

<some/><inner/><tags/>
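
The two candidate answers can be made concrete with a small std-only sketch. `Captured` and `capture` are hypothetical names, not quick-xml API, and the code assumes the tag has no attributes and is not self-closing.

```rust
/// Both candidate captures for one element, borrowed from the input.
#[derive(Debug, PartialEq)]
struct Captured<'a> {
    /// Includes the surrounding `<tag>` and `</tag>`.
    outer: &'a str,
    /// Only what sits between them.
    inner: &'a str,
}

/// Hypothetical helper: locates `<tag>` ... `</tag>` and returns both views.
fn capture<'a>(xml: &'a str, tag: &str) -> Option<Captured<'a>> {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");
    let start = xml.find(&open)?;
    let end = xml.rfind(&close)?;
    Some(Captured {
        outer: xml.get(start..end + close.len())?,
        inner: xml.get(start + open.len()..end)?,
    })
}

fn main() {
    let xml = "\n  <root>\n    <some/><inner/><tags/>\n  </root>\n";
    let c = capture(xml, "root").unwrap();
    println!("outer: {:?}", c.outer);
    println!("inner: {:?}", c.inner);
}
```

The outer form is a complete XML fragment on its own; the inner form drops the element name and any attributes of the enclosing tag.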

NuSkooler commented 1 month ago

Hi, I'm trying to track down a way to de-serialize unknown/arbitrary data under a specific tag and found my way here. Is this currently possible in any form?

I have something like this:

<root>
  <someTag> <!-- I am only aware of this tag -->
    <arbitraryTag1>
      <arbitraryTag2>...stuff...</arbitraryTag2>
      <anotherArbitraryTag>foo</anotherArbitraryTag>
    </arbitraryTag1>
  </someTag>
</root>

I simply need everything under someTag as a HashMap<String,String> ideally.

Mingun commented 1 month ago

If ...stuff... and foo contain only textual data, CDATA sections, comments (which would be skipped) and processing instructions (also skipped), then I think it should be possible today. If they can contain markup (i.e. nested tags), then you cannot read them into a String.

NuSkooler commented 1 month ago

@Mingun Thanks for the quick reply! I updated my example, it was missing some data.

Basically, under someTag, there is a nested structure starting with arbitraryTag1, but always key-value tags from there. I'd like to capture the name of arbitraryTag1 in some way, and HashMap<String, String> for the key-values.

Mingun commented 1 month ago

So in your example you expect HashMap with

Both are impossible right now: the first because we cannot capture markup into a String, the second because we (probably) cannot capture a tag name as a value (there is a separate issue for that: #778).

NuSkooler commented 1 month ago

@Mingun thanks, the 2nd example is what I'm after.

Can you think of any workarounds?

NuSkooler commented 1 month ago

@Mingun Apologies for the "bump", I'm trying to determine where this stands exactly. #778 mentions something works, but I can't find it.

Ideally, I'm after the ability to capture arbitrary nested XML, similar to what a HashMap<String, serde_json::Value> can achieve with JSON (in fact, I need to turn the result into JSON afterwards).

I'm not 100% clear if this is the correct ticket, #778, or something else.

Thanks again!

Mingun commented 1 month ago

In #383, @alex-semov's initial post includes code that looks like what you need. Try experimenting with it. If you don't have to extract the attributes from <arbitraryTag1>, it looks like it works.

NuSkooler commented 3 weeks ago

> In #383 @alex-semov in the initial post gave a code that looks like what you need. Try experimenting with it. If you don't have to extract the attributes from <arbitraryTag1>, then it looks like it works.

Unfortunately we need to extract/convert arbitrary XML into a JSON representation in our case. Something like:

<xml>
  <foo><bar>123</bar></foo>
  <foobar someattr="thing"/>
  <bazfoo anotherattr="stuff">bazzle</bazfoo>
</xml>

to

{
  "foo": {
    "bar": 123
  },
  "foobar": {
    "@someattr": "thing"
  },
  "bazfoo": {
    "@anotherattr": "stuff",
    "@value": "bazzle"
  }
}

The JSON structure is just an example; we only need some way to do the conversion.
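
As a rough illustration of the mapping in the example above, here is a std-only sketch that turns an already-parsed element tree into compact JSON text. `Element`, `el` and `to_json` are hypothetical names; filling the tree from real XML (for instance from a reader's event loop) is left out. It assumes child names are unique within a parent, and it escapes only quotes and backslashes.

```rust
/// Hypothetical in-memory element; a real program would fill this from a parser.
struct Element {
    name: String,
    attrs: Vec<(String, String)>,
    text: String,
    children: Vec<Element>,
}

/// Convenience constructor for the example tree.
fn el(name: &str, attrs: &[(&str, &str)], text: &str, children: Vec<Element>) -> Element {
    Element {
        name: name.to_string(),
        attrs: attrs.iter().map(|&(k, v)| (k.to_string(), v.to_string())).collect(),
        text: text.to_string(),
        children,
    }
}

/// Minimal JSON string quoting (quotes and backslashes only).
fn quote(s: &str) -> String {
    format!("\"{}\"", s.replace('\\', "\\\\").replace('"', "\\\""))
}

/// Text that parses as an integer becomes a JSON number, anything else a string.
fn scalar(text: &str) -> String {
    match text.trim().parse::<i64>() {
        Ok(n) => n.to_string(),
        Err(_) => quote(text),
    }
}

fn to_json(el: &Element) -> String {
    // A leaf without attributes collapses to a bare scalar.
    if el.attrs.is_empty() && el.children.is_empty() {
        return scalar(&el.text);
    }
    let mut fields = Vec::new();
    // Attributes become "@name" keys.
    for (k, v) in &el.attrs {
        fields.push(format!("{}:{}", quote(&format!("@{k}")), quote(v)));
    }
    // Text next to attributes or children goes under "@value".
    if !el.text.trim().is_empty() {
        fields.push(format!("\"@value\":{}", scalar(&el.text)));
    }
    for child in &el.children {
        fields.push(format!("{}:{}", quote(&child.name), to_json(child)));
    }
    format!("{{{}}}", fields.join(","))
}

fn main() {
    let doc = el("xml", &[], "", vec![
        el("foo", &[], "", vec![el("bar", &[], "123", vec![])]),
        el("foobar", &[("someattr", "thing")], "", vec![]),
        el("bazfoo", &[("anotherattr", "stuff")], "bazzle", vec![]),
    ]);
    println!("{}", to_json(&doc));
}
```

Repeated sibling tags would need to become JSON arrays, which this sketch does not attempt.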