martinblech / xmltodict

Python module that makes working with XML feel like you are working with JSON
MIT License

quoted string value is quoted again #308

Closed adriansev closed 2 years ago

adriansev commented 2 years ago

Hi! I just found this solution for translating XML to a dict and it is amazing!! Thanks a lot! A tiny problem that I encountered is that a value like <lp>"/"</lp> is translated to "lp": "\"/\""

Is there any way to not quote the string and just use it as is? Thanks a lot!

horvatha commented 2 years ago

Since the value / is different from "/", the resulting dict will be different too. I don't think it should be the package's task to handle your special need, but nothing prevents you from removing the quotation marks yourself:

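def remove_quotation(path):
    # Strip one pair of surrounding double quotes; the length check
    # avoids an IndexError on '' and keeps a lone '"' unchanged.
    if len(path) >= 2 and path[0] == path[-1] == '"':
        return path[1:-1]
    return path

adriansev commented 2 years ago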

Sorry that I was not clear. I did not mean removing the quotation, but the double quotation and escaping, i.e.:
situation now: <lp>"/"</lp> --> "lp": "\"/\""
what IMHO should happen: <lp>"/"</lp> --> "lp": "/"
Is this wrong? Thanks a lot!

horvatha commented 2 years ago

I'm only a user of the package, and the above is just my opinion, but I still don't understand the general use case you expect the package to cover. Seeing your awesome achievements on GitHub, I'm sure you know that in Python "/" is the string representation of the one-character string / and "\"/\"" is the representation of the three-character string "/" (another representation is '"/"'), so applying my function to the second gives you the first:

remove_quotation("\"/\"")   # returns "/" (the one-character string /)
print("\"/\"")   # prints "/"
print("/")   # prints /

But I may be missing the point, so you may want to wait for an answer from somebody who is smarter and/or more involved in the package's development.
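To make the difference between a string and its printed representation concrete, here is a minimal sketch using only the standard library. The variable names are made up for illustration; the point is that the backslashes appear only when the value is rendered as JSON (or as a Python literal), not in the string itself.

```python
import json

bare = "/"          # one character: /
quoted = "\"/\""    # three characters: " / "

# '"/"' is just another spelling of the same three-character string.
same_spelling = (quoted == '"/"')
lengths = (len(bare), len(quoted))

# json.dumps adds the outer quotes and escapes the inner ones for display only.
rendered = json.dumps({"lp": quoted})
print(rendered)        # {"lp": "\"/\""}

# Loading the JSON text back yields the original three characters.
round_tripped = json.loads(rendered)["lp"]
print(round_tripped)   # "/"
```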

SamStephens commented 2 years ago

@adriansev:

Sorry that I was not clear. I did not mean removing the quotation, but the double quotation and escaping, i.e.:
situation now: <lp>"/"</lp> --> "lp": "\"/\""
what IMHO should happen: <lp>"/"</lp> --> "lp": "/"
Is this wrong? Thanks a lot!

I'm with @horvatha here; the behavior you are seeing is correct. I'd expect to see <lp>/</lp> map to { "lp": "/" } as an lp element that contains the string /. And then <lp>"/"</lp> map to { "lp": "\"/\"" }, an lp element containing the string "/". The double quotes inside the lp have no special meaning in XML, they're simply part of the string inside the lp element.
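A quick way to check this is to look at the raw text content of each element. The sketch below uses the standard library's xml.etree.ElementTree purely as a stand-in XML parser (an assumption made so it runs without extra packages; it is not xmltodict itself). Any conforming parser should report the same text.

```python
import xml.etree.ElementTree as ET

# ElementTree is used here only as a stand-in XML parser.
bare = ET.fromstring("<lp>/</lp>").text
quoted = ET.fromstring('<lp>"/"</lp>').text

print(len(bare), bare)      # 1 /
print(len(quoted), quoted)  # 3 "/"

# repr() shows the Python literal; the inner quotes are part of the data.
print(repr(quoted))         # '"/"'
```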

adriansev commented 2 years ago

@SamStephens

While I do agree with you from the point of view of generic processing, my point was about the underlying meaning/context: while a bare string should be quoted (so a bare / should be converted to "/"), in this case the string is already quoted, so it can be used as is. Also, I state all this because I fail to see/imagine a use case where an already quoted string needs to be processed with the quotes included. What would be the purpose of that?

Thanks a lot!

SamStephens commented 2 years ago

@adriansev

i fail to see/imagine a use-case where a already quoted string needs to be processed with quotes included

Any use case in which quotes are a significant part of the data you are processing.

For example, imagine a tool that tokenizes computer code. If I write

print "hello"

It's important that it is tokenised as

<token>print</token>
<token>"hello"</token>

Treating that as

<token>print</token>
<token>hello</token>

Means

print hello

Which is something quite different.
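As a rough sketch of that tokenizer example (the <tokens> wrapper element is hypothetical, added only so the fragment is well-formed, and ElementTree stands in for whatever XML parser you use):

```python
import xml.etree.ElementTree as ET

# Hypothetical <tokens> wrapper so the two <token> elements form one document.
doc = ET.fromstring(
    '<tokens><token>print</token><token>"hello"</token></tokens>'
)
tokens = [t.text for t in doc.findall("token")]

# Keeping the quotes reproduces the original source line.
faithful = " ".join(tokens)
print(faithful)   # print "hello"

# Stripping them silently turns a string literal into a bare name.
stripped = " ".join(t.strip('"') for t in tokens)
print(stripped)   # print hello
```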

adriansev commented 2 years ago

@SamStephens well, despite the somewhat forced use case (IMHO), you are right about keeping the data consistent: if the XML field is "my_string", then in the translation to dict it should be kept as "key": "\"my_string\""

But this is not always the case: if I keep data (not code) in XML, I will want my_dict["key"] to return my_string and not \"my_string\"

In the meantime I can take care of stripping the extraneous \" from the dict values, but maybe (hoping that I'm not the only one needing this) a flag could be added? Something like detect_quotes, with the effect of:
if string_value[0] == string_value[-1] == '"': do_not_add_quotes

Am I really unique in my use case? Thanks a lot!

SamStephens commented 2 years ago

@adriansev the key thing here is that quotes inside tags have no special meaning as far as the XML specification goes. They are just characters, and as far as the XML specification says, they are significant and should be treated as data, not as a container for a string. As far as the XML spec goes, you would represent the dict { "key": "my_string" } as <key>my_string</key>, not <key>"my_string"</key>. The fact you have strings inside your XML that include quotes that are not actually part of the data within that tag is not something an XML parser should handle, as that's not part of the XML specification; that's a behavior of your application that is separate to the XML specification.

As for your request to add a special-case detect_quotes flag to this library, the problem is that it sets a precedent for extending the parser with features outside of the XML specification. What if another user then comes along and says that the data in their XML tags has single quote characters that need to be ignored? Or that the data inside the XML tags is URL encoded, and there should be a url_decode flag to ask the parser to decode URL encoded strings? Suddenly we end up with an explosion of features that are not actually part of the XML specification and that only deal with the idiosyncrasies of the data stored within the XML.

My suggestion to you is to treat this as a feature of the data you are handling, rather than something that should be a feature of an XML parser. Build a method or module to take the output of XML to dict mapping, and remove the quotes from the string values within the dictionary.
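A minimal sketch of such a helper might look like the following. The name strip_quotes and the sample dict are made up for illustration; the dict is a hand-written stand-in for parser output, which I am assuming is nested dicts and lists with string or None leaves.

```python
def strip_quotes(value):
    """Recursively remove one pair of surrounding double quotes from strings."""
    if isinstance(value, dict):
        return {key: strip_quotes(item) for key, item in value.items()}
    if isinstance(value, list):
        return [strip_quotes(item) for item in value]
    if isinstance(value, str) and len(value) >= 2 and value[0] == value[-1] == '"':
        return value[1:-1]
    return value  # None, numbers and unquoted strings are left unchanged


# Hand-written stand-in for the dict a parser might return (an assumption).
parsed = {"root": {"lp": '"/"', "name": "plain", "item": ['"a"', "b", None]}}
cleaned = strip_quotes(parsed)
print(cleaned)
# {'root': {'lp': '/', 'name': 'plain', 'item': ['a', 'b', None]}}
```

This keeps the quote handling in your application, next to the data that needs it. xmltodict.parse also accepts a postprocessor callback that receives (path, key, value), which may be another place to hook the same logic in, though that variant is not shown or tested here.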

adriansev commented 2 years ago

@SamStephens ok, got it, thanks a lot for your time and info!