sanand0 / xmljson

xmljson converts XML into Python dictionary structures (trees, like in JSON) and vice-versa.
MIT License

Node values (text) broken by sub-nodes #38

Open helmersl opened 5 years ago

helmersl commented 5 years ago

Hi,

I'm using xml2json to parse publications and have the following problem:

If the text in the abstract node contains XML tags, the text following these sub-nodes is simply ignored in the JSON output. For example, this XML:

     <Abstract>
          <AbstractText>As photosynthetic prokaryotes, cyanobacteria can directly convert CO<sub>2</sub> to organic compounds and grow rapidly using sunlight as the sole source of energy. The direct biosynthesis of chemicals from CO<sub>2</sub> and sunlight in cyanobacteria is therefore theoretically more attractive than using glucose as carbon source in heterotrophic bacteria. To date, more than 20 different target chemicals have been synthesized from CO<sub>2</sub> in cyanobacteria. However, the yield and productivity of the constructed strains is about 100-fold lower than what can be obtained using heterotrophic bacteria, and only a few products reached the gram level. The main bottleneck in optimizing cyanobacterial cell factories is the relative complexity of the metabolism of photoautotrophic bacteria. In heterotrophic bacteria, energy metabolism is integrated with the carbon metabolism, so that glucose can provide both energy and carbon for the synthesis of target chemicals. By contrast, the energy and carbon metabolism of cyanobacteria are separated. First, solar energy is converted into chemical energy and reducing power via the light reactions of photosynthesis. Subsequently, CO<sub>2</sub> is reduced to organic compounds using this chemical energy and reducing power. Finally, the reduced CO<sub>2</sub> provides the carbon source and chemical energy for the synthesis of target chemicals and cell growth. Consequently, the unique nature of the cyanobacterial energy and carbon metabolism determines the specific metabolic engineering strategies required for these organisms. In this chapter, we will describe the specific characteristics of cyanobacteria regarding their metabolism of carbon and energy, summarize and analyze the specific strategies for the production of chemicals in cyanobacteria, and propose metabolic engineering strategies which may be most suitable for cyanobacteria.</AbstractText>
        </Abstract>

is converted to JSON as:

('Abstract',
 OrderedDict([('AbstractText',
               OrderedDict([('$',
                             'As photosynthetic prokaryotes, cyanobacteria can directly convert CO'),
                            ('sub',
                             [OrderedDict([('$', 2)]),
                              OrderedDict([('$', 2)]),
                              OrderedDict([('$', 2)]),
                              OrderedDict([('$', 2)]),
                              OrderedDict([('$', 2)])])]))]))

Everything after the first <sub> tag is lost due to the sub-tags.

Is there a way to fix this problem?

Thanks! Lea

sanand0 commented 5 years ago

@helmersl What would you like the output to be? Based on that, I can suggest if an alternate convention might help.

However, we may have a bigger problem. The XML <AbstractText>head<sub>text</sub>tail</AbstractText> has the following parts:

  1. tree.tag == 'AbstractText'
  2. tree.text == 'head'
  3. tree.getchildren()[0].tag == 'sub'
  4. tree.getchildren()[0].text == 'text'
  5. tree.getchildren()[0].tail == 'tail'

The last bit -- the "tail" -- is not converted to JSON in any of the conventions I know of. If we need to preserve that, we'll need to research a bit.
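The five parts listed above can be checked with a short script. This is a minimal sketch using the standard library's xml.etree.ElementTree rather than xmljson itself; it uses list(tree) in place of getchildren(), which was removed from xml.etree in Python 3.9. The names child and parts are only for illustration.

```python
import xml.etree.ElementTree as ET

tree = ET.fromstring('<AbstractText>head<sub>text</sub>tail</AbstractText>')

# list(tree) yields the child elements; getchildren() no longer exists
# on xml.etree elements in Python 3.9+.
child = list(tree)[0]

parts = {
    'tag': tree.tag,            # name of the outer element
    'text': tree.text,          # text before the first child
    'child_tag': child.tag,     # name of the nested element
    'child_text': child.text,   # text inside the nested element
    'child_tail': child.tail,   # text after the nested element's closing tag
}
print(parts)
```

Note that the tail belongs to the child element, not to the parent, which is why a convention that only records each element's own text never sees it.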

But for now, what would you like the output to be? Let's use that as a starting point, perhaps?

mikessut commented 5 years ago

My use case is very similar. Probably the most desirable result would be to add a special JSON key (like '$') for the tail.

For my use case, I want to be able to recover the XML in its original form (for example, using the badgerfish.etree method).
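A rough sketch of what such a tail key could look like, written against the standard library only. The key name '$tail' and the helpers to_data and to_elem are hypothetical and not part of xmljson or the BadgerFish convention. To keep it short, the sketch ignores attributes, keeps all values as strings, and always stores children in lists.

```python
import xml.etree.ElementTree as ET

TAIL_KEY = '$tail'  # hypothetical key name, not defined by xmljson or BadgerFish


def to_data(elem):
    """Convert an element to a BadgerFish-like dict that also records tail text."""
    data = {}
    if elem.text:
        data['$'] = elem.text
    for child in elem:
        child_data = to_data(child)
        if child.tail:
            # The tail is stored on the child, mirroring ElementTree's model.
            child_data[TAIL_KEY] = child.tail
        data.setdefault(child.tag, []).append(child_data)
    return data


def to_elem(tag, data):
    """Rebuild an element from a dict produced by to_data."""
    elem = ET.Element(tag)
    elem.text = data.get('$')
    for key, children in data.items():
        if key in ('$', TAIL_KEY):
            continue
        for child_data in children:
            child = to_elem(key, child_data)
            child.tail = child_data.get(TAIL_KEY)
            elem.append(child)
    return elem


xml = ('<AbstractText>CO<sub>2</sub> to organic compounds from '
       'CO<sub>2</sub> and sunlight</AbstractText>')
root = ET.fromstring(xml)
data = to_data(root)
rebuilt = ET.tostring(to_elem(root.tag, data), encoding='unicode')
print(rebuilt == xml)
```

One limitation of this approach: children are grouped by tag, so if an element mixes different child tags, their relative order is not preserved and the XML cannot be recovered exactly.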

xgodon commented 5 years ago

I have the same problem. Maybe a list alternating text and objects could provide a good solution:

<text>this is a <b>nice</b>project</text>

text: ['this is a',{b: 'nice'},'project']
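The alternating list could be built roughly as follows. This is only a sketch with the standard library, not an xmljson convention; the helper name mixed is made up, the sample input is assumed from the example output, and attributes are ignored.

```python
import xml.etree.ElementTree as ET


def mixed(elem):
    """Return an element's content as a list alternating text and {tag: content} objects."""
    items = []
    if elem.text:
        items.append(elem.text)
    for child in elem:
        items.append({child.tag: mixed(child)})
        if child.tail:
            # Text following the child keeps its position in the list.
            items.append(child.tail)
    # Collapse text-only content to a plain string, as in {b: 'nice'}.
    if len(items) == 1 and isinstance(items[0], str):
        return items[0]
    return items


root = ET.fromstring('<text>this is a <b>nice</b>project</text>')
result = {root.tag: mixed(root)}
print(result)
```

Unlike a per-tag grouping, this keeps document order across different child tags. Whitespace is kept verbatim, so the first item here is 'this is a ' with its trailing space.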