srophe / bethqatraye-data

Data repository for bethqatraye

Need help with whitespace, indent, new lines in XML/Oxygen #53

Open davidamichelson opened 4 years ago

davidamichelson commented 4 years ago

@wsalesky I think the default settings in Oxygen are doing something odd to the spacing/indent in our files.

See for example: https://github.com/srophe/bethqatraye-data/blob/master/data/places/tei/143.xml#L138-L139

Why is this desc text node splitting into a new line like that?

Or why is the closing /desc on a new line here: https://github.com/srophe/bethqatraye-data/blob/master/data/places/tei/143.xml#L142-L143

Anyway, we would like to write some find and replace scripts using regex to clean this data up, but are having trouble because of the spacing. Any ideas on what is going on?

wsalesky commented 4 years ago

@davidamichelson @wlpotter is this a result of oXygen or a result of eXist export? It looks to me like the file was 'pretty-printed' in oXygen at some point. That shouldn't change meaningful whitespace.

davidamichelson commented 4 years ago

@wsalesky Thanks. We would like to use regex to find and correct all cases where a "." is missing at the end of either //desc or //desc/quote, but we can't seem to get around the new lines in Oxygen. Any ideas?
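
One way such a regex could tolerate the line breaks, as a rough sketch only: let `\s*` absorb any whitespace (including new lines) between the last text character and the closing tag, and put the whitespace back unchanged in the replacement. The pattern and the function name `add_missing_periods` are hypothetical, not something from this repository. It deliberately skips closing tags that directly follow another tag, so it would miss a desc that ends in an inline element such as `</ref>`, and it does not treat "?" or "!" as valid endings.

```python
import re

# Hypothetical pattern: a character that is not ".", ">" or whitespace,
# followed by optional whitespace (new lines included) and a closing
# </desc> or </quote>. Excluding ">" skips closers that follow another tag.
MISSING_PERIOD = re.compile(r"(?<=[^.>\s])(\s*)(</(?:desc|quote)>)")


def add_missing_periods(xml_text: str) -> str:
    """Insert "." after the last text character, keeping whitespace as-is."""
    return MISSING_PERIOD.sub(r".\1\2", xml_text)


if __name__ == "__main__":
    sample = "<desc>A small province\n      </desc>"
    print(add_missing_periods(sample))
```

Running this on a copy of the data and reviewing the diff before committing would be the cautious way to try it.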

davidamichelson commented 4 years ago

We think the problem was caused by Oxygen.

wsalesky commented 4 years ago

What is the regex you are trying?

davidamichelson commented 4 years ago

@wsalesky Let's save this for later.

When we do look at it, record 2994 is a good example:

               <desc type="abstract" xml:lang="en" xml:id="abstract2994-1">A region between <ref target="https://bqgazetteer.bethmardutho.org/place/2970"><placeName ref="http://syriaca.org/place/2970">al-Ray</placeName></ref> and <ref target="https://bqgazetteer.bethmardutho.org/place/2997"><placeName ref="http://syriaca.org/place/2997">Naysābūr</placeName></ref>, around <ref target="https://bqgazetteer.bethmardutho.org/place/2995"><placeName ref="http://syriaca.org/place/2995">al-Dāmaghān</placeName></ref>
               </desc>
               <desc xml:lang="en">
                        <quote source="#bib2994-4">[A] small province of
                  mediaeval Islamic Persia, lying to the south of the Alburz chain <choice>
                                <corr>watershed</corr>
                                <sic>watershd</sic>
                            </choice> and
                  extending into the northern fringes of the Das̲h̲t-i Kavīr.</quote>
                    </desc>
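
For records like the one above, where a desc ends in an inline element or the text is split across indented lines, a parser-based check may be easier than a regex. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree`. The helper names `normalized_text` and `missing_final_period` are hypothetical. It only reports candidates and does not edit the files. Because it collects all descendant text, both the `corr` and `sic` readings inside a `choice` show up in the reported string, which does not affect the end-of-text check.

```python
import xml.etree.ElementTree as ET


def normalized_text(elem):
    """Return all text inside elem with runs of whitespace collapsed."""
    return " ".join("".join(elem.itertext()).split())


def missing_final_period(xml_text, tags=("desc", "quote")):
    """Yield (tag, text) for matching elements whose text lacks a final "."."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Drop a namespace such as "{http://www.tei-c.org/ns/1.0}" if present.
        local = elem.tag.rsplit("}", 1)[-1]
        if local in tags:
            text = normalized_text(elem)
            if text and not text.endswith("."):
                yield local, text


if __name__ == "__main__":
    sample = (
        "<body><desc>A region around <ref>al-Ray</ref>\n   </desc>"
        "<desc>\n  <quote>lying to\n   the south.</quote>\n </desc></body>"
    )
    for tag, text in missing_final_period(sample):
        print(tag, "->", text)
```

Since the check runs on whitespace-normalized text, the indentation introduced by pretty-printing does not affect which elements are reported.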