Open davidamichelson opened 4 years ago
@davidamichelson @wlpotter is this a result of oXygen or of the eXist export? It looks to me like the file was 'pretty-printed' in oXygen at some point. That shouldn't change meaningful whitespace.
@wsalesky thanks. We would like to use regex to find and correct all cases where a "." is missing at the end of either //desc or //desc/quote, but we can't seem to get around the new lines in Oxygen. Any ideas?
We think the problem was caused by Oxygen.
What is the regex you are trying?
@wsalesky Let's save this for later.
When we do look at it, record 2294 is a good example
<desc type="abstract" xml:lang="en" xml:id="abstract2994-1">A region between <ref target="https://bqgazetteer.bethmardutho.org/place/2970"><placeName ref="http://syriaca.org/place/2970">al-Ray</placeName></ref> and <ref target="https://bqgazetteer.bethmardutho.org/place/2997"><placeName ref="http://syriaca.org/place/2997">Naysābūr</placeName></ref>, around <ref target="https://bqgazetteer.bethmardutho.org/place/2995"><placeName ref="http://syriaca.org/place/2995">al-Dāmaghān</placeName></ref>
</desc>
<desc xml:lang="en">
<quote source="#bib2994-4">[A] small province of
mediaeval Islamic Persia, lying to the south of the Alburz chain <choice>
<corr>watershed</corr>
<sic>watershd</sic>
</choice> and
extending into the northern fringes of the Das̲h̲t-i Kavīr.</quote>
</desc>
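One way to check a record like this without fighting the line breaks is to parse the XML and test the collapsed text rather than the raw lines. The following is only a minimal sketch in Python using the standard library; the function names `normalized_text` and `descs_missing_period` are made up for illustration, and it assumes Python 3.8+ (for the `{*}` namespace wildcard) and that collapsing all whitespace is acceptable for a punctuation check.

```python
import xml.etree.ElementTree as ET


def normalized_text(elem):
    """Return all text inside elem with runs of whitespace collapsed to one space."""
    return " ".join("".join(elem.itertext()).split())


def descs_missing_period(xml_string):
    """List the normalized text of every desc element that does not end in '.'.

    The {*} wildcard matches desc in any namespace (or none), so this works
    on namespaced TEI files as well as on bare fragments.
    """
    root = ET.fromstring(xml_string)
    missing = []
    for desc in root.findall(".//{*}desc"):
        text = normalized_text(desc)
        # Skip empty desc elements; report non-empty ones lacking a final period.
        if text and not text.endswith("."):
            missing.append(text)
    return missing
```

Because a desc that wraps a quote has the quote's text as its own final text, the same check covers both //desc and //desc/quote. Note that this only reports candidates; it does not rewrite the file, and it treats content inside choice (both corr and sic) as plain text.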
@wsalesky I think the default settings in Oxygen are doing something odd to the spacing/indentation in our files.
See for example: https://github.com/srophe/bethqatraye-data/blob/master/data/places/tei/143.xml#L138-L139
Why is this desc text node splitting onto a new line like that?
Or why is the closing </desc> on a new line here: https://github.com/srophe/bethqatraye-data/blob/master/data/places/tei/143.xml#L142-L143
Anyway, we would like to write some find and replace scripts using regex to clean this data up, but are having trouble because of the spacing. Any ideas on what is going on?
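For what it's worth, a regex can be made tolerant of the stray line breaks by allowing `\s*` (which also matches newlines) before the closing tag. Below is a rough sketch in Python rather than in Oxygen's find/replace; the names `QUOTE_END`, `DESC_END`, and `add_missing_periods` are hypothetical, and the patterns assume that a "." is the only acceptable final punctuation and that every desc is non-empty. It would be worth running this on a copy and reviewing the diff before committing anything.

```python
import re

# A quote that closes the desc: last non-space character before </quote> is not "."
QUOTE_END = re.compile(r"(?<![.\s])\s*(</quote>\s*</desc>)")

# A desc that ends in its own text or an inline element (not in a quote)
DESC_END = re.compile(r"(?<![.\s])(?<!</quote>)\s*(</desc>)")


def add_missing_periods(xml_text):
    """Insert a '.' before </quote></desc> or </desc> where one is missing.

    Whitespace (including newlines) directly before the closing tag is dropped,
    which also undoes the line break introduced by pretty-printing.
    """
    xml_text = QUOTE_END.sub(r".\1", xml_text)
    return DESC_END.sub(r".\1", xml_text)
```

Known gaps in this sketch: an empty `<desc></desc>` would receive a lone ".", a period sitting inside a final inline element (for example just before `</placeName>`) is not detected, and text ending in "?" or "!" would still get a "." appended.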