plazi / Plazi-Parser

GNU General Public License v2.0

Sofia Outcome To Deliver (TaxPub Parser Adaptation) #1

Open pdatascience opened 7 years ago

pdatascience commented 7 years ago

@teodorgeorgiev

Changes Needed to Convert Plazi-Returned XML into Validating NLM XML

Add DTD

Insert the following line before <article ...>:

<!DOCTYPE article PUBLIC "-//TaxonX//DTD Taxonomic Treatment Publishing DTD v0 20100105//EN" "../../nlm/tax-treatment-NS0.dtd">
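
A minimal sketch of this step in Python, assuming the Plazi output is available as a string. The helper name `add_doctype` is hypothetical and not part of any existing parser code; the DOCTYPE text is copied from the line above.

```python
import re

# DOCTYPE line exactly as given in this issue.
DOCTYPE = (
    '<!DOCTYPE article PUBLIC '
    '"-//TaxonX//DTD Taxonomic Treatment Publishing DTD v0 20100105//EN" '
    '"../../nlm/tax-treatment-NS0.dtd">'
)

def add_doctype(xml_text: str) -> str:
    """Insert the DOCTYPE before the first <article ...> tag (hypothetical helper)."""
    if "<!DOCTYPE" in xml_text:
        return xml_text  # a DOCTYPE is already present; leave the text untouched
    # The lookahead keeps tags such as <article-meta> from matching.
    return re.sub(
        r"<article(?=[\s>])",
        lambda m: DOCTYPE + "\n" + m.group(0),
        xml_text,
        count=1,
    )
```

The function is idempotent, so running it twice on the same document does not add a second DOCTYPE.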

materialsCitation is not a valid tag

Instead, use tp:material-citation and remove all attributes.

OR

Use <named-content content-type='material citation'> and remove all other attributes.
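
The first option (renaming to tp:material-citation) could be sketched as a text-level rewrite like the one below. This is an illustration only: the function name is hypothetical, the regex assumes attribute values never contain ">", and the tp prefix must already be declared on the root element for the result to parse.

```python
import re

# Opening (or self-closing) materialsCitation tag with any attributes.
OPEN_TAG = re.compile(r"<materialsCitation(?=[\s/>])[^>]*?(/?)>")

def rename_materials_citation(xml_text: str) -> str:
    """Rename materialsCitation to tp:material-citation and drop all attributes."""
    # \1 keeps the "/" of a self-closing tag, if there is one.
    renamed = OPEN_TAG.sub(r"<tp:material-citation\1>", xml_text)
    return renamed.replace("</materialsCitation>", "</tp:material-citation>")
```

Working on the raw text avoids having to resolve the tp namespace, at the cost of being less robust than a real XML transformation.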

named-content tags have only one attribute: content-type

Delete superfluous attributes from named-content tags.

E.g.

<named-content content-type="dwc:country" name="Brazil">BRAZIL</named-content>

becomes

<named-content content-type="dwc:country">BRAZIL</named-content>

Similarly,

<named-content content-type="dwc:locality" country="Brazil" name="Tijuca Forest" stateProvince="Rio de Janeiro">Tijuca Forest</named-content>

becomes

<named-content content-type="dwc:locality">Tijuca Forest</named-content>
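
The attribute clean-up shown in these two examples can be sketched with Python's standard `xml.etree.ElementTree`. The function name is hypothetical, and the sketch assumes named-content is not in a namespace.

```python
import xml.etree.ElementTree as ET

def strip_named_content_attrs(root: ET.Element) -> None:
    """Keep only content-type on every named-content element (modifies root in place)."""
    for el in root.iter("named-content"):
        content_type = el.get("content-type")
        el.attrib.clear()
        if content_type is not None:
            el.set("content-type", content_type)

snippet = (
    '<p><named-content content-type="dwc:locality" country="Brazil" '
    'name="Tijuca Forest" stateProvince="Rio de Janeiro">Tijuca Forest'
    '</named-content></p>'
)
root = ET.fromstring(snippet)
strip_named_content_attrs(root)
cleaned = ET.tostring(root, encoding="unicode")
# cleaned == '<p><named-content content-type="dwc:locality">Tijuca Forest</named-content></p>'
```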

date should not be used inside a material citation or outside

  1. named-content cannot be a child of date.
  2. The inverse also cannot hold.
  3. It is not possible to trick them via an intermediate tag such as p.

Therefore,

<date value="1951-10-31">
<named-content content-type="dwc:verbatimEventDate" value="1951-10-31">31.x.1951</named-content>
</date>

becomes

<named-content content-type="dwc:verbatimEventDate">31.x.1951</named-content>

date is not a valid child of almost any useful element, so it will fail validation almost everywhere. Therefore, it should not be used except as a child of

element-citation history mixed-citation product related-article related-object tp:collecting-event

In that case it must follow the pattern (((day?,month?)|season)?,year?,string-date?). This means that no punctuation ought to exist between the tokens!

Therefore, in the text

<date value="2014-10-30">30.x.2014</date>

becomes

30.x.2014
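
Both date cases above (date wrapping a named-content, and date in running text) amount to replacing the date element with its own content unless its parent is one of the listed elements. A rough sketch with `xml.etree.ElementTree` follows; the function names are hypothetical, and parents are compared by local name so the tp prefix does not need to be resolved.

```python
import xml.etree.ElementTree as ET

# Parents under which this issue says <date> may stay (local names only).
ALLOWED_DATE_PARENTS = {
    "element-citation", "history", "mixed-citation", "product",
    "related-article", "related-object", "collecting-event",
}

def local_name(tag: str) -> str:
    """Drop a leading '{namespace}' from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def _append_text(parent: ET.Element, index: int, text: str) -> None:
    """Attach text just before position `index` among the children of `parent`."""
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text

def unwrap(parent: ET.Element, child: ET.Element) -> None:
    """Replace `child` by its text and children, keeping surrounding text intact."""
    index = list(parent).index(child)
    grandchildren = list(child)
    if child.text:
        _append_text(parent, index, child.text)
    if child.tail:
        if grandchildren:
            last = grandchildren[-1]
            last.tail = (last.tail or "") + child.tail
        else:
            _append_text(parent, index, child.tail)
    parent.remove(child)
    for offset, grandchild in enumerate(grandchildren):
        parent.insert(index + offset, grandchild)

def unwrap_disallowed_dates(root: ET.Element) -> None:
    """Unwrap every <date> whose parent is not in ALLOWED_DATE_PARENTS."""
    parents = {c: p for p in root.iter() for c in p}
    for date in list(root.iter("date")):
        parent = parents.get(date)
        if parent is None or local_name(parent.tag) in ALLOWED_DATE_PARENTS:
            continue
        for grandchild in date:
            parents[grandchild] = parent  # keep the parent map current
        unwrap(parent, date)
```

The value attribute left on named-content is not handled here; it would be removed by the general attribute clean-up for named-content.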

Formatting tags such as italic and bold must be removed from material citations

Remove those tags if you use <tp:material-citation>. You can keep them if you use <named-content content-type='material citation'>
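
For the tp:material-citation case, a text-level sketch such as the following would drop italic and bold tags only inside material citations and leave them alone elsewhere. The function name is hypothetical, and the regexes assume that material citations are not nested and that attribute values contain no ">".

```python
import re

# One whole material citation, from opening to closing tag (non-greedy).
MATERIAL_CITATION = re.compile(
    r"<tp:material-citation(?=[\s>])[^>]*>.*?</tp:material-citation>",
    re.DOTALL,
)
# Opening, closing, or self-closing italic/bold tags.
FORMATTING_TAG = re.compile(r"</?(?:italic|bold)(?=[\s/>])[^>]*>")

def strip_formatting_in_citations(xml_text: str) -> str:
    """Remove italic/bold tags inside tp:material-citation, keeping their text."""
    return MATERIAL_CITATION.sub(
        lambda m: FORMATTING_TAG.sub("", m.group(0)),
        xml_text,
    )
```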

quantity does not exist

Remove quantity. E.g.

<quantity metricMagnitude="2" metricUnit="m" metricValue="9.0" unit="m" value="900.0">900 m</quantity>

becomes

900 m

history does not exist

Remove history.

content-type='institution' should be dwc:institutionCode

As per the DwC standard, dwc:institutionCode is "The name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record."

So

<named-content content-type="institution">Museu Nacional da Universidade Federal do Rio de Janeiro</named-content>

should be

<named-content content-type="dwc:institutionCode">Museu Nacional da Universidade Federal do Rio de Janeiro</named-content>

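Similarly, dwc:institutional_code should not exist and

<named-content content-type="dwc:institutional_code">MNRJ</named-content>

should be

<named-content content-type="dwc:institutionCode">MNRJ</named-content>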
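
The two renames in this section can be expressed as a small lookup table. In the sketch below the function name is hypothetical, and mapping dwc:institutional_code to dwc:institutionCode is my reading of the intended fix rather than something confirmed elsewhere.

```python
import xml.etree.ElementTree as ET

# Old content-type value -> value expected by the DwC-based markup.
# The second entry is an assumption based on the text of this issue.
CONTENT_TYPE_FIXES = {
    "institution": "dwc:institutionCode",
    "dwc:institutional_code": "dwc:institutionCode",
}

def fix_content_types(root: ET.Element) -> None:
    """Rewrite known-bad content-type values on named-content (in place)."""
    for el in root.iter("named-content"):
        old = el.get("content-type")
        if old in CONTENT_TYPE_FIXES:
            el.set("content-type", CONTENT_TYPE_FIXES[old])
```

Further renames can be added to the table without touching the function.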

Page breaks are foobared

Replace

&lt;!--PageBreak--&gt;

with

<!--PageBreak-->
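
Since only this one marker is affected, a plain string replacement is probably enough. A sketch, with a hypothetical function name:

```python
ESCAPED = "&lt;!--PageBreak--&gt;"
COMMENT = "<!--PageBreak-->"

def restore_page_breaks(xml_text: str) -> str:
    """Turn escaped page-break markers back into real XML comments."""
    # Only the exact marker is replaced; other escaped text stays escaped.
    return xml_text.replace(ESCAPED, COMMENT)
```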

pdatascience commented 7 years ago

@myrmoteras how do I ping Guido on this issue? Can you please assign it to Guido? I am having problems doing this (probably missing rights).

myrmoteras commented 7 years ago

@gsautter can you look at this?

myrmoteras commented 7 years ago

@pdatascience can you try to have one issue in one issue, which allows to close one after the other? Appreciated

myrmoteras commented 7 years ago

@pdatascience can you now assign an issue to Guido?

pdatascience commented 7 years ago

@pdatascience can you try to have one issue in one issue, which allows to close one after the other? Appreciated

@myrmoteras don't get this :(

pdatascience commented 7 years ago

This is what Guido requested: all the changes needed (they need to be taken as a package, as they depend on each other) to fully integrate the new endpoint that Guido created in Sofia with TaxPub.

myrmoteras commented 7 years ago

fine - it looks like many individual tasks to be solved. Between Guido and You

pdatascience commented 7 years ago

It's all one task: adapt the endpoint to TaxPub. There is little point in splitting it into little tasks; I think Guido will understand.

gsautter commented 7 years ago

I do understand ;-)

And if there are multiple adjustments to be made to schema mapping, I'm perfectly happy to have that in one task.

The one thing I hate happening is a task re-opened with a pretty much new issue after the initial issue was resolved. That's because then you cannot close the task after resolving the original issue. Plus, such multi-issue tasks tend to grow pretty lengthy and become hard to overview. Plus, where would you put feedback for the solution of the original issue (in case there are problems) if the ticket has veered off to other issues?

pdatascience commented 7 years ago

@gsautter @myrmoteras I am happy to split up the task any way you like: these are my findings of what is needed to convert the output of the taxpub tagger as of now to make it validate :) cheers, vic