libero / publisher

The starting point for raising issues for Libero Publisher

Provide sample/test data for Research Categories #256

Closed fred-atherden closed 5 years ago

fred-atherden commented 5 years ago

Provide JATS XML which can be used as test data for the jats-ingester with respect to scholarly-content-detail.

fred-atherden commented 5 years ago

I have edited versions of the IJM and eLife XML currently on the demo (edited because the demo files use a different convention, which will change going forward).

Hindawi content can continue to be used as it currently is (3914828 and 7292974).

I also have an edited bioRxiv sample, and can provide more if needed.

@GiancarloFusiello to provide details on whether that's enough and where the content should be stored.

GiancarloFusiello commented 5 years ago

@FAtherden-eLife Like we did for retrieving the article id, it would be good to have a series of strategies and a series of test cases that cover known/supported XML formatting. This is how we did it for article ids: https://github.com/libero/jats-ingester/blob/master/tests/test_xml_jats.py#L47

So to summarise, I need a list of XPaths to retrieve the data and a list of minimal XML examples I can use to test these cases. Thanks.

fred-atherden commented 5 years ago

@GiancarloFusiello, there's just one XPath: //*:article-categories/*:subj-group[not(@subj-group-type="heading")]/*:subject[1]
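
The `*:` prefix is XPath 2.0 namespace-wildcard syntax, which XPath 1.0 engines do not accept. As a rough sketch of the same selection using only Python's standard library, the snippet below walks the tree by hand instead. The function name `research_categories` is hypothetical (it is not taken from jats-ingester), and the sketch assumes the JATS elements carry no namespace.

```python
import xml.etree.ElementTree as ET


def research_categories(xml_text):
    """Return the text of the first <subject> in each non-heading <subj-group>
    that is a direct child of <article-categories>.

    Hypothetical helper; assumes un-namespaced JATS elements.
    """
    root = ET.fromstring(xml_text)
    categories = []
    for group in root.findall(".//article-categories/subj-group"):
        # Mirrors the [not(@subj-group-type="heading")] predicate.
        if group.get("subj-group-type") == "heading":
            continue
        # find() returns the first direct <subject> child, like subject[1].
        subject = group.find("subject")
        if subject is not None:
            categories.append("".join(subject.itertext()).strip())
    return categories


sample = """
<article><front><article-meta><article-categories>
  <subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group>
  <subj-group subj-group-type="subjects">
    <subject>Cancer Biology</subject>
    <subject>General Economics</subject>
  </subj-group>
</article-categories></article-meta></front></article>
"""
print(research_categories(sample))  # ['Cancer Biology']
```

Only the first subject of the group is returned, so 'General Economics' is skipped here.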

Here is some sample content (let me know if you need more):

Expect 'Cancer Biology'

<article>
    <front>
        <article-meta>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Research Article</subject>
                </subj-group>
                <subj-group subj-group-type="subjects">
                    <subject>Cancer Biology</subject>
                </subj-group>
            </article-categories>
        </article-meta>
    </front>
</article>

Expect 'Cancer Biology'

<article>
    <front>
        <article-meta>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Research Article</subject>
                </subj-group>
                <subj-group subj-group-type="subjects">
                    <subject>Cancer Biology</subject>
                    <subject>General Economics</subject>
                </subj-group>
            </article-categories>
        </article-meta>
    </front>
</article>

Expect 'General Economics'

<article>
    <front>
        <article-meta>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Research Article</subject>
                </subj-group>
                <subj-group>
                    <subject>General Economics</subject>
                    <subject>Cancer Biology</subject>
                </subj-group>
            </article-categories>
        </article-meta>
    </front>
</article>

Expect 'Cancer Biology' and 'General Economics'

<article>
    <front>
        <article-meta>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Research Article</subject>
                </subj-group>
                <subj-group subj-group-type="subjects">
                    <subject>Cancer Biology</subject>
                </subj-group>
                <subj-group subj-group-type="subjects">
                    <subject>General Economics</subject>
                </subj-group>
            </article-categories>
        </article-meta>
    </front>
</article>

Expect 'Cancer Biology' and 'General Economics'

<article>
    <front>
        <article-meta>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Research Article</subject>
                </subj-group>
                <subj-group subj-group-type="subjects">
                    <subject>Cancer Biology</subject>
                </subj-group>
                <subj-group>
                    <subject>General Economics</subject>
                </subj-group>
            </article-categories>
        </article-meta>
    </front>
</article>

Expect nothing

<article>
    <front>
        <article-meta>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Research Article</subject>
                </subj-group>
            </article-categories>
        </article-meta>
    </front>
</article>
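
The six cases above could be collected into a table-driven check along these lines. This is only a sketch: `make_article` and `first_subjects` are hypothetical helpers written for illustration, not functions from jats-ingester, and the documents are rebuilt in code rather than loaded from files.

```python
import xml.etree.ElementTree as ET


def make_article(groups):
    """Build a minimal article from (subj-group-type or None, [subjects]) pairs."""
    article = ET.Element("article")
    meta = ET.SubElement(ET.SubElement(article, "front"), "article-meta")
    categories = ET.SubElement(meta, "article-categories")
    for group_type, subjects in groups:
        attrib = {"subj-group-type": group_type} if group_type else {}
        group = ET.SubElement(categories, "subj-group", attrib)
        for name in subjects:
            ET.SubElement(group, "subject").text = name
    return article


def first_subjects(article):
    """Hypothetical extractor: first subject of each non-heading top-level group."""
    found = []
    for group in article.findall(".//article-categories/subj-group"):
        if group.get("subj-group-type") != "heading":
            subject = group.find("subject")
            if subject is not None:
                found.append("".join(subject.itertext()))
    return found


HEADING = ("heading", ["Research Article"])
# Each entry pairs the subj-groups of one sample with its expected categories.
CASES = [
    ([HEADING, ("subjects", ["Cancer Biology"])], ["Cancer Biology"]),
    ([HEADING, ("subjects", ["Cancer Biology", "General Economics"])], ["Cancer Biology"]),
    ([HEADING, (None, ["General Economics", "Cancer Biology"])], ["General Economics"]),
    ([HEADING, ("subjects", ["Cancer Biology"]), ("subjects", ["General Economics"])],
     ["Cancer Biology", "General Economics"]),
    ([HEADING, ("subjects", ["Cancer Biology"]), (None, ["General Economics"])],
     ["Cancer Biology", "General Economics"]),
    ([HEADING], []),
]

for groups, expected in CASES:
    assert first_subjects(make_article(groups)) == expected
```

Keeping the expectations next to the inputs makes it easy to append further cases as new XML conventions turn up.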

fred-atherden commented 5 years ago

Output of #242 will determine what ids should be generated for each of these. I can update here if needed.

fred-atherden commented 5 years ago

Adding one more which includes more than 2 subjects (we're expecting 0 to N)

Expect 'Cancer Biology', 'Data', 'Housing', and 'General Economics'

<article>
    <front>
        <article-meta>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Research Article</subject>
                </subj-group>
                <subj-group subj-group-type="subjects">
                    <subject>Cancer Biology</subject>
                </subj-group>
                <subj-group>
                    <subject>Data</subject>
                </subj-group>
                <subj-group subj-group-type="subjects">
                    <subject>Housing</subject>
                </subj-group>
                <subj-group>
                    <subject>General Economics</subject>
                </subj-group>
            </article-categories>
        </article-meta>
    </front>
</article>

fred-atherden commented 5 years ago

More complex test case:

Expect ['Cancer Biology', 'Data', 'Ecology', 'α^']

<article xmlns:mml="http://www.w3.org/1998/Math/MathML">
    <front>
        <article-meta>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Research Article</subject>
                </subj-group>
                <subj-group subj-group-type="subjects">
                    <subject><italic>Cancer Biology</italic></subject>
                </subj-group>
                <subj-group>
                    <subject>Data</subject>
                    <subj-group subj-group-type="subjects">
                        <subject>Housing</subject>
                        <subj-group>
                            <subject>General Economics</subject>
                        </subj-group>
                    </subj-group>
                </subj-group>
                <subj-group subj-group-type="level-1">
                    <subject>Ecology</subject>
                    <subject>Genetics and Genomics</subject>
                    <subj-group subj-group-type="level-2">
                        <subject>Evolutionary Biology</subject>
                        <subj-group subj-group-type="level-3">
                            <subject>General Economics</subject>
                            <subject><bold>Plant Biology</bold></subject>
                        </subj-group>
                    </subj-group>
                </subj-group>
                <subj-group subj-group-type="any-value">
                    <subject><mml:math id="i1" display="inline"><mml:mover accent="true"><mml:mi>α</mml:mi><mml:mo>^</mml:mo></mml:mover></mml:math></subject>
                    <subj-group subj-group-type="nested-sub">
                        <subject><mml:math id="i2" display="inline"><mml:mover accent="true"><mml:mi>β</mml:mi><mml:mo>^</mml:mo></mml:mover></mml:math></subject>
                    </subj-group>
                </subj-group>
            </article-categories>
        </article-meta>
    </front>
</article>
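
Two things make this case harder: nested subj-groups must be ignored, and subjects may contain inline markup or MathML. The sketch below runs a trimmed version of the sample through Python's standard-library ElementTree to show one way both could be handled. It assumes that flattening a subject to its concatenated text is the desired behaviour, which is what the expected 'α^' suggests.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the sample above (the 'level-1' group is omitted for brevity).
sample = """
<article xmlns:mml="http://www.w3.org/1998/Math/MathML"><front><article-meta>
<article-categories>
  <subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group>
  <subj-group subj-group-type="subjects">
    <subject><italic>Cancer Biology</italic></subject>
  </subj-group>
  <subj-group>
    <subject>Data</subject>
    <subj-group subj-group-type="subjects"><subject>Housing</subject></subj-group>
  </subj-group>
  <subj-group subj-group-type="any-value">
    <subject><mml:math id="i1" display="inline"><mml:mover accent="true"><mml:mi>α</mml:mi><mml:mo>^</mml:mo></mml:mover></mml:math></subject>
  </subj-group>
</article-categories>
</article-meta></front></article>
"""

root = ET.fromstring(sample)
labels = []
# Only subj-groups that are direct children of article-categories are visited,
# so nested groups such as the one holding 'Housing' are never reached.
for group in root.findall(".//article-categories/subj-group"):
    if group.get("subj-group-type") == "heading":
        continue
    subject = group.find("subject")
    if subject is not None:
        # itertext() concatenates text from child elements (italic, mml:mi, mml:mo).
        labels.append("".join(subject.itertext()))

print(labels)  # ['Cancer Biology', 'Data', 'α^']
```

The MathML elements are parsed under their namespace URI, but since only their text is read, no namespace handling is needed for this step.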